LiDAR & Depth Cameras for Robots: The Ultimate Guide
A working engineer's guide to 3D perception for robots: LiDAR ranging and architectures (mechanical, MEMS, flash, FMCW), stereo vs structured light vs ToF depth cameras, the numbers that matter, point clouds, SLAM, and how to pick a sensor.
A camera tells you where something is in the image. A 3D sensor tells you where something is in the world. That difference — pixels to metres — is the entire reason a robot can drive through a doorway it has never seen, pick a part out of a bin, or stop before it amputates someone's foot. Take the depth away and you are back to a system that recognizes a coffee mug beautifully and then drives its gripper straight through the table.
This guide is about the two sensor families that give robots that metric, three-dimensional view of the world: LiDAR and depth cameras. We will go through how the ranging actually works (time-of-flight, triangulation, frequency-modulated continuous wave), why a 905 nm laser and a 1550 nm laser are not interchangeable, how mechanical spinners differ from MEMS and flash and FMCW, when a stereo pair beats a structured-light projector, and what every one of those technologies does the moment you take it outside into direct sun. Then we get concrete about real hardware — Ouster, Livox, Hesai, Slamtec, Intel RealSense, Stereolabs ZED, Microsoft/Orbbec Femto, Luxonis OAK, Basler — and how to choose.
The take: there is no "best" 3D sensor, there is only the sensor matched to your range, your lighting, your accuracy budget, and your compute budget — and most failed perception stacks are a sensor-choice mistake made before a single line of code was written. Indoors at 0.5–6 m you almost always want a depth camera; outdoors past 10 m in sun you almost always want LiDAR; the interesting engineering is in the overlap, and in fusing the two so each covers the other's blind spots.
Companion reading: robot sensors, machine vision, mobile robots (AMR/AGV), and ROS 2.
Table of contents
- Key takeaways
- Why robots need 3D perception
- LiDAR fundamentals: how ranging actually works
- LiDAR architectures: spinning, MEMS, flash, FMCW
- Depth-camera technologies head-to-head
- Stereo vision deep-dive
- Structured light
- Time-of-flight cameras
- The numbers that matter
- Point clouds and data
- Where each sensor fits
- SLAM and sensor fusion
- Selecting a 3D sensor
- Frequently asked questions
Why robots need 3D perception
A robot has to answer three questions before it does anything physical: Where am I? What is around me? Where exactly is the thing I want to touch? All three are geometry questions, and geometry needs depth.
Localization and navigation need depth because a planner reasons in metres, not pixels. An obstacle two pixels tall could be a speck of dust on the lens or a forklift 30 m away; only range disambiguates. Manipulation needs depth because a grasp pose is a 6-DoF transform in the robot's frame — you cannot servo a gripper to a 2D bounding box. And safety needs depth because the entire concept of a "protective stop at 0.8 m" is meaningless without a metric distance.
Rule of thumb: if a downstream module reasons in metres — planning, grasping, collision checking, safety zones — it needs a sensor that measures metres, not one that infers them from appearance.
The exteroception family
3D sensing is one branch of a robot's exteroception — its sensing of the external world. The full family includes contact and force sensors, proximity sensors, 2D cameras, radar, sonar, and the 3D sensors covered here. The robot sensors guide lays out the whole taxonomy; this article zooms into the depth-producing members.
The reason 3D sensors get their own deep treatment is that depth is uniquely hard and uniquely valuable. A 2D camera is a passive, cheap, dense, high-resolution sensor — and it throws away the one dimension a robot's body lives in. Recovering that dimension is what LiDAR and depth cameras exist to do, and they do it by physically different tricks, each with a different failure mode.
Active vs passive sensing
The deepest split is active versus passive. A passive sensor (a plain camera, a stereo pair with no projector) only collects ambient light. An active sensor (LiDAR, structured light, ToF, active stereo) emits its own light and measures what comes back.
Passive sensing is cheap, silent on the spectrum, and works at any range the optics allow — but it fails where the scene gives it nothing to work with (a blank white wall, a dark room). Active sensing carries its own illumination, so it works in the dark and on featureless surfaces — but it costs power, can interfere with copies of itself, and fights a losing battle against the sun outdoors. Almost every trade-off in this guide is a consequence of that one split.
LiDAR fundamentals: how ranging actually works
LiDAR — Light Detection and Ranging — measures distance by timing or phase-tracking light it emits. Strip away the spinning and the optics and a LiDAR is a laser, a photodetector, and a very fast clock.
Direct time-of-flight (dToF)
The textbook method. Fire a short laser pulse, start a timer, wait for the reflection, stop the timer. Distance is half the round trip:
Range: R = (c · t) / 2
c = speed of light ≈ 2.998 × 10⁸ m/s
t = round-trip time of flight
Example: a target at 100 m
round-trip distance = 200 m
t = 200 / 2.998e8 ≈ 667 ns
Timing resolution needed for 1 cm range resolution:
Δt = 2 · ΔR / c = 2 · 0.01 / 2.998e8 ≈ 67 ps
That 67 ps figure is the whole engineering challenge of dToF: to resolve centimetres you need picosecond-class timing electronics, typically a time-to-digital converter (TDC) and avalanche photodiodes (APDs) or single-photon avalanche diodes (SPADs). It is also why LiDAR is fundamentally an interval measurement, not an intensity one — it does not care how bright the return is, only when it arrives, which is why it is far more robust to surface reflectivity than a camera.
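The arithmetic above is easy to script when you are sizing timing electronics. This is a minimal sketch; the function names are illustrative and not taken from any LiDAR SDK.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def dtof_range_m(round_trip_s: float) -> float:
    """Range from a measured round-trip time: R = c * t / 2."""
    return C * round_trip_s / 2.0


def timing_resolution_s(range_resolution_m: float) -> float:
    """Timer resolution needed for a given range step: dt = 2 * dR / c."""
    return 2.0 * range_resolution_m / C
```

Calling `dtof_range_m(667e-9)` reproduces the 100 m example, and `timing_resolution_s(0.01)` gives the picosecond-class figure quoted for 1 cm resolution.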
Amplitude-modulated continuous wave (AMCW / phase)
Cheaper short-range LiDARs and most iToF cameras instead modulate the laser amplitude as a continuous sine wave and measure the phase shift between emitted and received light. Phase wraps every half wavelength of the modulation, which sets an unambiguous range:
Phase ToF: R = (c / (4π·f_mod)) · φ
f_mod = modulation frequency
φ = measured phase shift (radians)
Unambiguous range: R_max = c / (2 · f_mod)
f_mod = 20 MHz → R_max = 7.5 m
f_mod = 100 MHz → R_max = 1.5 m
Higher modulation frequency buys precision but shrinks the unambiguous range. Beyond R_max the phase wraps: at 20 MHz (R_max = 7.5 m), a 9 m target reads as 1.5 m. Multi-frequency schemes (combining, say, 20 MHz and 80 MHz) recover a longer unambiguous range while keeping precision.
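The wrap-around and a simple two-frequency fix can be sketched as follows. This is an illustration under stated assumptions, not any vendor's algorithm: the function names are hypothetical, the phases are noise-free, and this particular unwrapping only extends the fine measurement out to the unambiguous range of the lower frequency. Production schemes can reach further.

```python
import math

C = 299_792_458.0  # speed of light, m/s


def unambiguous_range_m(f_mod_hz: float) -> float:
    """R_max = c / (2 * f_mod)."""
    return C / (2.0 * f_mod_hz)


def measured_phase_rad(true_range_m: float, f_mod_hz: float) -> float:
    """Phase an ideal sensor would observe; it wraps modulo 2*pi."""
    return (4.0 * math.pi * f_mod_hz * true_range_m / C) % (2.0 * math.pi)


def phase_to_range_m(phase_rad: float, f_mod_hz: float) -> float:
    """R = (c / (4*pi*f_mod)) * phi, valid only inside R_max."""
    return C / (4.0 * math.pi * f_mod_hz) * phase_rad


def unwrap_dual_frequency(phase_lo: float, f_lo: float,
                          phase_hi: float, f_hi: float) -> float:
    """Use the coarse low-frequency range to pick the wrap count
    of the more precise high-frequency range."""
    coarse = phase_to_range_m(phase_lo, f_lo)
    fine = phase_to_range_m(phase_hi, f_hi)
    r_max_hi = unambiguous_range_m(f_hi)
    wraps = round((coarse - fine) / r_max_hi)
    return fine + wraps * r_max_hi
```

Feeding a 9 m target through `measured_phase_rad` and `phase_to_range_m` at 20 MHz shows the aliasing directly. Feeding a 6 m target through both 20 MHz and 80 MHz and then `unwrap_dual_frequency` recovers the true range with the high-frequency measurement.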
FMCW: frequency-modulated continuous wave
The newest production approach. Instead of pulses, FMCW sweeps the laser frequency (a chirp) and mixes the return with the outgoing light. The beat frequency encodes range, and any Doppler shift encodes radial velocity — you get per-point speed for free. FMCW is coherent detection, so it is nearly immune to sunlight and to other LiDARs (only light correlated with its own chirp produces a beat). More on this below; it is the headline architecture of Aeva and the long-range automotive players.
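To make "beat frequency encodes range, Doppler encodes velocity" concrete, here is a sketch using a triangular chirp (an up-sweep followed by a down-sweep), a common textbook scheme. The source does not say which chirp shape any given product uses, so treat that as an assumption. The sign convention (positive velocity means the target is approaching, which lowers the up-chirp beat) and all parameter values are also assumptions, and the model requires the range beat to be larger than the Doppler shift.

```python
C = 299_792_458.0  # speed of light, m/s


def beat_frequencies_hz(range_m, radial_velocity_mps,
                        bandwidth_hz, chirp_s, wavelength_m):
    """Forward model: beat frequencies on the up- and down-chirp."""
    slope = bandwidth_hz / chirp_s                  # chirp slope, Hz per second
    f_range = 2.0 * range_m * slope / C             # beat due to round-trip delay
    f_doppler = 2.0 * radial_velocity_mps / wavelength_m
    return f_range - f_doppler, f_range + f_doppler


def range_and_velocity(f_up_hz, f_down_hz,
                       bandwidth_hz, chirp_s, wavelength_m):
    """Inverse: separate range and Doppler from the two beat frequencies."""
    slope = bandwidth_hz / chirp_s
    f_range = 0.5 * (f_up_hz + f_down_hz)
    f_doppler = 0.5 * (f_down_hz - f_up_hz)
    return C * f_range / (2.0 * slope), wavelength_m * f_doppler / 2.0
```

The useful property is that the sum of the two beats depends only on range and the difference only on radial velocity, so a single chirp pair yields both per point.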
The laser and the detector
The emitter is a laser diode — edge-emitting or, increasingly, a VCSEL (vertical-cavity surface-emitting laser) array for flash and solid-state units. The detector is a photodiode: a PIN diode for cheap close range, an APD for sensitivity, or a SPAD array for photon-counting dToF (the technology behind Ouster's digital LiDAR and many automotive flash units).
Rule of thumb: SPAD/CMOS digital LiDAR trades the analog finesse of a tuned APD for the scaling, calibration stability, and cost curve of a semiconductor process. That bet is why Ouster's per-channel cost fell while channel counts climbed.
905 nm vs 1550 nm and eye safety
Two wavelengths dominate, and the choice cascades through the whole sensor.
905 nm sits at the edge of silicon's sensitivity, so it uses cheap silicon APDs/SPADs — the same process economics as camera sensors. The catch is eye safety: 905 nm passes through the eye and focuses on the retina, so Class 1 eye-safe limits cap the optical power, which caps range, especially against low-reflectivity targets in bright sun.
1550 nm is strongly absorbed by water in the cornea and never reaches the retina, so eye-safe limits allow far higher optical power — roughly two orders of magnitude more — translating to longer range and better sun robustness. The price: 1550 nm is invisible to silicon, so you need InGaAs detectors and fibre-laser or specialized diode sources, which are expensive. This is the classic automotive long-range trade: 1550 nm for the 200 m+ highway sensor, 905 nm for everything cost-sensitive.
Rule of thumb: 905 nm is the cost-and-volume wavelength; 1550 nm is the range-and-sun wavelength. If your spec sheet brags about 250 m at 10% reflectivity, it is almost certainly 1550 nm.
Beam, divergence, and the "one point is a cone" problem
A laser beam is not a line; it is a cone with some divergence (often 1–5 mrad). At range, that cone has real width — at 1 mrad, a beam is ~10 cm wide at 100 m. This sets your effective lateral resolution and means a single "point" is actually the centroid of whatever the beam footprint hit. It also produces edge artefacts: a beam straddling a near and a far object returns two echoes, which is why multi-return LiDAR (reporting the strongest, last, or several returns) matters for foliage, rain, and dust.
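The footprint figure is a small-angle calculation. A minimal sketch, assuming full-angle divergence in milliradians and an optional exit-aperture diameter (a hypothetical parameter, zero by default):

```python
def footprint_diameter_m(range_m: float, divergence_mrad: float,
                         exit_aperture_m: float = 0.0) -> float:
    """Approximate beam diameter at range: aperture + range * divergence."""
    return exit_aperture_m + range_m * divergence_mrad * 1e-3
```

Comparing this number with the smallest object you need to detect at your working range tells you whether a return comes from that object alone or from a blend of it and the background.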
LiDAR architectures: spinning, MEMS, flash, FMCW
Having one laser-and-detector pair only measures one direction. To build a 2D or 3D picture you must steer that beam (or many beams) across the scene. How you steer it is the architecture, and it dictates field of view, durability, cost, and resolution.
Mechanical spinning
The original and still the workhorse. A stack of laser/detector pairs (the "channels" or "lines") rotates 360° on a motor — 10–20 Hz typically. Velodyne pioneered it; Ouster, Hesai, and RoboSense ship modern versions. You get a full 360° horizontal field of view and a vertical FoV set by the channel count and spacing (e.g. 32 or 64 or 128 lines spanning ~22–45° vertical).
Strengths: full surround coverage, mature, well-understood point clouds. Weaknesses: a spinning motor is a wear item and a vibration source; the units are tall pucks; and per-unit cost historically ran into the thousands. The big shift of the last few years is digital spinning LiDAR (Ouster's SPAD-on-CMOS), which keeps the spin but replaces racks of analog channels with a semiconductor sensor — cheaper, more uniform, easier to calibrate.
Solid-state and MEMS
To kill the big spinning motor, MEMS LiDAR steers the beam with a tiny micro-mirror that tilts on silicon hinges. There is still a moving part, but it is microscopic and sealed. The trade is field of view: a MEMS mirror sweeps a forward cone (often ~120° horizontal, ~25° vertical), not 360°. Livox's Risley-prism units (e.g. the Mid-360, Avia, HAP), which steer with rotating prisms rather than a micro-mirror, and many automotive forward-looking units live in this category too, although the Mid-360 is an exception that covers a full 360° horizontally. These sensors are cheaper, more rugged, and lower profile, at the cost of needing several to cover the surround a single spinner gives you.
Livox in particular uses a non-repeating scan pattern: instead of fixed horizontal lines, the beam traces a flower-like pattern that fills in coverage the longer you dwell. This gives very dense clouds with integration time but means a single-frame snapshot is sparser and non-uniform — great for mapping, more awkward for instantaneous obstacle detection.
Flash LiDAR
No scanning at all. A single wide laser pulse floods the whole scene (like a camera flash) and a 2D SPAD/APD detector array times the return at every pixel simultaneously. This is mechanically bulletproof — zero moving parts — and captures a full frame in one shot, ideal for fast-moving scenes. The catch is the range-resolution-FoV triangle: spreading finite laser energy over a wide field starves each pixel, so flash units are short-to-medium range or narrow FoV. They shine as close-range automotive corner sensors and on spacecraft (where they do terrain-relative navigation and docking).
FMCW (and the velocity dividend)
FMCW, introduced above, is as much an architecture as a ranging method because coherent detection changes the whole sensor design. Every point carries instantaneous radial velocity (Doppler), which is transformative for tracking moving objects and for ego-motion estimation. It is largely immune to sun and to other LiDARs. The downsides are cost and complexity (coherent optics and 1550 nm components are not cheap) and historically lower point rates, though that gap is closing. Aeva and a handful of automotive suppliers lead here.
2D vs 3D, and channel count
A 2D LiDAR has a single beam swept in one plane, so it returns a slice (a ring of ranges at one height). This is the bread-and-butter indoor AMR/safety sensor: Slamtec RPLidar, SICK, Hokuyo. Cheap, low data rate, perfect for floor-level obstacle avoidance and 2D SLAM.
A 3D LiDAR stacks many beams (channels/lines) vertically — 16, 32, 64, 128 — to sample a volume. More channels means finer vertical resolution and denser clouds, and roughly linear cost and data-rate scaling.
| Architecture | Moving parts | Typical FoV | Range | Velocity? | Relative cost | Best for |
|---|---|---|---|---|---|---|
| Mechanical spinning | Motor (macro) | 360° H × 22–45° V | 50–250 m | No | $$–$$$ | Surround perception, AVs, mapping |
| Digital spinning (SPAD) | Motor (macro) | 360° H × 22–45° V | 50–200 m | No | $$ | Modern surround, lower cost/channel |
| MEMS / solid-state | Micro-mirror | ~70–120° H × ~25° V | 50–300 m | No | $–$$ | Forward-looking, rugged, low profile |
| Flash | None | ~30–120° H, narrow | 10–100 m | No | $$ | Close range, fast scenes, space |
| FMCW | Varies | ~60–120° forward | 200–500 m | Yes | $$$$ | Long-range AV, ego-motion, interference-heavy |
| 2D scanning | Motor (small) | 270–360° single plane | 8–40 m | No | $ | Indoor AMR, safety, 2D SLAM |
Depth-camera technologies head-to-head
A depth camera produces a per-pixel range image — a "depth map" — that pairs with the RGB image. There are three fundamentally different ways to compute that depth, and confusingly the marketing for all three says "3D camera."
Stereo vision uses two cameras a fixed distance apart and triangulates depth from the disparity between the two views, exactly as human binocular vision does. Active stereo adds an infrared projector that throws texture onto blank surfaces so the matcher always has something to lock onto. The Intel RealSense D400 series is active stereo; the Stereolabs ZED works passively.
Structured light projects a known pattern (dots, stripes, or a coded sequence) and computes depth from how the pattern deforms over the scene's geometry. The original Microsoft Kinect (v1) and Orbbec/PrimeSense sensors are the canonical examples. It is extremely accurate at close range and helpless in sunlight.
Time-of-flight (ToF) cameras put a flash-LiDAR-like principle into a camera: an IR emitter floods the scene and a special sensor measures round-trip time (dToF) or phase (iToF) at every pixel. The Microsoft Azure Kinect and its successor the Orbbec Femto are iToF; some automotive and phone sensors are dToF (with SPAD arrays).
| Property | Stereo (passive/active) | Structured light | ToF (iToF/dToF) |
|---|---|---|---|
| Principle | Triangulation from disparity | Pattern deformation | Light round-trip time/phase |
| Active light? | Optional (active stereo) | Yes (IR pattern) | Yes (IR flood) |
| Close-range accuracy | Good | Excellent (sub-mm to mm) | Good |
| Long-range scaling | Best (widen baseline) | Poor (pattern fades) | Moderate |
| Sunlight outdoors | Works (passive especially) | Fails | Degrades badly |
| Featureless surfaces | Fails (passive); OK (active) | Works | Works |
| Frame rate | High (limited by matching) | Moderate | High |
| Resolution | High (= camera sensor) | High | Lower (sensor-limited) |
| Multipath / scattering | No | Some | Yes (its worst flaw) |
| Typical robotics use | Outdoor + indoor, AMR, AGV | Bin-picking, scanning, close manipulation | Indoor mapping, people, gestures |
| Example products | RealSense D455, ZED 2i, OAK-D | Orbbec, Photoneo, older Kinect v1 | Azure Kinect, Orbbec Femto |
The one-line summary: stereo for outdoors and range, structured light for close-range accuracy, ToF for fast dense indoor depth. The rest of this guide explains why each is true and where each breaks.
Stereo vision deep-dive
Stereo is the most camera-like depth technology, which is exactly why robotics people reach for it first: it is passive, uses ordinary image sensors, scales to long range, and works in sunlight.
Disparity and the depth equation
Two cameras separated by a baseline B see the same point at slightly different horizontal pixel positions. That difference is the disparity d. Depth follows from similar triangles:
Stereo depth: Z = (f · B) / d
Z = depth (m)
f = focal length (pixels)
B = baseline (m)
d = disparity (pixels)
Example: f = 700 px, B = 0.12 m (ZED 2i-ish)
d = 40 px → Z = 700 · 0.12 / 40 = 2.10 m
d = 10 px → Z = 700 · 0.12 / 10 = 8.40 m
d = 4 px → Z = 700 · 0.12 / 4 = 21.0 m
Notice that disparity falls off fast with distance: far objects have tiny disparity, and at some point the disparity drops below one pixel and you simply cannot measure it. That is the stereo range ceiling.
Why error grows with the square of range
Differentiate the depth equation and you get the single most important fact about stereo:
Depth error: ΔZ ≈ (Z² / (f · B)) · Δd
Δd = disparity matching error (≈ 0.1–0.5 px for good matchers)
Example: f = 700 px, B = 0.12 m, Δd = 0.2 px
at Z = 2 m : ΔZ ≈ (4 / 84) · 0.2 ≈ 0.0095 m (~1 cm)
at Z = 8 m : ΔZ ≈ (64 / 84) · 0.2 ≈ 0.152 m (~15 cm)
at Z = 20 m: ΔZ ≈ (400 / 84) · 0.2 ≈ 0.95 m (~1 m)
Depth error scales with Z². Go twice as far and your error quadruples. This is not a defect to be tuned away — it is geometry — and it dictates how you size a stereo rig: to push usable range out, you widen the baseline B or lengthen the focal length f (narrower FoV). A robot that needs accurate depth at 15 m needs a wide-baseline rig (the ZED 2i is 120 mm; long-range survey rigs go to a metre or more), not a 50 mm webcam-style pair.
Rule of thumb: stereo accuracy is set before runtime by baseline and focal length. No matter how good your matcher is, ΔZ ∝ Z² / (f·B). Choose the rig for the range you need.
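The depth and error equations turn into a small rig-sizing helper. This is a sketch with hypothetical function names; it uses the first-order error model from this section and ignores calibration error and lens distortion.

```python
def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def stereo_depth_error_m(depth_m: float, focal_px: float,
                         baseline_m: float, disparity_error_px: float) -> float:
    """dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


def baseline_for_error_m(depth_m: float, focal_px: float,
                         disparity_error_px: float, max_error_m: float) -> float:
    """Baseline needed to hold depth error under max_error_m at depth_m."""
    return depth_m ** 2 * disparity_error_px / (focal_px * max_error_m)
```

With f = 700 px and Δd = 0.2 px, asking `baseline_for_error_m` for 10 cm error at 15 m returns a baseline of roughly 0.64 m, which is the wide-baseline point made above in numbers.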
Calibration
Stereo lives and dies on calibration. You need each camera's intrinsics (focal length, principal point, distortion) and the extrinsics between them (the exact relative pose), then you rectify so corresponding points lie on the same image row — which turns the 2D match into a 1D search and is what makes real-time stereo feasible. A rig knocked out of calibration by a thermal cycle or a bump produces depth that is confidently, smoothly wrong. Factory-calibrated, rigid-baseline modules (RealSense, ZED, OAK-D) exist precisely so you do not hand-calibrate two loose cameras and chase drift forever.
The texture problem and active IR
Passive stereo needs texture to match — distinct features in both images. Point it at a blank white wall, a glossy panel, or a dim corridor and the matcher has nothing to correlate, so depth comes back full of holes. The fix is active stereo: an IR projector (a static dot pattern) sprays artificial texture onto the scene. Crucially the matcher does not need to decode the pattern (that is structured light's job) — it just needs the extra contrast. Intel RealSense D400 series is the canonical active-stereo line: it works in the dark, on blank walls, and still works in sunlight because if there is enough natural texture it falls back to passive matching. That dual nature is why active stereo is the most versatile indoor/outdoor depth camera family.
Structured light
Structured light projects a known, coded pattern — stripes, a pseudo-random dot cloud, or a temporal sequence of patterns — and recovers depth from how that pattern bends over the scene. Because the pattern is known, a single matched feature gives an absolute, high-precision depth, which is why structured light owns the close-range accuracy crown.
How it achieves accuracy
The geometry is triangulation again (projector and camera form the "stereo" pair, one of them replaced by a light source), but the known pattern removes the matching ambiguity that limits passive stereo. With temporally coded patterns (project N shifted patterns, decode per-pixel phase) you can hit sub-millimetre depth precision at 0.3–1 m. This is why industrial 3D scanners and high-end bin-picking sensors (Photoneo PhoXi, Zivid) are structured-light: when you need to find a 2 mm chamfer on a part in a bin, nothing else is this precise.
Why it fails in sunlight
The projected pattern is a few milliwatts of IR. Direct sunlight delivers roughly 1000 W/m² across the spectrum, a chunk of it in the near-IR band the sensor uses. The sun simply overwhelms the projected pattern's contrast — the camera sees sun-flooded pixels, the code is unreadable, and depth collapses. No amount of clever coding beats a four-orders-of-magnitude irradiance gap. Structured light is therefore an indoor technology, full stop. It also degrades with multiple units in the same space (patterns interfere) unless they are time-multiplexed or use distinct codes.
Single-shot vs multi-shot
Multi-shot (temporal coding) is the most accurate but needs a static scene during capture — motion smears the code. Single-shot (a spatially coded pattern decoded from one frame, like Kinect v1's dot cloud) tolerates motion and runs at video rate but is less precise. Choose by whether your scene holds still: a scanner on a static part bin can multi-shot; a sensor on a moving conveyor must single-shot.
Time-of-flight cameras
A ToF camera is, loosely, a flash LiDAR packaged as a camera: an IR emitter floods the whole scene and a specialized 2D sensor measures the round trip at every pixel at once. The result is a dense depth image at high frame rate with no baseline-dependent error — depth is measured directly, not triangulated, so accuracy does not blow up with Z² the way stereo does.
iToF vs dToF
Indirect ToF (iToF) modulates the emitter as a continuous wave and measures phase shift per pixel (the AMCW math from the LiDAR section). It is the mainstream camera approach — Microsoft Azure Kinect and Orbbec Femto are iToF — giving good resolution and precision indoors at 0.5–5 m. Its weaknesses are phase wrapping (handled with multi-frequency) and sensitivity to multipath.
Direct ToF (dToF) times individual photons with SPAD arrays, exactly like dToF LiDAR. It is more robust to multipath and ambient light and scales to longer range, but historically at lower pixel resolution. It is the technology in phone LiDAR sensors and an increasing share of automotive flash units. The lines are blurring as SPAD pixel counts climb.
Multipath: the ToF sensor's signature failure
ToF's worst enemy is multipath interference. The emitted light does not only travel straight to a surface and back — it also bounces off other surfaces and arrives late, corrupting the phase/time measurement. The textbook case is a concave corner: light bounces wall-to-wall before returning, and the corner reads as rounded or pushed back. Shiny floors, retroreflectors, and translucent objects produce similar errors. This is intrinsic to flood illumination and is the reason a structured-light or stereo sensor can beat a ToF sensor on a geometrically tricky scene even when the ToF sensor has better nominal precision.
Ambient light, resolution, and frame rate
ToF cameras compete with ambient IR. Indoors they are excellent; in direct sun the IR background eats dynamic range and depth degrades sharply (better than structured light, worse than passive stereo). Resolution is sensor-limited and historically lower than RGB: the Azure Kinect's depth sensor runs up to 1024×1024 in wide-FoV mode and 640×576 in narrow-FoV mode. Frame rates are high (30 fps typical, sometimes more) and latency is low, which is why ToF wins for gesture, people-tracking, and fast indoor mapping.
ToF range from phase (iToF): R = (c / (4π·f_mod)) · φ
ToF range from time (dToF): R = (c · t) / 2
Frame-to-depth budget at 30 fps:
per-frame time = 1/30 s ≈ 33 ms
iToF often captures multiple sub-frames (phase steps) within that window
→ fast motion within the 33 ms smears depth ("motion blur" in Z)
The numbers that matter
Spec sheets are written to flatter. Here is the engineer's checklist — the parameters that actually decide whether a sensor works in your application, with what to watch for on each.
Range (and at what reflectivity)
Maximum range is meaningless without a target reflectivity. A LiDAR rated "200 m" is usually rated against an 80–90% reflective target; the honest number is the range against a 10% reflective (dark, matte) target, which can be half or less. Always ask "range at 10%." For depth cameras, range is bounded by the technology: structured light to ~2–5 m, ToF to ~5–8 m, stereo to whatever your baseline supports (5–20+ m).
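If a datasheet only gives the high-reflectivity number, you can make a rough first estimate of the 10% figure. The sketch below assumes an idealized model that is not from any datasheet: a diffuse target that fills the beam, a fixed detection threshold, and no atmospheric loss or background noise. Under those assumptions the return power scales with reflectivity over range squared, so maximum range scales with the square root of reflectivity. Real sensors will deviate, so use it as a sanity check and still ask the vendor for the measured curve.

```python
import math


def range_at_reflectivity_m(rated_range_m: float, rated_reflectivity: float,
                            target_reflectivity: float) -> float:
    """Scale a rated max range to another reflectivity, assuming R_max ~ sqrt(rho)."""
    return rated_range_m * math.sqrt(target_reflectivity / rated_reflectivity)
```

For a sensor rated 200 m at 80%, `range_at_reflectivity_m(200, 0.8, 0.1)` lands near 70 m, consistent with the "half or less" warning.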
Accuracy vs precision (vs distance)
These are different and both matter. Accuracy is how close the mean measurement is to truth (bias); precision (or repeatability) is the spread of repeated measurements (noise). A sensor can be precise but inaccurate (consistent 3 cm offset) or accurate but noisy. Both degrade with distance — for stereo as Z², for ToF more gently, for LiDAR roughly flat until SNR collapses. Demand the curve, not a single headline number.
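You can measure both on your own bench: point the sensor at a flat target at a known distance, log repeated readings, and separate bias from spread. A minimal sketch with a hypothetical helper name, using the sample standard deviation as the precision figure:

```python
import statistics


def accuracy_and_precision(measurements_m, true_range_m):
    """Return (bias, spread): mean error vs truth, and sample std of the readings."""
    bias = statistics.fmean(measurements_m) - true_range_m
    spread = statistics.stdev(measurements_m)
    return bias, spread
```

Repeat this at several distances and you have the curve the text says to demand, rather than a single headline number.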
Field of view
Horizontal × vertical FoV sets how much of the world you see per frame. Wide FoV (good for obstacle awareness) trades against angular resolution and range (energy spread thinner). A 360° spinner sees everything; a forward MEMS unit sees a cone; a depth camera sees a frustum (commonly 70–90° H). Mounting a wide-FoV sensor solves "I have a blind spot" far more cheaply than adding a second narrow one.
Resolution: angular and spatial
For LiDAR, angular resolution is the angle between adjacent points: typically 0.1–0.4° horizontally, with vertical resolution set by the channel count and spacing. It determines how far away you can resolve a given object. For depth cameras, spatial resolution is the depth-map size (e.g. 640×480, 1280×720). More resolution is more detail and more compute; match it to the smallest feature you must detect at your working range.
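A quick way to match angular resolution to a feature size is to count how many adjacent beams land on the object. A sketch using the small-angle approximation; the function names are illustrative, and it ignores beam footprint and assumes the object faces the sensor.

```python
import math


def point_spacing_m(range_m: float, angular_res_deg: float) -> float:
    """Lateral distance between adjacent points at a given range."""
    return range_m * math.radians(angular_res_deg)


def points_across(object_width_m: float, range_m: float,
                  angular_res_deg: float) -> int:
    """Whole number of point spacings that fit across the object."""
    return int(object_width_m / point_spacing_m(range_m, angular_res_deg))
```

If a detector needs several hits across an object to trust it, this tells you the range at which the sensor stops providing them.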
Frame rate / point rate
LiDAR quotes points per second; cameras quote fps. Both are throughput. A 128-line spinner at 10 Hz over ~1024 horizontal samples and dual return is on the order of:
LiDAR point rate:
points/s = channels × horizontal_samples × rotation_Hz × returns
128 ch × 1024 az × 10 Hz × 2 returns = 2,621,440 pts/s ≈ 2.6 M pts/s
Bandwidth (XYZ + intensity, 16 bytes/point):
2.6e6 × 16 ≈ 42 MB/s sustained
That is real load on your bus and CPU — see point clouds and data.
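The same budget as a helper you can rerun for each candidate sensor. A minimal sketch; the 16 bytes per point default follows the XYZ-plus-intensity layout above, and the function names are illustrative.

```python
def lidar_point_rate(channels: int, horizontal_samples: int,
                     rotation_hz: float, returns: int = 1) -> float:
    """points/s = channels * horizontal_samples * rotation_Hz * returns."""
    return channels * horizontal_samples * rotation_hz * returns


def bandwidth_mb_per_s(points_per_s: float, bytes_per_point: int = 16) -> float:
    """Sustained payload bandwidth in MB/s (1 MB = 1e6 bytes)."""
    return points_per_s * bytes_per_point / 1e6
```

Multiply the result by the number of sensors on the robot before deciding what shares a network link or a CPU core.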
Minimum range (the forgotten spec)
Every active sensor has a blind zone up close where the return saturates or the geometry breaks. Structured-light and ToF sensors often cannot measure inside 0.2–0.3 m. Stereo has its own near limit: on a wide-baseline rig, close objects fall outside the overlap of the two cameras' frustums or exceed the matcher's maximum disparity. For a wrist-mounted manipulation camera, minimum range is frequently the binding constraint, not maximum: you cannot grasp what is too close to see.
Sunlight performance
The great divider. Passive stereo: works (it loves texture and sunlight provides it). LiDAR: 905 nm degrades, 1550 nm and FMCW shrug it off. ToF: degrades significantly. Structured light: fails. If any part of your robot's life is outdoors in daylight, this single row of the spec table eliminates half the candidates before you read anything else.
Power and thermal
LiDARs draw 8–25 W and run warm; depth cameras draw 1–5 W over USB but the IR projector and the on-board depth ASIC add heat in a sealed enclosure. On a battery robot, sensor power is a real fraction of the budget, and thermal throttling of a depth ASIC in a hot enclosure is a classic field failure.
Rule of thumb: pick the one or two numbers that bind your application (often minimum range and sunlight for manipulators; range-at-10% and angular resolution for outdoor mobile) and treat the rest as tie-breakers. A sensor strong everywhere except your binding spec is the wrong sensor.
Point clouds and data
The output of a 3D sensor is a point cloud: a set of (x, y, z) points, often with intensity, ring index, timestamp, or RGB. It is the universal currency of 3D perception, and it is heavy.
Formats
The common containers: PCD (Point Cloud Library's native format), PLY (interchange/scanning), LAS/LAZ (geospatial/survey), and in robotics the live wire format is ROS 2's sensor_msgs/PointCloud2 — a packed binary buffer with a field descriptor. Depth cameras alternatively publish a depth Image (a 16-bit-per-pixel range map) plus CameraInfo, which you reproject to a cloud only when you need 3D — cheaper to move a depth image than a full cloud.
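Reprojecting a depth image to a cloud is a per-pixel pinhole back-projection. The sketch below makes several assumptions that you should check against your driver: depth is an unsigned 16-bit image in millimetres, 0 means "no data", the image is already undistorted, and fx, fy, cx, cy are the intrinsics from CameraInfo. The function name is hypothetical.

```python
import numpy as np


def depth_to_points(depth_mm: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth image into an (N, 3) array of points in metres.

    Output is in the camera frame: x right, y down, z forward.
    Pixels with depth 0 are treated as invalid and dropped.
    """
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    z = depth_mm.astype(np.float64) / 1000.0        # millimetres to metres
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)
```

Doing this only on the cropped region you care about, and only when a consumer actually needs 3D, is where the bandwidth saving comes from.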
Density and the data-rate problem
Density is points per unit area at a given range, and it falls off with distance (the beam fan diverges). The earlier 2.6 M points/s, ~42 MB/s figure is per sensor — put three on a robot and you have an internal bandwidth and CPU problem before you have written a single perception algorithm. A naive nearest-neighbour query over a million-point cloud is murder; everything downstream assumes you have reduced the cloud first.
Downsampling, voxels, and cropping
The standard toolkit, in order of how often you reach for it:
- Pass-through / ROI crop — discard points outside a box of interest (e.g. ignore everything above 2 m or beyond 10 m). Cheapest, biggest win.
- Voxel grid — overlay a 3D grid of cubes (e.g. 5 cm), replace all points in a cube with their centroid. Uniform density, dramatic point reduction, the default first step.
- Statistical outlier removal — drop points whose neighbour distances are anomalous (kills sensor speckle and rain returns).
- Random / uniform subsampling — when you just need fewer points and do not care which.
Doing this on time is a real-time-systems problem: the filters must keep up with the sensor or your buffers back up and latency climbs — see real-time control. And the perception that runs on the reduced cloud (segmentation, detection) is the bridge back to 2D methods covered in the machine vision guide, increasingly via networks that consume raw points or voxelized clouds directly.
Rule of thumb: never run an algorithm on the raw cloud. Crop to your region of interest, then voxel-downsample to the coarsest resolution your task tolerates. A 5 cm voxel grid often cuts points 10–50× with no loss for navigation.
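The crop-then-voxel recipe fits in a few lines of NumPy. This is a sketch of the idea, not the Point Cloud Library implementation: function names are illustrative, points are an (N, 3) array in metres, and each occupied voxel is replaced by the centroid of its points.

```python
import numpy as np


def crop_box(points: np.ndarray, lo, hi) -> np.ndarray:
    """Keep points inside the axis-aligned box [lo, hi] (pass-through / ROI crop)."""
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask]


def voxel_downsample(points: np.ndarray, voxel_m: float) -> np.ndarray:
    """Replace all points that share a voxel with their centroid."""
    idx = np.floor(points / voxel_m).astype(np.int64)   # integer voxel coordinates
    _, inverse, counts = np.unique(idx, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)                       # 1-D across NumPy versions
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)                    # accumulate per-voxel sums
    return sums / counts[:, None]
```

Cropping first matters: the voxel step costs time proportional to the points you feed it, so discarding everything outside the region of interest makes every later stage cheaper.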
Where each sensor fits
The clean way to choose is by robot class, because the class fixes range, lighting, and the task.
Indoor AMR: 2D LiDAR (+ a depth camera)
An autonomous mobile robot rolling around a warehouse or hospital wants cheap, reliable, floor-level obstacle sensing and 2D SLAM. A single 2D LiDAR (Slamtec RPLidar, SICK, Hokuyo) at 270–360°, 8–25 m, ~10 Hz does the navigation. It is blind to anything off its scan plane — a tabletop, a forklift fork at chest height — so you add a forward-facing depth camera (often a RealSense or OAK-D) to catch overhangs and low obstacles. This 2D-LiDAR-plus-depth-cam pairing is the default AMR stack; the mobile robots guide covers the navigation side in depth.
Outdoor / autonomous vehicle: 3D LiDAR + cameras + radar
Outdoors, at speed, in sun and weather, you need long range, surround coverage, and redundancy. A 3D spinning or solid-state LiDAR (Hesai, Ouster, RoboSense, or FMCW for the long-range channel) provides metric geometry to 100–250 m; cameras add semantics and colour; radar adds velocity and all-weather robustness. No single sensor is trusted alone — the architecture is explicitly redundant and fused because the failure modes are uncorrelated (LiDAR struggles in heavy rain/dust, cameras in glare/dark, radar at fine resolution).
Manipulation: depth camera on the wrist or overhead
A robot arm picking parts needs accuracy at 0.3–1.5 m, not range. Mount a depth camera either eye-in-hand (on the wrist, moving with the gripper for close inspection and active viewpoint selection) or eye-to-hand (fixed overhead, stable world frame). For high-precision bin-picking, structured light (Photoneo, Zivid) wins on accuracy; for general pick-and-place, active stereo or ToF is faster and cheaper. The grasp pose this produces feeds the kinematics and planning covered in the motion planning & kinematics guide, and the gripper choice in the grippers/end-effector literature. Minimum range and the eye-in-hand calibration (hand-eye transform) are the usual integration headaches.
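To see why the hand-eye transform matters for eye-in-hand mounting, here is a minimal sketch of the transform chain that carries a camera-frame grasp point into the robot base frame. The frame names, the helper functions, and every number are hypothetical, and the transforms are pure translations to keep the arithmetic readable; a real calibration result also carries a rotation.

```python
def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    """Homogeneous transform with identity rotation and the given offset."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def apply(T, p):
    """Apply a 4x4 transform to a 3D point."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
                 for i in range(3))

# Hypothetical values, for illustration only.
base_T_flange = translation(0.40, 0.00, 0.60)  # from the arm's forward kinematics
flange_T_cam = translation(0.00, 0.05, 0.08)   # the hand-eye calibration result
p_cam = (0.02, -0.01, 0.35)                    # grasp point in the camera frame

base_T_cam = matmul4(base_T_flange, flange_T_cam)
p_base = apply(base_T_cam, p_cam)
print(p_base)
```

Any error in `flange_T_cam` passes straight through to `p_base`, which is why a sloppy hand-eye calibration looks like a consistently misplaced grasp rather than like noise.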
Humanoid: multi-sensor, fused
A humanoid does all of the above — navigate, perceive obstacles at varying heights, and manipulate — so it carries a suite: a head depth camera or two for manipulation and near-field, often a LiDAR or 360° camera ring for locomotion awareness, plus an IMU for the balance loop. The defining problem is fusion across a moving, articulated body: every sensor's pose changes as the robot walks, so the transform tree (and its timing) is as critical as any single sensor. The humanoid hardware guide covers the platform; the takeaway here is that humanoids are the ultimate sensor-fusion problem, not a single-sensor problem.
SLAM and sensor fusion
A 3D sensor produces geometry in its own frame. SLAM (Simultaneous Localization And Mapping) is what turns a stream of those frames into a consistent map and a robot pose within it. It is the dominant consumer of LiDAR and depth data on mobile robots.
LiDAR SLAM
Geometric and robust. Algorithms like LOAM (LiDAR odometry and mapping), LIO-SAM (LiDAR-inertial), and point-to-plane ICP variants register successive scans by matching geometry: edges, planes, surfaces. LiDAR SLAM is accurate, works in the dark, and is largely lighting-independent, which is why it dominates outdoor and large-scale mapping. Its weaknesses are geometrically degenerate environments (a long featureless corridor or tunnel where every scan looks the same) and the cost/bulk of the sensor.
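Scan registration is easiest to picture in 2D. The sketch below shows only the inner alignment step: given points that are already paired between two scans, it solves for the least-squares rotation and translation in closed form. It is a simplification and not LOAM, LIO-SAM, or point-to-plane ICP. A real ICP loop would re-estimate correspondences by nearest-neighbour search and iterate, and the function name and test data here are assumptions.

```python
import math

def align_2d(src, dst):
    """Least-squares rigid transform (theta, tx, ty) mapping paired src onto dst."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - csx, sy - csy      # centred source point
        bx, by = dx - cdx, dy - cdy      # centred destination point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, tx, ty

# Hypothetical scan of an L-shaped corner, then the same corner seen after
# the robot has rotated 5 degrees and shifted by (0.10, -0.02) m.
scan_a = [(x / 10.0, 0.0) for x in range(11)] + [(0.0, y / 10.0) for y in range(1, 11)]
true_theta, true_t = math.radians(5.0), (0.10, -0.02)
scan_b = [(math.cos(true_theta) * x - math.sin(true_theta) * y + true_t[0],
           math.sin(true_theta) * x + math.cos(true_theta) * y + true_t[1])
          for x, y in scan_a]

theta, tx, ty = align_2d(scan_a, scan_b)
print(math.degrees(theta), tx, ty)
```

The corner constrains both axes. Feed the same function points from a single straight wall and the translation along the wall is unobservable, which is the corridor degeneracy described above.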
Visual SLAM
Cheap and feature-rich. ORB-SLAM, VINS-Fusion, and similar track visual features (or direct pixel intensities) across frames, often fused with an IMU (visual-inertial odometry). Cameras are cheap, light, low-power, and carry semantics LiDAR cannot. The weaknesses mirror cameras': they fail in the dark, in low texture, and under rapid lighting change, and monocular visual SLAM has an inherent scale ambiguity (you do not know absolute metres without a second camera, a depth sensor, or an IMU to anchor scale).
Fusion and loop closure
The strong systems fuse: LiDAR for metric geometry, camera for texture/semantics, IMU for high-rate motion between frames. Fusion fills each sensor's blind spots — the IMU bridges the gap when LiDAR sees a featureless wall; the camera resolves which way a symmetric corridor actually goes.
Every SLAM system fights drift: small per-frame errors accumulate into a map that bends. Loop closure is the fix — recognizing a previously visited place and adding a constraint that snaps the accumulated error back into consistency. Reliable loop closure (visual bag-of-words, LiDAR scan-context descriptors) is what separates a map that closes neatly when you return to the start from one that shows two offset copies of your office. The pose estimate this produces feeds straight into the planner — see the motion planning & kinematics guide.
Rule of thumb: odometry tells you how far you have moved; loop closure tells you where you actually are. A SLAM system without robust loop closure is just dead reckoning with extra steps.
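A toy simulation makes the drift-versus-closure point concrete. The robot below drives a 10 m square with a small per-step heading bias, so dead reckoning does not return to the start. The correction spreads the end-to-start residual linearly along the path, which is a deliberately crude stand-in for the pose-graph optimization a real SLAM back end would run. All names and numbers are assumptions for illustration.

```python
import math

def dead_reckon(steps, step_len, turn_every, heading_bias):
    """Integrate odometry around a square; heading_bias (rad/step) models drift."""
    x = y = heading = 0.0
    path = [(x, y)]
    for i in range(steps):
        if i > 0 and i % turn_every == 0:
            heading += math.pi / 2       # commanded 90 degree turn at each corner
        heading += heading_bias          # small unmodelled error on every step
        x += step_len * math.cos(heading)
        y += step_len * math.sin(heading)
        path.append((x, y))
    return path

def close_loop(path):
    """Spread the end-to-start residual linearly over the whole path."""
    ex = path[-1][0] - path[0][0]
    ey = path[-1][1] - path[0][1]
    n = len(path) - 1
    return [(px - ex * i / n, py - ey * i / n) for i, (px, py) in enumerate(path)]

def gap(path):
    """Distance between the last and first pose of a path."""
    return math.hypot(path[-1][0] - path[0][0], path[-1][1] - path[0][1])

raw = dead_reckon(steps=40, step_len=1.0, turn_every=10,
                  heading_bias=math.radians(0.2))
fixed = close_loop(raw)
print(f"gap before closure: {gap(raw):.2f} m, after: {gap(fixed):.2e} m")
```

The correction only works because something recognized that the last pose and the first pose are the same place. Without that constraint there is no residual to distribute.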
Selecting a 3D sensor
Choose in this order — each criterion eliminates candidates before the next: range → lighting → accuracy → field of view → budget → integration.
The decision flow
- Range and minimum range. Indoor close (0.3–2 m)? Depth camera. Indoor mid (2–8 m)? Depth camera or 2D LiDAR. Outdoor or beyond 10 m? LiDAR. Check the minimum range against your closest target.
- Lighting. Any direct sun? Eliminate structured light immediately; favour passive/active stereo or 1550 nm/FMCW LiDAR. Dark or featureless indoors? Eliminate passive stereo; use active stereo, ToF, or LiDAR.
- Accuracy. Sub-mm at close range for inspection/bin-picking? Structured light. Centimetres for navigation? Almost anything. Remember stereo's Z² error growth.
- Field of view. Need 360°? Spinning LiDAR or a camera ring. A forward cone is enough? MEMS LiDAR or a single depth camera.
- Budget and power. 2D LiDAR and depth cameras are cheap and low-power; 3D and FMCW LiDAR are not.
- Integration. A ROS 2 driver, good documentation, and a stable point-cloud timestamp are worth more than 5% on any spec.
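The first four steps of the flow above can be written as a small elimination filter. This is a rough sketch, not a selection tool: the function name, the family labels, and the choice to treat anything past 8 m as LiDAR territory are assumptions, camera rings are ignored for the 360° case, and budget and integration are left to human judgment.

```python
def shortlist(max_range_m, direct_sun, dark_or_textureless, needs_sub_mm, needs_360):
    """Apply range -> lighting -> accuracy -> field of view; return survivors."""
    depth_cams = {"passive stereo", "active stereo", "ToF", "structured light"}

    # 1. Range (thresholds follow the list above; the 8 m cut-off is an assumption)
    if max_range_m <= 2:
        candidates = set(depth_cams)
    elif max_range_m <= 8:
        candidates = depth_cams | {"2D LiDAR"}
    else:
        candidates = {"3D LiDAR"}

    # 2. Lighting
    if direct_sun:
        candidates -= {"structured light"}
    if dark_or_textureless:
        candidates -= {"passive stereo"}

    # 3. Accuracy: sub-mm work leaves only structured light, if it survived
    if needs_sub_mm:
        candidates &= {"structured light"}

    # 4. Field of view: this sketch only lets LiDAR satisfy a 360 degree need
    if needs_360:
        candidates &= {"2D LiDAR", "3D LiDAR"}

    return sorted(candidates)

print(shortlist(1.0, False, False, True, False))   # indoor bin-picking cell
print(shortlist(50.0, True, False, False, True))   # outdoor vehicle
print(shortlist(1.0, True, False, True, False))    # sub-mm in direct sun
```

An empty result is informative: it means the requirements conflict, and something (usually the lighting) has to change before any sensor will work.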
Real-product comparison
Representative 2026 products with defensible figures (always confirm against the current datasheet — variants differ):
| Product | Type | Range (typ) | FoV (H×V) | Resolution / channels | Rate | Notes |
|---|---|---|---|---|---|---|
| Slamtec RPLidar A3 | 2D LiDAR | ~25 m | 360° | 0.225° ang. | 10–20 Hz | Cheap indoor AMR / 2D SLAM |
| Ouster OS1-128 | 3D digital spinning | ~120–170 m | 360° × 45° | 128 ch | 10–20 Hz | SPAD/CMOS, ~2.6 M pts/s |
| Hesai Pandar XT32 | 3D spinning | ~120 m | 360° × 31° | 32 ch | 10–20 Hz | Robust mid-range mobile |
| Livox Mid-360 | Solid-state (prism) | ~40–70 m | 360° × 59° | non-repeating | 10 Hz | Low cost, dense w/ integration |
| Intel RealSense D455 | Active stereo | 0.6–6 m | ~87° × 58° | up to 1280×720 depth | up to 90 fps | Works in sun + dark; 95 mm baseline |
| Stereolabs ZED 2i | Passive stereo | 0.3–20 m | ~110° | up to 2208×1242 | 15–100 fps | 120 mm baseline; outdoor range |
| Luxonis OAK-D Pro | Active stereo + NPU | 0.3–12 m | ~80° | 1280×800 depth | ~30–60 fps | On-board AI inference |
| Microsoft Azure Kinect / Orbbec Femto | iToF | 0.25–5.5 m | 75°×65° (narrow) / 120°×120° (wide) | up to 1024×1024 depth | up to 30 fps | Dense indoor depth; multipath-prone |
| Photoneo PhoXi | Structured light | 0.4–2 m | scanner | sub-mm | ~few Hz | Bin-picking accuracy king |
(Figures are nominal and configuration-dependent; "range" for LiDAR is at favourable reflectivity unless noted.)
Integration notes (ROS 2)
Nearly every sensor above ships a ROS 2 driver. The patterns to know:
- LiDAR publishes sensor_msgs/PointCloud2 (and often a per-point timestamp/ring field crucial for de-skewing motion). Ouster, Hesai, Livox, and Slamtec all maintain ROS 2 drivers; Livox uses its own CustomMsg, which you usually convert.
- Depth cameras publish a depth Image, a CameraInfo, and optionally a PointCloud2. The realsense2_camera, zed_ros2_wrapper, and depthai-ros packages are the standard wrappers.
- Time synchronization is the silent killer: if your LiDAR, camera, and IMU timestamps are not on the same clock (PTP/hardware sync or careful host-side stamping), fusion and SLAM degrade in ways that look like sensor noise but are really timing. Solve clocking before you blame the algorithm.
- TF tree: every sensor needs an accurate static (or dynamic, for articulated bodies) transform to the robot base. A 2 cm or 1° error in a sensor mount becomes a systematic position error in every point downstream, and the angular part grows with range.
The ROS 2 guide covers the middleware, QoS, and time-handling that make or break a multi-sensor perception stack.
Rule of thumb: budget as much engineering time for the driver, timestamps, and TF tree as for selecting the sensor. The hardware rarely fails; the integration usually does.
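To put numbers on the TF-tree point, the sketch below estimates how far a measured point lands from its true position when the sensor mount is mis-measured. It assumes the simplest worst case, where the angular and translational errors add directly, and the function name is invented for the example.

```python
import math

def mount_error_at_range(range_m, angle_err_deg=0.0, offset_err_m=0.0):
    """Worst-case position error of a point at range_m from a mis-measured mount."""
    angular = range_m * math.tan(math.radians(angle_err_deg))
    return angular + offset_err_m

for r in (1.0, 5.0, 20.0):
    err = mount_error_at_range(r, angle_err_deg=1.0, offset_err_m=0.02)
    print(f"range {r:5.1f} m -> up to {err * 100:.1f} cm off")
```

The offset term is constant, but the angular term scales with distance, so a mount error that is invisible on a depth camera at 1 m can dominate a LiDAR return at 20 m.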
Frequently asked questions
Do I need LiDAR if I already have a depth camera? Often no, indoors and at short range — a good active-stereo or ToF camera covers 0.3–6 m densely and cheaply. You need LiDAR when you go outdoors in sun, need range beyond ~10 m, need 360° coverage, or need lighting-independent geometry for robust SLAM. Many robots run both: LiDAR for the long/wide picture, depth cam for the close/dense one.
Why does my depth camera have holes in the depth image? Holes mean the sensor got no usable measurement for those pixels. For passive stereo it is lack of texture (blank walls, glossy surfaces); for structured light or ToF it is sun saturation, an out-of-range surface, a specular reflection bouncing the light away, or a black/absorptive material. Active IR projection, lighting control, or a different technology fixes most of it.
905 nm or 1550 nm LiDAR — which should I buy? For most robotics (indoor, mobile, mid-range) 905 nm is cheaper and entirely adequate. Choose 1550 nm when you need long range (200 m+), strong sun robustness, or higher optical power within eye-safe limits — typically automotive and outdoor long-range applications. You will pay substantially more for the InGaAs detector and laser.
What is the real difference between accuracy and precision for these sensors? Accuracy is bias — how far the average reading is from truth. Precision is repeatability — how much repeated readings of the same point scatter. A sensor can be precise but biased (consistent 3 cm offset, correctable by calibration) or accurate but noisy (right on average, useless per-frame). Calibration fixes accuracy; averaging or a better sensor fixes precision. Specify both, versus distance.
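The distinction is easy to compute from repeated readings of a target at a known distance. In this sketch bias stands in for accuracy and sample standard deviation for precision; the two sets of readings are made up to show one sensor of each kind.

```python
import statistics

def accuracy_and_precision(readings_m, true_m):
    """Return (bias, spread): mean error and sample standard deviation, in metres."""
    bias = statistics.fmean(readings_m) - true_m
    spread = statistics.stdev(readings_m)
    return bias, spread

TRUE_DISTANCE_M = 2.000

# Hypothetical readings: sensor A is tight but offset, sensor B is centred but noisy.
sensor_a = [2.031, 2.029, 2.030, 2.030, 2.031, 2.029]
sensor_b = [1.97, 2.04, 1.99, 2.02, 1.96, 2.02]

for name, readings in (("A", sensor_a), ("B", sensor_b)):
    bias, spread = accuracy_and_precision(readings, TRUE_DISTANCE_M)
    print(f"sensor {name}: bias {bias * 1000:+.1f} mm, spread {spread * 1000:.1f} mm")
```

Repeat the measurement at several distances, since both numbers generally change with range.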
Why is my ToF camera reading corners as rounded or pushed back? Multipath. Light bounces between the two walls of the corner and arrives late, corrupting the per-pixel time/phase measurement. It is intrinsic to flood-illuminated ToF. Mitigations: multi-frequency capture, multipath-aware processing, or switching to structured light/stereo for geometrically tricky scenes.
Can stereo or structured light work outdoors? Passive stereo: yes, and it often does well in sunlight, because strong ambient light brings out the surface texture it needs to match. Active stereo: yes, falling back to passive matching when the IR projector is washed out. Structured light: no, direct sun (~1000 W/m²) overwhelms the milliwatt projected pattern. ToF: degraded but sometimes usable in shade.
How far can a stereo camera actually measure? It depends entirely on baseline B and focal length f, because Z = f·B/d and error grows as Z². A 95–120 mm baseline module is good to roughly 6–20 m before error becomes unusable; survey rigs with metre-class baselines reach much further. There is no fixed answer — compute ΔZ ≈ Z²·Δd/(f·B) for your rig and your accuracy tolerance.
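Here is that calculation as a short sketch. It assumes an ideal pinhole model to turn a horizontal field of view into a focal length in pixels, and a disparity error of 0.1 px, which is a placeholder you should replace with a measured value for your matcher. The 1280 px width, 87° field of view, and 95 mm baseline are loosely based on the D455 row in the table above; the function names are invented.

```python
import math

def focal_px(width_px, hfov_deg):
    """Pinhole focal length in pixels from image width and horizontal FoV."""
    return (width_px / 2) / math.tan(math.radians(hfov_deg) / 2)

def stereo_depth_error(z_m, baseline_m, f_px, disparity_err_px):
    """Depth uncertainty from dZ ~= Z^2 * dd / (f * B)."""
    return z_m ** 2 * disparity_err_px / (f_px * baseline_m)

f = focal_px(width_px=1280, hfov_deg=87.0)   # assumed pinhole, no distortion
for z in (1.0, 2.0, 6.0, 20.0):
    dz = stereo_depth_error(z, baseline_m=0.095, f_px=f, disparity_err_px=0.1)
    print(f"Z = {z:5.1f} m -> dZ ~ {dz * 100:.1f} cm")
```

Doubling the distance quadruples the error, while doubling the baseline or the focal length halves it. That trade is what decides whether a given module is usable at your working range.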
What sensor should I put on a robot arm for picking? A depth camera, mounted eye-in-hand (on the wrist) or eye-to-hand (fixed overhead). For precision bin-picking of small or shiny parts, structured light (Photoneo, Zivid). For general pick-and-place, active stereo (RealSense, OAK-D) or ToF. The binding spec is usually minimum range and the hand-eye calibration, not maximum range.
Is FMCW LiDAR worth the premium? If you need per-point velocity (instant moving-object detection, better ego-motion), strong immunity to sunlight and to other LiDARs, and long range, yes. For an indoor AMR or a short-range manipulator, no — you are paying for capabilities you will not use. It is an automotive and long-range outdoor technology today.
How do I keep point-cloud processing real-time? Reduce the cloud before you process it: crop to your region of interest, then voxel-downsample (a 5 cm grid commonly cuts points 10–50× for navigation), then run outlier removal. Profile against the sensor's frame period — if a filter takes longer than 1/rate, buffers back up and latency grows. See the real-time control guide.
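A minimal way to profile against the frame period is sketched below. The helper names are invented, the stage being timed is a dummy, and the 50% headroom default is an assumption rather than a standard; pick whatever margin your latency budget actually allows.

```python
import time

def frame_period_s(rate_hz):
    """Time available per frame at the sensor's publish rate."""
    return 1.0 / rate_hz

def worst_case_time(fn, *args, repeats=5):
    """Run fn a few times and return the slowest wall-clock duration in seconds."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        worst = max(worst, time.perf_counter() - start)
    return worst

def fits_budget(elapsed_s, rate_hz, headroom=0.5):
    """True if a stage uses no more than `headroom` of the frame period."""
    return elapsed_s <= headroom * frame_period_s(rate_hz)

def dummy_filter(points):
    """Stand-in for a real filter stage: keep points above the floor."""
    return [p for p in points if p[2] > 0.0]

points = [(i * 0.001, 0.0, (i % 7) - 3.0) for i in range(20000)]
elapsed = worst_case_time(dummy_filter, points)
print(f"worst case {elapsed * 1000:.2f} ms, fits 10 Hz budget: {fits_budget(elapsed, 10.0)}")
```

Measure the worst case rather than the mean, because one slow frame is enough to start a backlog.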
LiDAR SLAM or visual SLAM? LiDAR SLAM is more robust and lighting-independent — use it outdoors, in the dark, or where geometry is rich. Visual SLAM is cheaper, lighter, and carries semantics — good indoors with texture and on cost/weight-constrained platforms. The best systems fuse both with an IMU and rely on loop closure. Geometrically degenerate spaces (long corridors, tunnels) hurt LiDAR SLAM and favour fusion.
Why do my fused sensors disagree even though each one is calibrated? Almost always timing or TF. If the sensors are not on a synchronized clock, a moving robot stamps the same world point at slightly different times, and fusion smears it. Likewise a small error in the static transform between sensors becomes a systematic offset. Fix clocking (PTP/hardware sync) and the TF tree before suspecting the sensors — see the ROS 2 guide.
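A back-of-envelope estimate shows how little clock skew it takes. The sketch assumes a static point, constant velocity over the skew interval, a small-angle approximation for the rotation, and the worst case where the two contributions add. The function name and the sample numbers are illustrative only.

```python
import math

def apparent_offset_m(clock_skew_s, speed_mps, yaw_rate_dps, range_m):
    """Approximate disagreement between two sensors observing one static point."""
    translational = speed_mps * clock_skew_s
    rotational = range_m * math.radians(yaw_rate_dps) * clock_skew_s
    return translational + rotational

# Hypothetical AMR: 1.5 m/s forward, turning at 45 deg/s, point 5 m away.
for skew_ms in (1, 5, 20):
    offset = apparent_offset_m(skew_ms / 1000.0, speed_mps=1.5,
                               yaw_rate_dps=45.0, range_m=5.0)
    print(f"skew {skew_ms:2d} ms -> ~{offset * 100:.1f} cm disagreement")
```

The rotational term usually dominates at range, which is why timing errors show up most clearly while the robot is turning.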