# Robo2u Blog — Full Content > Complete plain-text dump of every guide on blog.robo2u.com. This file is provided for LLM crawlers and retrieval systems that prefer ingesting full content in a single fetch. For the curated overview, see /llms.txt. Each guide is canonically published at the URL preceding it. Site: https://blog.robo2u.com Publisher: Robo2u Author: Robo2u Editorial --- # Stepper Motors & Drivers: The Ultimate Guide URL: https://blog.robo2u.com/posts/stepper-motors-ultimate-guide/ Published: 2026-06-19 Updated: 2026-06-20 Tags: stepper-motors, steppers, microstepping, nema-17, closed-loop-stepper, stepper-driver, motion-control, robotics-hardware, guide Reading time: 36 min > An engineer-grade guide to stepper motors and drivers: how steps and microsteps really work, NEMA frame sizes, the torque-speed curve, resonance and missed steps, A4988 vs Trinamic TMC drivers, closed-loop steppers, and honest sizing math. A stepper motor is the most honest actuator in the catalog and the most misunderstood. It is honest because it does exactly one thing: given a pulse, it advances the rotor by a fixed angle and holds there. No feedback, no controller smarts, no surprises — until you ask it to go fast or push hard, at which point it lies to you silently by skipping steps and never telling anyone. That gap between "it just works" and "it failed without a fault flag" is where most stepper grief lives. The misunderstanding usually starts with microstepping. Marketing puts "1/256 microstepping, 51,200 steps/rev" on the box and an engineer reads that as 51,200 distinct positions of usable resolution. It is not. A 1.8° stepper is accurate to maybe ±5% of a full step no matter how finely you slice it, and most of those microsteps carry so little incremental torque they cannot move the load against friction. Understanding *why* is the difference between using microstepping as the smoothing tool it actually is and trusting it as the precision tool it pretends to be. 
> **The take**: The stepper's superpower is open-loop positioning with zero tuning and zero feedback hardware — and that is also its trap. The single most expensive mistake is sizing on holding torque (the big number on the datasheet) and ignoring the torque-speed curve, where usable torque collapses as RPM climbs. A stepper picked on holding torque alone will stall the first time it accelerates a real load. Size on pull-out torque *at your operating speed*, drive it from a high bus voltage through a current-chopping driver, and either keep a comfortable margin or add an encoder and stop pretending it is open-loop. Companion reading: [servo motors](/posts/servo-motors-ultimate-guide/), [brushless DC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [motor controllers and FOC](/posts/motor-controllers-foc-ultimate-guide/), [encoders](/posts/encoders-ultimate-guide/), and [linear motion systems](/posts/linear-motion-systems-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What a stepper motor actually is](#what-is) 3. [How a stepper works: phases, detents, and step modes](#how-it-works) 4. [NEMA frame sizes and what they mean](#nema-frames) 5. [Unipolar vs bipolar](#unipolar-bipolar) 6. [The torque-speed curve and the four torques](#torque-speed) 7. [Resonance, missed steps, and how to avoid them](#resonance) 8. [Microstepping: resolution vs usable torque](#microstepping) 9. [Stepper drivers: A4988/DRV8825 vs Trinamic TMC](#drivers) 10. [Closed-loop steppers: bolt on an encoder](#closed-loop) 11. [Steppers vs servos vs BLDC: an honest decision guide](#vs) 12. [Sizing and selection](#sizing) 13. [Applications](#applications) 14. [Frequently asked questions](#faq) ## Key takeaways - A stepper is a **brushless motor with many magnetic detents** that you position by counting pulses, open-loop. It holds position by holding current in its windings — no feedback sensor required, which is the whole point and the whole risk. 
- The two standard step angles you will actually use are **1.8°/step (200 steps/rev)** and **0.9°/step (400 steps/rev)** hybrid steppers. Everything else is a niche.
- **Holding torque is the headline number and the wrong one to size on.** Torque falls with speed along the **pull-out (torque-speed) curve**; at a few hundred RPM a NEMA 17 may have a third of its holding torque left.
- **Microstepping improves smoothness and reduces resonance, not accuracy.** Incremental torque per microstep follows a sine, so the finest microsteps make near-zero torque. Positional accuracy stays at roughly **±5% of a full step** regardless of microstep ratio.
- A stepper's torque comes from **current**, not voltage; speed capability comes from **voltage**. Drive a low-resistance, low-inductance stepper from **24–48 V** through a chopping driver to push current into the windings fast enough at speed.
- **Cheap step/dir drivers (A4988, DRV8825)** chop current with fixed off-time and are fine for printers; **Trinamic TMC2209/TMC5160** add quiet StealthChop, high-torque SpreadCycle, sensorless stall detection (StallGuard), and UART/SPI configuration.
- **StealthChop is silent but soft**; **SpreadCycle is louder but holds torque at speed.** Real machines often run StealthChop at low speed and switch to SpreadCycle above a velocity threshold.
- **Resonance near 100–200 full-steps/s (about 0.5–1 rev/s on a 1.8° motor)** can make a stepper lose all torque and stall. Microstepping, a little load inertia, mechanical damping, and avoiding constant speeds in the resonant band are the fixes.
- A **closed-loop stepper** adds a rotor encoder and a controller that closes a position/current loop, turning the stepper into a many-pole servo (Teknic ClearPath-SC, Leadshine, Oriental Motor AlphaStep). It cannot skip steps silently and runs cooler.
- Steppers win on **cost, low-speed holding torque, and zero-tuning open-loop positioning**; servos and BLDC win on **high-speed power density, efficiency, and dynamic response**.
Crossover is around a few hundred RPM and a few hundred watts. - Steppers **dissipate full rated current even at rest** (holding), so they run hot — 60–80 °C surface is normal. A servo at rest with no load draws almost nothing. - Size with margin: pick a motor whose **pull-out torque at your top speed** exceeds your worst-case load torque by **about 1.5–2×**, then verify current, voltage headroom, and thermal rise. ## What a stepper motor actually is A stepper motor is a brushless permanent-magnet (or hybrid) motor built with a large number of magnetic poles so that, instead of spinning freely, it snaps to a sequence of discrete equilibrium positions — steps. You move it by energizing its windings in a pattern that walks those equilibrium points around the rotor. Count the patterns and you know, in principle, exactly where the shaft is. That "in principle" is doing heavy lifting. A stepper is the canonical **open-loop positioning device**: there is no encoder, no sensor, no controller checking whether the rotor actually followed. You command a step, the driver pushes current, and the motor *should* move one increment. If the load torque exceeds what the motor can deliver at that instant, the rotor fails to advance — it **skips a step** — and nothing tells you. The commanded count and the real position diverge, permanently, until you re-home. Contrast this with a [servo](/posts/servo-motors-ultimate-guide/), which closes a loop around a feedback sensor and refuses to be wrong silently. The stepper trades that self-correction for radical simplicity: no encoder to buy, no loop to tune, no commutation feedback. For a huge class of machines — 3D printers, small CNC, lab automation, optics stages — that trade is exactly right, because the loads are predictable and a generous torque margin makes skipped steps a non-event. 
### Why steppers persist You could ask why, in 2026, anyone uses an open-loop actuator at all when a brushless servo gives you feedback for not much more money. Three reasons keep steppers alive: 1. **Zero-speed holding torque without a control loop.** A stepper just *holds* — energize the windings and the rotor sits in a magnetic detent, stiff and repeatable, with no tuning and no risk of loop instability. A servo holding position is a closed loop fighting to stay at zero error. 2. **Deterministic open-loop positioning.** No homing math, no observer, no encoder alignment. Step count is position. This makes firmware trivial — one reason every hobby 3D printer is a stepper machine. 3. **Cost.** A NEMA 17 stepper plus a $5 driver chip undercuts any servo-grade closed-loop axis. At low speed and modest power, nothing beats it on dollars per positioned axis. > **Rule of thumb**: if your axis spends most of its life holding a static position at low speed, and you can afford a 2× torque margin, a stepper is almost always the cheapest correct answer. If it spends its life accelerating hard or running fast, look at a servo or BLDC. ## How a stepper works: phases, detents, and step modes The dominant type by a wide margin is the **hybrid stepper** — a permanent-magnet rotor with finely toothed pole pieces, surrounded by a stator wound in two phases (call them A and B). The "hybrid" name is because it combines the permanent-magnet rotor of a PM stepper with the toothed reluctance structure of a variable-reluctance stepper, getting the best of both: strong torque and fine step angle. ### Where the steps come from A standard hybrid stepper has **50 rotor teeth**. Each electrical cycle of the two phases advances the rotor by four steps, and there are 50 such cycles per revolution: ``` Steps per revolution = rotor teeth × 4 = 50 × 4 = 200 full steps/rev Full-step angle = 360° / 200 = 1.8° ``` That is where the ubiquitous **1.8° / 200-step** stepper comes from. 
A 0.9° stepper has 100 rotor teeth and gives 400 steps/rev. Cheaper or specialized parts exist at 7.5° (48 steps/rev, common in old PM steppers) and other angles, but in robotics and motion control you will see 1.8° everywhere and 0.9° when you want finer native resolution. ### Energizing the phases The two phases are electromagnets. The rotor's permanent magnet wants to align with the net stator field. By controlling the *direction* and *magnitude* of current in phase A and phase B, you steer that net field vector, and the rotor follows it to the new equilibrium — the next detent. - **Full step (one phase on):** energize A+, then B+, then A−, then B−. Four positions per electrical cycle, lowest resolution, simplest. - **Full step (two phases on):** energize A and B together, in the four sign combinations. Same 1.8° step but ~40% more torque because both windings contribute, at the cost of more heat. This is the normal full-step mode. - **Half step:** alternate between one-phase-on and two-phases-on states, doubling the positions to eight per cycle — 400 half-steps/rev for a 1.8° motor. Torque ripples between the two states. - **Microstep:** instead of full-on/full-off, the driver feeds *sinusoidally weighted* current to both phases so the field vector points at intermediate angles. Now the rotor settles between the full-step detents. ``` Microstep resolution: steps/rev = 200 × microstep_ratio Full step (1/1) -> 200 steps/rev (1.8°) Half step (1/2) -> 400 steps/rev (0.9°) 1/4 step -> 800 steps/rev (0.45°) 1/8 step -> 1,600 steps/rev (0.225°) 1/16 step -> 3,200 steps/rev (0.1125°) 1/32 step -> 6,400 steps/rev (0.05625°) 1/256 step -> 51,200 steps/rev (0.00703°) <- not 51,200 useful positions ``` ### Detent torque vs holding torque Cut power entirely and a hybrid stepper still resists rotation a little — you can feel the "clicks" if you turn the shaft by hand. 
That is **detent torque** (also called residual torque), produced by the permanent magnet alone, with no current. It is typically **5–10% of holding torque** and you mostly account for it as a nuisance: it adds to the torque the motor must overcome at microstep boundaries and degrades microstep accuracy.

**Holding torque** is the torque the energized motor produces to resist being pushed off a step, at rated current, standing still. It is the big datasheet number, and as the next sections hammer home, it is not the number that determines whether your machine works at speed.

## NEMA frame sizes and what they mean

"NEMA 17" tells you the **faceplate size and bolt pattern**, nothing else. NEMA frame numbers are the nominal faceplate width in tenths of an inch: NEMA 17 = 1.7 in, which on real motors is a square face of about 42.3 mm. It says nothing about length, torque, current, or step angle. A 20 mm-long NEMA 17 pancake and a 60 mm-long NEMA 17 are both "NEMA 17" with a 3–4× torque difference.

So the frame number is a mounting and rough-size category; **torque comes from the frame size *and* the body length** (more iron and copper, more torque). Within a frame you choose length to get the torque you need.

| NEMA frame | Face size | Common holding torque | Typical rated current | Where it fits |
|---|---|---|---|---|
| NEMA 11 | 28 mm (1.1 in) | 0.06–0.12 N·m | 0.5–0.67 A | Optics stages, small lab automation, cameras |
| NEMA 17 | 42.3 mm (1.7 in) | 0.2–0.65 N·m | 0.8–2.0 A | 3D printers, desktop CNC, small robots, pipetting |
| NEMA 23 | 56.4 mm (2.3 in) | 0.9–3.0 N·m | 2.0–4.5 A | CNC routers, larger gantries, conveyors, automation |
| NEMA 34 | 86 mm (3.4 in) | 3.0–13 N·m | 4.0–6.0 A | Large CNC, lathes, plasma tables, heavy gantries |

A few practical notes the table can't carry:

- **Length matters as much as frame.** A NEMA 17 "high-torque" 48 mm body (e.g. a 0.55 N·m unit) and a NEMA 17 pancake (20 mm, ~0.1 N·m) share a bolt pattern and almost nothing else.
Always read holding torque *and* body length, not just the frame. - **Higher current ≠ more torque for free.** Rated current sets how much copper loss (heat) the windings tolerate. Two motors of equal torque can have very different current ratings depending on winding turns; the low-current/high-resistance one needs more voltage to run fast. - **NEMA 23 is the workhorse of small CNC.** It hits the sweet spot of torque, cost, and driver availability. NEMA 34 is where you start considering a servo instead, because at that power level the stepper's low-speed-only advantage erodes. - **Shaft and mounting are not standardized within a frame.** NEMA 17 commonly uses a 5 mm shaft; NEMA 23 a 6.35 mm (1/4 in) or 8 mm; check before you buy pulleys and couplers. > **Rule of thumb**: choose the frame for the bolt pattern and rough torque class, then choose the body length for the actual torque. Buying "a NEMA 23" without specifying length is like buying "a bolt" without a length. ## Unipolar vs bipolar A stepper's two phases can be wired two ways, and it changes the driver you need and the torque you get. **Bipolar** steppers have two windings, four wires total. To reverse the field in a winding you reverse the current through it, which requires an **H-bridge per phase** — the driver must source and sink current both ways. This is what every modern driver (A4988, DRV8825, TMC) does. Bipolar uses the full copper of each winding in both directions, so it gives the most torque per size. Four-wire steppers are bipolar-only. **Unipolar** steppers add a center tap on each winding — typically six wires (two windings + two center taps) or eight wires (every coil end brought out). The center tap lets a simple driver reverse the field by energizing one half-coil or the other, using cheap single-transistor switches instead of H-bridges. 
The catch: only **half the winding carries current at a time**, so for the same copper you get roughly **70% of the bipolar torque**. Half the turns are active, and at the same heat budget the driven half-coil can carry only about √2 times the current, so ampere-turns (and torque) land near 1/√2 ≈ 0.71 of the bipolar figure. This is the old, cheap way, and it is largely obsolete now that integrated H-bridge driver chips are a couple of dollars.

The useful trick: a **6-wire unipolar** motor can be driven **bipolar** by ignoring the center taps and using only the full windings, which gives you the full bipolar torque. An **8-wire** motor is the most flexible: you can wire the half-coils in series (high inductance, more low-speed torque, lower current), wire them in parallel (low inductance, better high-speed torque, higher current), or run unipolar. For a fast axis, parallel; for a slow high-torque axis, series.

| Wiring | Wires | Driver needed | Relative torque | Notes |
|---|---|---|---|---|
| Bipolar | 4 | H-bridge (A4988/TMC) | 100% (reference) | Modern default; can't be rewired |
| Unipolar (driven unipolar) | 5, 6 | Simple switches | ~70% | Cheap, legacy, lower torque |
| Unipolar wired bipolar | 6 | H-bridge | 100% | Ignore center taps; full torque |
| 8-wire (series) | 8 | H-bridge | 100%, high inductance | Best low-speed torque, lower current |
| 8-wire (parallel) | 8 | H-bridge | 100%, low inductance | Best high-speed torque, higher current |

> **Rule of thumb**: buy 4-wire bipolar unless you have a specific need for the flexibility of 8-wire. Never spec a unipolar-only driver in a new design; H-bridge chips have made unipolar drive a relic.

## The torque-speed curve and the four torques

This is the section that prevents the most field failures. A stepper does not have "a torque." It has a torque that depends heavily on speed, and four distinct torque numbers that mean different things.

### The four torques

- **Holding torque**: energized, standing still, rated current. The maximum static torque before the rotor is forced off its step. The biggest, most-quoted number.
- **Detent torque** — de-energized, permanent magnet only. 5–10% of holding. The cogging you feel by hand. - **Pull-in torque** — the maximum load torque against which the motor can *start, stop, and reverse* without losing steps, at a given step rate, from a standstill (no acceleration ramp). This is the conservative, no-ramp limit. - **Pull-out torque** — the maximum load torque the motor can carry while *running* at a given speed (already up to speed). This is the curve you size against, and it is always higher than pull-in at the same speed. The region between the pull-in and pull-out curves is the **slew range**: you can run there, but you cannot instantly start or stop — you must ramp (accelerate and decelerate) into and out of it. Every stepper controller worth using ramps; instant start/stop into the slew range is how you lose steps. ### Why torque falls with speed A stepper winding is an inductor. To make torque you need current in it, and current in an inductor cannot change instantly: ``` For a winding: V = I·R + L·(dI/dt) At low speed, you have time for current to reach its full value each step, so torque ≈ holding torque. As step rate rises, each step gets shorter. There is less time for current to build before the driver switches to the next step. Average current — and therefore torque — falls. Above the corner speed it falls steeply. ``` The fix is **voltage**. The rate current rises is `dI/dt = (V − I·R)/L`. More applied voltage forces current into the inductance faster, extending the speed at which torque holds up. This is why steppers are driven at **24 V, 36 V, or 48 V** from a chopping driver even though the motor's rated voltage (rated current × winding resistance) might be only 2–3 V. The driver chops the high bus voltage to limit average current to the rated value at low speed, but the high bus is available to ram current in fast at high speed. 
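
To put rough numbers on the inductor argument, here is a minimal Python sketch of how long a winding takes to charge to rated current at different bus voltages. It is an illustration under stated assumptions, not a motor model: the winding values (1.5 Ω, 2.8 mH, 1.7 A) are hypothetical NEMA 17-class figures rather than datasheet values, the helper name `rise_time_s` is made up for this example, and the calculation ignores back-EMF, chopping, and the fact that a bipolar drive has to reverse the current rather than start from zero.

```python
import math

def rise_time_s(v_bus, r_ohm, l_henry, i_target):
    """Time for an RL winding to charge from 0 A to i_target at constant v_bus.

    Solves I(t) = (V/R) * (1 - exp(-t*R/L)) for t. Back-EMF and chopping are
    ignored, so real rise times at speed are longer than this estimate.
    """
    i_final = v_bus / r_ohm          # current the bus would settle at, unchopped
    if i_target >= i_final:
        return math.inf              # this bus voltage can never reach i_target
    return -(l_henry / r_ohm) * math.log(1.0 - i_target / i_final)

# Hypothetical winding values (assumed for illustration only).
R_OHM, L_HENRY, I_RATED = 1.5, 2.8e-3, 1.7
STEP_RATE = 2000.0                   # full steps/s, i.e. 600 RPM on a 1.8 deg motor
step_period = 1.0 / STEP_RATE

for v_bus in (12.0, 24.0, 48.0):
    t = rise_time_s(v_bus, R_OHM, L_HENRY, I_RATED)
    print(f"{v_bus:4.0f} V: rise {t * 1e6:6.0f} us, "
          f"{t / step_period:6.1%} of a {step_period * 1e6:.0f} us step")
```

With these assumed values the rise time roughly halves each time the bus voltage doubles, which is the whole case for a high-voltage bus: at 12 V the current spends most of each step still building, while at 48 V it gets there with most of the step to spare.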
```
Approximate corner (knee) frequency where torque starts to roll off:

  f_corner ≈ V_bus / (2π · L · I_rated)
  [electrical cycles/s; multiply by 4 for full steps/s; order-of-magnitude only]

Higher V_bus              -> higher corner speed -> torque holds to higher RPM
Lower L (parallel 8-wire) -> higher corner speed
```

> **Rule of thumb**: a stepper's high-speed torque is set by your *bus voltage and winding inductance*, not by the holding-torque number. To go faster, raise the bus voltage (within the driver's and motor's limits) and pick a low-inductance motor. Doubling holding torque does little for top-speed torque if the inductance is high.

A practical consequence: a NEMA 17 with 0.55 N·m holding torque might deliver only **0.15–0.20 N·m at 600 RPM (2,000 steps/s)**. If you size on the 0.55 N·m figure and your load needs 0.3 N·m at that speed, the machine works on the bench at low speed and stalls in production at speed. Always pull the torque-speed curve from the datasheet and read the torque *at your operating point*.

## Resonance, missed steps, and how to avoid them

A stepper is a mass (rotor + load inertia) on a magnetic spring (the detent stiffness). Like any spring-mass system it has a natural frequency, and if you drive it at that frequency, the oscillation amplifies until the rotor swings far enough off its commanded step that it loses synchronism and stalls. This is **mid-band resonance**, and it is the classic stepper failure that looks like a haunting: the motor runs fine slow, runs fine fast, and stalls or screams at one particular speed in between.

### Where resonance lives

The fundamental resonance for an unloaded or lightly loaded hybrid stepper is often in the **range of roughly 100–250 full-steps/s**, which for a 1.8° motor is about **0.5–1.25 rev/s (30–75 RPM)**. There are harmonics higher up. The exact frequency depends on rotor inertia, total load inertia, and detent stiffness, so adding load inertia *lowers* the resonant frequency, which is sometimes a useful tuning knob.
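
The step-rate-to-RPM conversion above is easy to get wrong by a factor of 60 or 200, so here is a small Python sketch that does the arithmetic and flags cruise speeds that sit inside the quoted band. The function names are made up for this example, and the 100–250 full-steps/s band is simply the rough range stated above; a real machine's band has to be measured, since it shifts with load inertia.

```python
def steps_per_s_to_rpm(step_rate, full_steps_per_rev=200):
    """Convert a full-step rate (steps/s) to shaft speed in RPM."""
    return step_rate / full_steps_per_rev * 60.0

def rpm_to_steps_per_s(rpm, full_steps_per_rev=200):
    """Convert shaft speed in RPM to a full-step rate (steps/s)."""
    return rpm / 60.0 * full_steps_per_rev

def in_resonant_band(rpm, band_steps_per_s=(100.0, 250.0), full_steps_per_rev=200):
    """True if a constant cruise speed falls inside the assumed resonant band."""
    low, high = band_steps_per_s
    return low <= rpm_to_steps_per_s(rpm, full_steps_per_rev) <= high

# Band edges for a 1.8 deg (200 steps/rev) motor, in RPM.
print(steps_per_s_to_rpm(100.0), steps_per_s_to_rpm(250.0))  # 30.0 75.0

# Hypothetical cruise speeds for an axis; only the 45 RPM one lands in the band.
for cruise_rpm in (20, 45, 120, 600):
    verdict = "ramp through, don't dwell" if in_resonant_band(cruise_rpm) else "ok"
    print(f"{cruise_rpm:4d} RPM -> {rpm_to_steps_per_s(cruise_rpm):7.1f} steps/s: {verdict}")
```

A check like this only tells you where to look. It is most useful for spotting a feed rate that happens to park the motor at a constant speed inside the band.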
### How to kill it - **Microstep.** This is the number-one fix. Full-step drive slams the rotor from detent to detent, exciting the resonance hard. Microstepping moves the rotor in small smooth increments, so the impulsive excitation that rings the spring-mass system is gone. Most modern drivers default to 1/16 or finer precisely for this reason. - **Don't dwell in the resonant band.** If a constant-speed move must cross the resonant speed, ramp through it quickly rather than running at it. - **Add damping.** Mechanical: a friction damper or an inertial (viscous) damper on the rear shaft. The load itself often provides enough damping; bare motors on a bench are the worst case. - **Add or change inertia.** Coupling more inertia shifts the resonance and reduces its amplitude. - **Use a smart driver.** Trinamic chips actively damp mid-band resonance in their chopper modes; StealthChop in particular is much smoother through the resonant band than a fixed-off-time driver. ### The other ways steppers lose steps Resonance is one cause. The others: - **Torque overload.** Load torque exceeds pull-out torque at the current speed. The rotor falls behind, slips a pole, and the count is now wrong. - **Too-aggressive acceleration.** The torque needed to accelerate the inertia (`τ = J·α`) plus the load torque exceeds available torque during the ramp. Gentler accel ramps fix it. - **Insufficient bus voltage at speed.** Covered above — torque rolls off and the load wins. - **Undersized driver current.** If the driver current limit is set below what the motor needs, you throw away torque you paid for. (But over-setting it cooks the motor.) > **Rule of thumb**: when a stepper machine "loses position randomly," check in this order: (1) is it stalling at one specific speed? — resonance, microstep harder; (2) does it fail on hard accelerations? — soften the ramp or raise voltage; (3) does it fail at high speed only? 
That points to the torque-speed limit: raise bus voltage; (4) is the driver current set correctly? Closed-loop steppers (covered in their own section below) eliminate the silent part of all of these.
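
The torque-overload and too-aggressive-acceleration checks above reduce to one inequality: available pull-out torque must exceed `J·α` plus load torque, with margin. The following Python sketch works that through. The inertia, load, and ramp numbers are hypothetical, the function names are made up for this example, and using a single pull-out figure (the value at top speed) for the whole ramp is a conservative simplification; the 1.5× margin is the low end of the 1.5–2× range suggested in the key takeaways.

```python
import math

def required_torque_nm(j_total_kgm2, accel_rad_s2, load_torque_nm):
    """Torque needed during a ramp: inertia term J*alpha plus steady load torque."""
    return j_total_kgm2 * accel_rad_s2 + load_torque_nm

def ramp_is_safe(pull_out_nm, j_total_kgm2, accel_rad_s2, load_torque_nm, margin=1.5):
    """True if pull-out torque covers the ramp requirement with the given margin."""
    needed = required_torque_nm(j_total_kgm2, accel_rad_s2, load_torque_nm)
    return pull_out_nm >= margin * needed

def max_accel_rad_s2(pull_out_nm, j_total_kgm2, load_torque_nm, margin=1.5):
    """Largest acceleration that still leaves the requested torque margin."""
    return max(0.0, (pull_out_nm / margin - load_torque_nm) / j_total_kgm2)

# Hypothetical axis (assumed values, not from a datasheet).
J_TOTAL = 1.0e-5                        # kg*m^2, rotor plus reflected load inertia
LOAD = 0.05                             # N*m, friction and process load
PULL_OUT = 0.18                         # N*m, read off the curve at top speed
top_speed = 600.0 * 2.0 * math.pi / 60.0   # 600 RPM in rad/s
accel = top_speed / 0.050               # reach top speed in 50 ms

print(f"needed during ramp: {required_torque_nm(J_TOTAL, accel, LOAD):.3f} N*m")
print(f"ramp safe at 1.5x margin: {ramp_is_safe(PULL_OUT, J_TOTAL, accel, LOAD)}")
print(f"max accel at 1.5x margin: {max_accel_rad_s2(PULL_OUT, J_TOTAL, LOAD):.0f} rad/s^2")
```

If `ramp_is_safe` fails, the options are the same ones listed above: soften the ramp, raise bus voltage to lift the pull-out curve, or pick a bigger motor.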
## Microstepping: resolution vs usable torque Microstepping is the most over-sold spec in the stepper world, so let's be precise about what it does and does not give you. ### What microstepping is The driver feeds the two phases currents weighted as sine and cosine of an electrical angle. As that angle advances in small increments, the net field vector rotates smoothly, and the rotor follows to intermediate equilibrium points between the full-step detents: ``` Phase A current: I_A = I_peak · cos(θ_e) Phase B current: I_B = I_peak · sin(θ_e) θ_e advances by one microstep each step pulse. 1/16 microstepping -> θ_e advances 360/64 electrical degrees per microstep (one full electrical cycle = 4 full steps = 64 microsteps at 1/16) ``` ### What microstepping actually buys you **Smoothness and resonance reduction — real and valuable.** Smooth current means smooth torque means quiet, low-vibration motion that doesn't excite resonance. This is the genuine reason to microstep, and it is why printers run 1/16 or finer. **Effective resolution for *motion*, not for *positioning accuracy*.** You can command the shaft in fine increments, which matters for things like extrusion smoothness, but the shaft will not reliably *stop* at all 51,200 of those positions. ### The microstep-accuracy myth Here is the part the datasheet hides. The torque holding the rotor at a microstep is the *incremental* torque, and because the field is a sine, the torque-per-microstep follows a sine too: ``` Restoring torque at electrical angle θ_e from a step: τ(θ_e) = τ_holding · sin(θ_e_error) Near a full-step detent the sine is steep -> stiff, accurate. Between detents (the finest microsteps) the incremental torque per microstep approaches zero -> the rotor can't reliably reach or hold those positions against friction. ``` Two consequences: 1. **Incremental torque per microstep is tiny at fine ratios.** Going from 1/128 to 1/256 roughly halves the already-small torque holding each new microstep. 
If static friction in your mechanism exceeds that incremental torque — and it usually does well before 1/64 — the rotor simply doesn't move on the next microstep. It moves in a "stiction step" only when enough microsteps have accumulated. 2. **Detent torque and manufacturing tolerances dominate accuracy.** A 1.8° motor's *step accuracy* is typically specified at **±5% of a full step, non-cumulative** — about **±0.09°**. Microstepping does not improve this; the magnetic and mechanical imperfections that cause it are unaffected by how finely you command. So 1/256 microstepping on a ±5% motor gives you 51,200 *commands* but the same ±0.09° *truth*. > **Rule of thumb**: microstep for smoothness and quiet, not for accuracy. Above about 1/16 you gain almost no real positioning benefit and only smoother motion. If you need true sub-step accuracy, you need an encoder (closed-loop stepper or servo) or mechanical reduction (a 5:1 gearbox or a fine-pitch leadscrew multiplies your *real* resolution far more honestly than microstepping does). A clean way to get genuine resolution: gear it down. A 1.8° motor through a 5:1 planetary gearbox gives 1,000 full steps/rev of *real, torque-backed* resolution (0.36°/step), and multiplies torque 5× too. That beats trusting 1/8 microstepping on the bare motor for any application where the position has to be right under load. See the [linear motion guide](/posts/linear-motion-systems-ultimate-guide/) for how leadscrew pitch turns step angle into real linear resolution. ## Stepper drivers: A4988/DRV8825 vs Trinamic TMC The driver is half the system. The same motor on a $5 A4988 and a $12 TMC5160 behaves like two different machines — one buzzy and rough, one silent and smooth. Here is what separates them. ### What every stepper driver does: current chopping A stepper is current-controlled. The driver's core job is to regulate the winding current to a setpoint regardless of bus voltage. 
It does this by **chopping**: it turns the H-bridge on to ramp current up, senses the current (via a sense resistor), and when it hits the limit, turns off (or recirculates) to let current decay, then on again — thousands of times per second. This is how a 2–3 V motor runs safely off a 36 V bus: the chopper holds average current at the rated value while the high voltage is available to slew current fast. The setpoint is set by `V_ref` (a trimmer or a register) and the sense resistor. Getting this right is the single most important driver adjustment — too low throws away torque, too high overheats the motor. ``` A4988 current limit: I_limit = V_ref / (8 · R_sense) DRV8825 current limit: I_limit = V_ref / (5 · R_sense) (check the specific board's R_sense; clones vary) ``` ### The classic chips: A4988 and DRV8825 **Allegro A4988** — the workhorse of a decade of RepRap printers. Up to ~35 V, ~1–2 A/phase with a heatsink, microstepping to 1/16. Cheap, robust, and *loud*: it uses fixed-off-time current decay that produces audible mid-band whine and rougher motion. Fine for non-critical, cost-sensitive axes. **TI DRV8825** — the A4988's bigger sibling. Up to 45 V, ~1.5–2.2 A/phase, microstepping to 1/32. Higher voltage and current ceiling than the A4988, and a different decay scheme. Still fixed-decay and still buzzy by modern standards, but the higher voltage rating makes it the better choice for faster axes. Both are pin-compatible "StepStick" modules and both are step/dir only — no configuration, no telemetry. ### The modern chips: Trinamic TMC Trinamic (now part of ADI) changed the game by putting intelligence in the driver. The two you'll meet: - **TMC2209** — up to ~28 V (45 V abs max), ~2 A RMS/phase, 1/256 microstepping (with on-the-fly microstep interpolation from coarser step input). 
Adds **StealthChop2** (near-silent voltage-mode PWM chopper), **SpreadCycle** (high-torque current-mode chopper), **StallGuard4** (sensorless load/stall detection), and **CoolStep** (automatic current reduction with load). Configured over **UART** or via pins. This is the default upgrade for any printer or small machine that wants to be quiet. - **TMC5160** — up to ~60 V (external MOSFETs let it drive several amps, big NEMA 23/34), SPI configuration, an onboard **motion controller** (ramp generator: feed it a target position and it generates the accel/run/decel profile internally), plus StealthChop/SpreadCycle/StallGuard/CoolStep. This is the serious one for higher-power, higher-speed machines. ### StealthChop vs SpreadCycle These are the two chopper modes and the choice matters: - **StealthChop** is a voltage-mode PWM chopper. It is **near-silent** and beautifully smooth at low speed, which is why TMC-equipped printers are so quiet. But it regulates current more softly, so it **loses torque at higher speeds and accelerations** and can stall a heavily loaded axis. - **SpreadCycle** is a cycle-by-cycle current-mode chopper. It is **louder** (a faint hiss/whine) but holds current — and therefore torque — accurately at speed and through hard accelerations. The standard configuration on a good machine: run **StealthChop below a velocity threshold** (quiet, smooth, when torque demand is low) and **automatically switch to SpreadCycle above it** (when you need torque at speed). The TMC chips do this handoff in hardware once you set the threshold register. | Feature | A4988 | DRV8825 | TMC2209 | TMC5160 | |---|---|---|---|---| | Max bus voltage | ~35 V | ~45 V | ~28 V (45 V abs) | ~60 V (ext. FETs) | | Current/phase (RMS) | ~1.2 A | ~1.6 A | ~1.4–2.0 A | ~3 A+ (FET-dependent) | | Microstepping | 1/16 | 1/32 | 1/256 (interp.) | 1/256 (interp.) 
| | Interface | Step/dir | Step/dir | Step/dir + UART | Step/dir + SPI | | Quiet (StealthChop) | No | No | Yes | Yes | | Torque mode (SpreadCycle) | n/a | n/a | Yes | Yes | | Sensorless stall (StallGuard) | No | No | Yes | Yes | | Onboard motion controller | No | No | No | Yes (ramp gen.) | | Typical use | Cheap printer axes | Faster printer/CNC | Quiet printers, small CNC | NEMA 23/34, fast machines | > **Rule of thumb**: for any new small machine, default to a TMC2209 (or TMC5160 if you're above ~28 V or driving NEMA 23/34). The silence, sensorless homing via StallGuard, and torque-at-speed of SpreadCycle are worth the few extra dollars. Reserve A4988/DRV8825 for cost-critical builds where buzz doesn't matter. ### Step/dir and the move to UART/SPI The lowest-common-denominator interface is **step/direction**: one pin pulses once per microstep, another sets direction. Simple, universal, and dumb — the driver knows nothing about velocity profiles; the host MCU must generate every pulse. **UART (TMC2209)** and **SPI (TMC5160)** let you configure current, microstep ratio, chopper mode, and stall thresholds at runtime, read back diagnostics (load, temperature, stall), and on the TMC5160 hand off the whole motion profile to the chip's ramp generator. For real-time motion-control context — how these pulses get scheduled deterministically — see the [real-time control systems guide](/posts/real-time-control-systems-ultimate-guide/). ## Closed-loop steppers: bolt on an encoder The entire weakness of a stepper is the open loop: it can fail silently. Add a rotor [encoder](/posts/encoders-ultimate-guide/) and a controller that uses it, and the failure mode goes away. This is the **closed-loop stepper** (sometimes "servo-stepper" or "step-servo"), and it is one of the best-value actuators in motion control. ### How it works You mount an encoder (usually 1,000–4,000 line, magnetic or optical) on the rotor's rear shaft. 
The controller now knows actual position, not just commanded step count. Two things change:

1. **It cannot lose steps silently.** If the rotor falls behind the commanded position, the controller increases current to catch up, and if it can't, it raises a *following-error fault* — you get told. No silent position loss.
2. **It only uses the current it needs.** A classic stepper burns full rated current to hold position even with no load — that's why they run hot. A closed-loop stepper holds with just enough current to maintain position, so it **runs much cooler and more efficiently** at rest and light load.

Architecturally this is a servo with a many-pole motor: a current/torque inner loop, a velocity loop, and a position loop, exactly as in the [motor controllers guide](/posts/motor-controllers-foc-ultimate-guide/) — just with a stepper's 50-pole-pair geometry (200 full steps per revolution) instead of a BLDC's handful of pole pairs. Many closed-loop stepper drives now run full **field-oriented control (FOC)** on the stepper, which makes them smooth and quiet like a servo while keeping the stepper's huge low-speed torque.

### Where it sits between open-loop stepper and servo

- It keeps the stepper's **high holding torque and high pole count** (great low-speed torque, fine native resolution).
- It gains the servo's **closed-loop integrity** (no silent step loss, fault on overload, cooler running).
- It still has the stepper's **torque-speed roll-off** — closed-loop doesn't create torque the motor can't make; it just uses what's there honestly and tells you when it runs out.

### Real products

- **Teknic ClearPath-SC** — an integrated closed-loop "step-servo" (motor + drive + encoder in one NEMA 23/34 housing) with serial control, torque/velocity/position modes, and genuine servo behavior at a price well under a separate industrial servo system. The flagship of the category.
- **Leadshine** (e.g.
the iSV/CS-D series and integrated closed-loop steppers) — popular, affordable closed-loop steppers and drives widely used in CNC retrofits. - **Oriental Motor AlphaStep (AZ/AR series)** — closed-loop steppers with a built-in mechanical-absolute encoder (no homing needed, no battery), known for reliability in industrial automation. > **Rule of thumb**: if your axis matters — it carries a real load, runs near its limits, or a lost step means scrap or a crash — spend the extra ~50–100% over an open-loop stepper for a closed-loop one. You buy out the single worst stepper failure mode and get a cooler, quieter motor. It is often a better value than jumping all the way to a separate servo system. ## Steppers vs servos vs BLDC: an honest decision guide These three overlap, and vendors muddy the lines (a closed-loop stepper *is* a servo; a "servo" can be built on a BLDC). Cutting through it: - A **stepper** is a high-pole-count motor optimized for discrete positioning and high low-speed torque, usually open-loop. - A **servo** (see the [servo guide](/posts/servo-motors-ultimate-guide/)) is a *control architecture* — any motor plus feedback plus a closed loop — usually built on a low-pole-count brushless PMSM for high-speed power density. - A **BLDC/PMSM** (see the [BLDC guide](/posts/brushless-dc-motors-bldc-ultimate-guide/)) is the bare brushless motor; with FOC and feedback it becomes a servo, with six-step commutation it's a drone/fan motor. 
The honest trade-offs: | Attribute | Open-loop stepper | Closed-loop stepper | Servo (BLDC/PMSM + feedback) | |---|---|---|---| | Feedback | None (open-loop) | Encoder, internal | Encoder/resolver, full loop | | Low-speed holding torque | Excellent | Excellent | Good (loop holds it) | | High-speed power density | Poor (torque rolls off) | Poor–fair | Excellent | | Efficiency | Poor (full current at rest) | Good | Excellent | | Heat at standstill | High | Low | Very low | | Silent failure (lost steps) | Yes — the big risk | No (faults out) | No | | Tuning required | None | Minimal (often auto) | Real tuning needed | | Cost per axis | Lowest | Low–medium | Medium–high | | Best speed range | < ~600–1,000 RPM | < ~1,500 RPM | up to 3,000–6,000+ RPM | | Typical power sweet spot | < ~200 W | < ~500 W | 100 W – many kW | | Where it wins | Cheap predictable positioning | Reliable positioning, no tuning | Dynamics, speed, efficiency | ### The decision in plain terms - **Predictable load, low speed, cost-sensitive, you can afford a 2× torque margin** → open-loop stepper. 3D printers, lab stages, small CNC. - **Same low-speed regime but the load varies, a lost step is costly, or you want it cooler and quieter without tuning** → closed-loop stepper. CNC production, automated equipment. - **You need high speed, high efficiency, hard dynamics, or high power** → servo on a BLDC/PMSM. Robot joints, spindles, fast pick-and-place, vehicle drives. > **Rule of thumb**: the stepper-to-servo crossover sits at roughly **a few hundred RPM and a few hundred watts**. Below it, a stepper is cheaper and simpler and its weaknesses don't bite. Above it, the stepper's torque roll-off and standstill heat make a servo the right call. When you're near the line, a closed-loop stepper is the hedge. ## Sizing and selection Sizing a stepper correctly is mostly about respecting the torque-speed curve and the thermal limit. A repeatable procedure: ### 1. 
Compute the required torque at the worst operating point

Total motor torque must cover the load torque plus the torque to accelerate the inertia:

```
τ_required       = τ_load + τ_accel
τ_accel          = J_total · α                  (J in kg·m², α in rad/s²)
J_total          = J_rotor + J_reflected_load
J_reflected_load = J_load / N²                  (N = gear ratio, if geared)
```

The worst case is usually peak acceleration at your top commanded speed. Compute `τ_required` there.

### 2. Read torque off the curve, not the headline

Find your **operating speed in steps/s or RPM** and read the **pull-out torque at that speed** from the datasheet curve. Do **not** use holding torque. Apply margin:

> **Rule of thumb**: pull-out torque at your top operating speed should exceed `τ_required` by about **1.5–2×**. Open-loop steppers have no way to recover from a momentary overload, so the margin is your only protection against a stall.

### 3. Set current and voltage

- **Current**: set the driver limit to the motor's **rated current per phase** (RMS or peak as the driver expects — read carefully; A4988/DRV8825 `V_ref` math sets peak). This sets torque and heat.
- **Voltage**: choose a bus voltage to push the corner speed above your operating speed. A common heuristic is **bus voltage ≈ 20–25× √(inductance in mH)** as a starting point, or simply: more voltage = faster torque, bounded by the driver's max and the motor's insulation/heat. 24 V is the printer default; 36–48 V for fast CNC.

```
Motor "rated voltage" = I_rated × R_phase   (often only 2–4 V — ignore for bus sizing)
Bus voltage is set by SPEED need, current limiting is the driver's job.
```

### 4. Check inductance and pick the winding

Lower inductance (mH) = higher corner speed = better high-speed torque, but needs more current for the same torque. For a fast axis, prefer a low-inductance motor or an 8-wire motor wired in parallel. For a slow high-torque axis, higher inductance (or series wiring) is fine and lets you use less current.

### 5.
Verify thermal rise A stepper running at rated current dissipates `I²R` per phase continuously, even holding. Surface temperatures of **60–80 °C** are normal; the limit is the insulation class (often 130 °C / Class B) and what your mounting and nearby plastics tolerate. If it's too hot: reduce holding current (TMC CoolStep or a hold-current reduction), improve the heat path into the mount (the faceplate is the main thermal path), or move to a closed-loop stepper that only draws what it needs. ### 6. Decide gearing If you need more torque or finer *real* resolution, a gearbox multiplies both and reflects load inertia down by `N²` (helping the inertia match and resonance). A planetary gearbox on a NEMA 17 is often cheaper and more effective than jumping to a NEMA 23 — see the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/). ## Applications Where steppers earn their keep, and what to watch in each: ### 3D printers (FDM) The canonical stepper machine. NEMA 17 on every axis and the extruder, typically 0.4–0.6 N·m motors, 24 V bus, TMC2209 drivers for silence and sensorless homing (StallGuard replaces endstop switches on some designs). Why steppers: cheap, predictable low-speed loads, open-loop positioning is good enough because the loads are light and well-characterized. The move to higher print speeds (input shaping, 200+ mm/s) is pushing some machines toward higher-voltage drivers (TMC5160) and even closed-loop extruders to prevent the under-extrusion that a skipped step causes. ### CNC routers and mills (small/desktop) NEMA 23 (often 1.9–3.0 N·m) is the workhorse, NEMA 34 for bigger gantries, on 36–48 V with DRV8825 or TMC5160 drivers — or increasingly closed-loop steppers (Leadshine) because a lost step in CNC means a ruined part and the cut load is variable and high. This is exactly the regime where closed-loop pays for itself: real loads, costly mistakes, still below the speed where you'd want a true servo. 
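As a worked illustration, the sizing procedure from the "Sizing and selection" section can be applied to an axis of this class in a few lines. This is a minimal sketch, not a vendor tool: the function names are hypothetical, and every number in the example (rotor inertia, load inertia, acceleration, the pull-out torque read from a curve, winding resistance) is an assumed placeholder you would replace with values from your own datasheet and mechanism. The thermal figure assumes both phases carry rated current at once, which is the worst case.

```python
def required_torque(tau_load, j_rotor, j_load, alpha, gear_ratio=1.0):
    """tau_required = tau_load + J_total * alpha.

    tau_load is assumed to be already referred to the motor shaft (N·m).
    Inertias are in kg·m², alpha in rad/s². Load inertia is reflected
    through the gear ratio as J_load / N².
    """
    j_total = j_rotor + j_load / gear_ratio**2
    return tau_load + j_total * alpha


def passes_margin(tau_required, pull_out_at_speed, margin=1.5):
    """True if pull-out torque at the operating speed covers the margin."""
    return pull_out_at_speed >= margin * tau_required


def copper_loss_w(i_rated, r_phase, phases_on=2):
    """Standstill I²R dissipation in watts with `phases_on` phases at rated current."""
    return phases_on * i_rated**2 * r_phase


# Hypothetical direct-drive axis (all values are placeholders).
tau_req = required_torque(tau_load=0.3, j_rotor=4.8e-5, j_load=2.0e-4, alpha=500.0)
pull_out = 0.9  # N·m, assumed reading from the torque-speed curve at top speed

print(f"required torque: {tau_req:.3f} N·m")
print(f"margin: {pull_out / tau_req:.2f}x, passes 1.5x: {passes_margin(tau_req, pull_out)}")
print(f"standstill copper loss: {copper_loss_w(2.8, 0.9):.1f} W")
```

The point of keeping `pull_out_at_speed` as an explicit input is that it forces a read of the torque-speed curve; holding torque never appears in the check.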
### Lab automation and life sciences Pipetting robots, syringe pumps, sample-handling stages, microscope stages. NEMA 11 and NEMA 17, low speed, light predictable loads, and a premium on smooth, quiet, vibration-free motion (microstepping shines here) and on holding position precisely for long dwells (where the stepper's tuning-free holding torque is ideal). Often microstepped finely for smoothness even though the real accuracy comes from the leadscrew/gear reduction. ### Camera, optics, and instrumentation Focus and zoom drives, filter wheels, telescope mounts, beam-steering stages. NEMA 11/14/17 micro-steppers, low speed, and again the win is smooth holding and fine commanded resolution. Telescope mounts in particular exploit the stepper's ability to creep at extremely low, perfectly steady rates (sidereal tracking) that a servo would dither around. Backlash and detent torque are the enemies of pointing accuracy here, so high microstepping plus mechanical reduction (worm gears) is standard. ### Conveyors, indexers, and general automation Repetitive index-and-hold motion is a stepper's home turf: a feeder advancing a fixed pitch, a rotary index table stopping at stations. NEMA 23/34, often closed-loop in industrial settings (Oriental Motor AlphaStep with absolute encoder so there's no homing on power-up). Predictable cycle, hard holding, modest speed — the stepper's strengths line up perfectly. > **Rule of thumb across all of these**: steppers thrive where the duty cycle is *position-and-hold at low speed with a predictable load*. The moment an application demands sustained high speed, high efficiency, or aggressive dynamics, it has left stepper territory and you should be looking at a servo. ## Frequently asked questions **How many steps per revolution does a stepper motor have?** A standard hybrid stepper has 200 full steps per revolution, which is 1.8° per step, coming from 50 rotor teeth × 4 steps per electrical cycle. 
A 0.9° motor has 400 full steps/rev (100 rotor teeth). Microstepping multiplies the *commanded* increments — 1/16 gives 3,200 microsteps/rev — but does not add 1/16 worth of real positioning accuracy. **Does microstepping increase a stepper's accuracy?** No. Microstepping increases smoothness and reduces resonance and vibration, and it lets you command finer increments. But positional accuracy is limited by the motor's mechanical and magnetic tolerances — typically ±5% of a full step (about ±0.09° on a 1.8° motor), non-cumulative — and that figure does not improve with finer microstepping. The incremental torque per microstep also shrinks toward zero at fine ratios, so the rotor can't reliably stop at every microstep. For real resolution, gear it down or add an encoder. **Why does a stepper lose torque at high speed?** The windings are inductors, and current can't rise instantly. As step rate climbs, each step gets shorter and there's less time for current — and therefore torque — to build before the next step. Above the corner (knee) speed, torque falls steeply. The fix is a higher bus voltage, which forces current into the inductance faster, and/or a lower-inductance motor. **What voltage should I run my stepper at?** Run the *driver* from a bus voltage well above the motor's nominal "rated voltage." The rated voltage (rated current × winding resistance) is often only 2–4 V and is not what you supply — the chopping driver limits current regardless. Use 24 V for typical NEMA 17 printer axes, 36–48 V for faster or larger NEMA 23/34 machines. Higher voltage extends the speed at which torque holds up, bounded by the driver's and motor's ratings. **Why does my stepper get so hot?** An open-loop stepper draws its full rated current to hold position even when standing still with no load, dissipating I²R as heat continuously. Surface temperatures of 60–80 °C are normal and usually fine (insulation is typically rated to 130 °C). 
If it's too hot, reduce the holding current (many drivers offer a hold-current reduction, and TMC's CoolStep lowers current automatically under light load), improve heat-sinking through the faceplate mount, or switch to a closed-loop stepper that only draws the current it needs. **What's the difference between holding torque and pull-out torque?** Holding torque is the static torque the energized motor resists with when standing still at rated current — the big datasheet number. Pull-out torque is the torque the motor can carry while running at a given speed, and it falls as speed rises. You must size your machine on pull-out torque at your operating speed, not on holding torque, or it will work slow and stall fast. **A4988 vs DRV8825 vs TMC2209 — which driver should I use?** A4988 and DRV8825 are cheap step/dir drivers; the DRV8825 takes higher voltage (~45 V vs ~35 V) and current and is the better of the two for speed, but both are audibly buzzy. The TMC2209 adds near-silent StealthChop, torque-holding SpreadCycle, sensorless stall detection (StallGuard), and UART configuration for a few dollars more — it's the default upgrade for any quiet small machine. For above ~28 V or NEMA 23/34, step up to the TMC5160. **What is StealthChop vs SpreadCycle?** They are two chopper modes in Trinamic drivers. StealthChop is a voltage-mode PWM chopper that is near-silent and smooth at low speed but loses torque at high speed and hard accelerations. SpreadCycle is a current-mode chopper that's louder but holds torque accurately at speed. The usual setup runs StealthChop below a velocity threshold and switches to SpreadCycle above it, automatically. **What is a closed-loop stepper and is it worth it?** A closed-loop stepper adds a rotor encoder and a controller that closes a position/current loop — turning the stepper into a coarse-pole servo. It can't lose steps silently (it faults on a following error instead), and it only draws the current it needs, so it runs much cooler. 
It's worth it whenever a lost step is costly or the load runs near the motor's limits; products include Teknic ClearPath-SC, Leadshine, and Oriental Motor AlphaStep. **When should I use a servo instead of a stepper?** When you need sustained high speed (above roughly a few hundred to a thousand RPM), high efficiency, aggressive dynamics, or higher power (above a few hundred watts). Below that, a stepper is cheaper, needs no tuning, and its torque roll-off and standstill heat don't hurt you. Near the crossover, a closed-loop stepper is a good middle ground. **Why does my stepper stall or scream at one particular speed?** That's mid-band resonance: the rotor-plus-load mass on the magnetic detent spring has a natural frequency (often around 0.5–1.25 rev/s for a lightly loaded 1.8° motor), and driving at it amplifies oscillation until the motor loses synchronism. Fix it by microstepping (the biggest help), ramping quickly through the resonant speed instead of dwelling there, adding mechanical damping or load inertia, or using a TMC driver that actively damps resonance. **Should I buy a 4-wire, 6-wire, or 8-wire stepper?** Buy 4-wire bipolar for most designs — it's the modern default and works with every H-bridge driver at full torque. An 8-wire motor gives flexibility: series the coils for the best low-speed torque (higher inductance, lower current) or parallel them for the best high-speed torque (lower inductance, higher current). A 6-wire unipolar motor can be driven bipolar by ignoring the center taps for full torque. Avoid unipolar-only drive in new designs. **Can a stepper do torque or force control?** Open-loop, only crudely — torque is set by the current limit, but you have no feedback on whether the rotor is actually producing it (it may have stalled). A closed-loop stepper with a current/torque loop can do real torque control, like a servo. If force control matters, use a closed-loop stepper or a servo, not a bare stepper. 
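Several of the figures quoted in the answers above are plain arithmetic, and it can help to see them computed. The following is a small sketch with hypothetical function names; the 1.5 A / 2.0 Ω winding is an assumed example, not a specific motor.

```python
def full_steps_per_rev(step_angle_deg):
    """Full steps per revolution for a given step angle in degrees."""
    return round(360.0 / step_angle_deg)


def microsteps_per_rev(step_angle_deg, divisor):
    """Commanded increments per revolution at a 1/divisor microstep ratio."""
    return full_steps_per_rev(step_angle_deg) * divisor


def accuracy_band_deg(step_angle_deg, tolerance=0.05):
    """Typical +/- accuracy band, taken as a fraction of one full step."""
    return step_angle_deg * tolerance


def rated_voltage(i_rated, r_phase):
    """Nameplate 'rated voltage' = rated current x winding resistance."""
    return i_rated * r_phase


step = 1.8
band = accuracy_band_deg(step)   # accuracy band in degrees
micro = step / 256               # one 1/256 microstep in degrees

print(full_steps_per_rev(step), microsteps_per_rev(step, 16), microsteps_per_rev(step, 256))
print(f"accuracy band: +/-{band:.3f} deg, about {band / micro:.1f} microsteps at 1/256")
print(f"rated voltage of an assumed 1.5 A, 2.0 ohm winding: {rated_voltage(1.5, 2.0):.1f} V")
```

The second line of output is the useful one: at 1/256 the motor's accuracy band spans roughly a dozen commanded microsteps in each direction, which is why fine microstepping buys smoothness rather than accuracy.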
## Changelog - **2026-06-19** — Initial publication. --- # Servo Motors: The Ultimate Guide URL: https://blog.robo2u.com/posts/servo-motors-ultimate-guide/ Published: 2026-06-18 Updated: 2026-06-20 Tags: servo-motors, servos, rc-servo, dynamixel, closed-loop-control, robotics-hardware, actuators, motion-control, pwm, guide Reading time: 34 min > A deep, engineer-grade guide to servo motors: RC vs industrial vs smart serial servos, PWM and closed-loop control, datasheet specs, cascaded PID, sizing math, failure modes, and a real-product comparison table. A servo motor is not a kind of motor. That sentence trips up more engineers than it should. A servo is a *control architecture*: a motor plus a feedback sensor plus a controller that closes a loop around position (and usually velocity and torque underneath that). Strip out the sensor and the loop and you have a plain motor running open-loop. Bolt them on and almost any motor — brushed DC, brushless, AC induction, even a stepper — becomes a servo. The word describes what the thing *does*, not what's inside it. That distinction matters because the term spans three wildly different product worlds. A $9 hobby servo from a model-aircraft shop and a $1,200 Kollmorgen AC servomotor with a 24-bit absolute encoder are both "servos," and an engineer who conflates them will either over-spend by 100x or under-spec a joint into early failure. The job of this guide is to give you the mental model to tell them apart, read their datasheets honestly, size them correctly, and not get burned by the failure modes that the marketing copy never mentions. > **The take**: The single most expensive mistake in servo selection isn't buying too little torque — it's ignoring *reflected inertia and RMS torque*. Most engineers size on stall torque and no-load speed, both of which are peak, transient, marketing-friendly numbers. 
The joint actually lives or dies on its continuous RMS torque versus the thermally limited rated torque, and on whether the load inertia is within roughly 5–10x the rotor inertia. Get the inertia match and the duty cycle right and a "weaker" servo will outlast and outperform a "stronger" one chosen on stall torque alone. Companion reading: [brushless DC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [gearboxes: harmonic and cycloidal](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), [motor controllers and FOC](/posts/motor-controllers-foc-ultimate-guide/), and [encoders](/posts/encoders-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What a servo motor actually is](#what-is) 3. [The three worlds: RC, industrial, and smart serial servos](#three-worlds) 4. [How an RC/hobby servo works](#rc-servo) 5. [Inside an industrial servo system](#industrial) 6. [Reading a servo datasheet](#datasheet) 7. [Smart serial servos for robotics](#smart-serial) 8. [Gearing and torque](#gearing) 9. [Control: cascaded loops, tuning, and limiting](#control) 10. [Sizing a servo for your joint](#sizing) 11. [Failure modes and thermal limits](#failure) 12. [Selection guide and comparison table](#selection) 13. [Practical wiring and power notes](#wiring) 14. [Frequently asked questions](#faq) ## Key takeaways - A servo = **motor + position sensor + closed-loop controller**. Remove any one and it is no longer a servo. The motor inside can be brushed DC, brushless (BLDC/PMSM), or AC. - Hobby/RC servos take a **1000–2000 µs pulse at ~50 Hz**; 1500 µs is neutral (center). The signal commands *position*, not speed. Modern industrial drives instead take ±10 V analog, step/direction, or EtherCAT/CANopen. - **Stall torque and no-load speed are peak numbers.** Size continuous motion on **rated (continuous) torque** and your **RMS torque** over the move profile, not on stall. 
- Keep **load-to-rotor inertia ratio** roughly in the **1:1 to 10:1** band (5:1 is a common sweet spot) for crisp, tunable response. Above ~10:1 you fight resonance and have to detune the loop. - Industrial servo control is a **cascade**: an inner current/torque loop (kHz), a velocity loop around it, and an outer position loop. You tune from the inside out. - Smart serial servos (Dynamixel X/P series) put the motor, gearbox, encoder, driver, and a microcontroller in one housing and talk **Protocol 2.0 over TTL or RS-485**, daisy-chained, each with an ID and baud (often 57,600 up to 4.5 Mbps). - **Digital RC servos** update the H-bridge at **300 Hz–1 kHz+** versus ~50 Hz for analog, giving tighter holding torque and faster response — at higher idle current and heat. - The **torque constant Kt** (N·m/A) and **back-EMF constant Ke** are two faces of the same number in SI units. Torque scales with current; speed scales with voltage. - Servos die from **I²t heating, stalls, gear stripping, brownout/under-voltage resets, and magnet demagnetization** from sustained over-current — usually thermal, rarely mechanical-first. - Run **separate logic and motor power rails with a common ground**, size for **inrush**, and add bulk capacitance near the drive. A shared 5 V rail browning out on a stall is the #1 cause of "my microcontroller randomly reboots." - Backlash, not torque, often sets joint accuracy. Plan for **0.1–0.5° backlash** in spur-gear RC servos; harmonic drives get you under **1 arc-min**. - Holding torque ≠ rated torque. A servo holding a static load still burns current and heat; size for the hold, not just the move. ## What a servo motor actually is A servo is a closed-loop motion device. You command a target — usually a position — and the system measures where it actually is, computes the error, and drives the motor to kill that error. That feedback loop is the whole point. 
Without it you have open-loop control: you command an effort and *hope* the output lands where you wanted. ### Open-loop vs closed-loop A brushed DC motor with a fixed voltage is open-loop. Load it down and it slows; the controller never knows or cares. A stepper driven by step pulses is also open-loop in its classic form — it *assumes* each pulse advanced one micro-step, and if it skips a step under load, your position is silently wrong forever. A servo refuses to be wrong silently. It watches the sensor. If the shaft is 2° short of target, it pushes harder. If it overshoots, it backs off or reverses. The error never gets to lie about itself. ### The three building blocks Every servo, from the $9 hobby unit to the $1,200 industrial one, is the same three parts: 1. **The motor (actuator).** Converts electrical power to mechanical torque. Brushed DC in cheap servos; brushless PMSM (permanent-magnet synchronous) in good ones; AC synchronous in big industrial units. 2. **The feedback sensor.** Measures actual output. A potentiometer in hobby servos; an incremental or absolute encoder, resolver, or magnetic (Hall/magnetoresistive) sensor in better ones. See the [encoders guide](/posts/encoders-ultimate-guide/) for the full taxonomy. 3. **The controller (drive).** Reads the command and the feedback, runs the control law (usually cascaded PID), and switches power to the motor through an H-bridge or three-phase inverter. > **Rule of thumb:** If a vendor sells you a "servo" but can't tell you what sensor closes the loop, you are buying a motor with optimistic marketing. ### Why not just use a stepper or a geared DC motor? Steppers are great open-loop for cost-sensitive, low-dynamics positioning (3D printers, small XY stages). But they lose steps under overload, run hot at hold, and waste current. Geared DC motors are cheap muscle with no idea where they are. 
Servos win when you need *accurate, repeatable position under varying load with good dynamics* — robot joints, gimbals, CNC axes, steering, throttle bodies. You pay for the sensor and the smarts, and you get a system that corrects itself. ## The three worlds: RC, industrial, and smart serial servos "Servo" covers three product categories that barely resemble each other. Pick the wrong world and nothing downstream works. ### World 1 — RC/hobby servos Self-contained boxes: motor, gear train, potentiometer, and a tiny control board, all in a plastic or metal case with a three-wire pigtail (power, ground, signal). You feed a PWM pulse, it moves to a position, typically over ~120–270° of travel. Cost: $5–$80. Examples: HiTec HS-422, Futaba S3003, Savöx SC-1258TG, and the digital high-torque units like the Savöx SB-2290SG. This world is for small robots, RC vehicles, animatronics, pan-tilt rigs, and prototypes. ### World 2 — Industrial servo systems A separate **servomotor** and **servo drive (amplifier)**, joined by a power cable and a feedback cable. The motor has a precision encoder or resolver; the drive does the closed-loop math, often with a fieldbus interface (EtherCAT, CANopen, PROFINET) back to a PLC or motion controller. Cost: hundreds to thousands of dollars per axis. Examples: Kollmorgen AKM/AKD, Yaskawa Sigma-7, Beckhoff AM8000, Mitsubishi MELSERVO, Bosch Rexroth. This world runs CNC machines, packaging lines, pick-and-place, and industrial robot arms. ### World 3 — Smart serial servos The newest category, built for robotics. Like an RC servo, everything is in one housing — but the controller is a real microcontroller, the feedback is a contactless magnetic encoder (often 12-bit, 4096 counts/rev), and you talk to it over a digital bus (TTL or RS-485) with a packet protocol. You can daisy-chain dozens on one bus, each addressable by ID, and read back position, velocity, current, temperature, and voltage. Cost: $25–$1,000. 
Examples: ROBOTIS Dynamixel X-series (XL330, XM430, XH540) and P-series, plus the Feetech STS/SCS line. This world dominates research robots, humanoids, quadrupeds, and serious hobby/educational arms.

| Attribute | RC/Hobby | Industrial | Smart Serial (Dynamixel-style) |
|---|---|---|---|
| Typical cost/axis | $5–$80 | $300–$3,000+ | $25–$1,000 |
| Motor type | Brushed (mostly) | PMSM / AC synchronous | Brushed or BLDC (coreless on premium) |
| Feedback | Potentiometer | Encoder / resolver, 17–24 bit | Magnetic encoder, 12-bit typ. |
| Command interface | PWM 1–2 ms @ ~50 Hz | ±10 V, step/dir, EtherCAT/CANopen | Serial packet (Protocol 2.0) |
| Position range | 120–270° (limited) | Multi-turn, unlimited | 360° or multi-turn (extended mode) |
| Telemetry back | None | Full (drive) | Position, vel, current, temp, voltage |
| Holding torque | Yes, lossy | Yes, controlled | Yes, current-limited |
| Where it fits | Models, prototypes, animatronics | CNC, packaging, factory automation | Research robots, humanoids, arms |

## How an RC/hobby servo works

The RC servo is a beautiful piece of 1970s analog cleverness that has survived almost unchanged in concept. Understand it once and you understand half the small-robot world.

### The PWM position command

Despite the name, the control signal is **not** PWM in the power-electronics sense (it carries no power and the duty cycle isn't what matters). It's a **pulse-width position code**: a pulse repeated at roughly **50 Hz** (every 20 ms), where the *pulse width* encodes the target position.

```
Standard RC servo signal (~50 Hz frame):

  1000 µs pulse  ->  full one way      (e.g. -60°)
  1500 µs pulse  ->  center / neutral  (0°)
  2000 µs pulse  ->  full other way    (+60°)

  |<------------------- 20 ms frame (50 Hz) ------------------->|
  |__                                                          |__
  |  |________________________________________________________|  |...
  |<>|  pulse width = position command (1000–2000 µs)
```

The 1000–2000 µs range with 1500 µs neutral is the de-facto standard.
Many servos accept a wider range (about 500–2500 µs) for extended travel, but pushing past the mechanical stops will stall and cook the motor. The frame rate is loose: analog servos tolerate 40–60 Hz; digital ones often accept much faster frames. ### What's inside Open the case and you find: a small brushed DC motor, a reduction gear train (often 3–6 stages of spur gears), a **potentiometer** geared to the output shaft, and a control board. The potentiometer is the sensor. As the output rotates, the pot wiper voltage changes. The control board compares that feedback voltage against a voltage derived from the incoming pulse width. The difference (error) drives an **H-bridge** that powers the motor in the direction that reduces the error. When the pot voltage matches the command, the motor stops. That's the whole loop — a position servo built from a comparator and a motor driver. ### Deadband No servo holds an infinitely precise null. There's a **deadband**: a small error window where the controller does nothing, to stop the motor from buzzing and hunting around the target. Cheap analog servos have a wide deadband (sloppy, ~5–10 µs equivalent); good digital servos shrink it (crisp, ~1–3 µs), which is why digitals "lock in" harder. ### Analog vs digital servos The mechanical guts are often identical. The difference is the control board. - **Analog servos** drive the motor with the ~50 Hz signal directly — the motor gets a power pulse only once per 20 ms frame. Cheap, low idle current, but soft holding torque and slow to respond, especially to small errors. - **Digital servos** use a microcontroller that re-samples the error and re-drives the H-bridge at **300 Hz to 1 kHz or more**, independent of the input frame rate. Result: faster response, tighter deadband, much stronger holding torque near the target — at the cost of higher idle current and more heat. 
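The pulse-width code and the compare-and-drive loop with a deadband, as described in this section, can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the ±60° travel is taken from the example mapping above, and the three-state drive output is a simplification (real servo boards typically scale drive effort with the error rather than switching fully on or off).

```python
def angle_to_pulse_us(angle_deg, travel_deg=60.0, center_us=1500.0, half_span_us=500.0):
    """Map an angle in [-travel, +travel] degrees to a pulse width in µs.

    The angle is clamped first so the command never leaves the standard
    1000-2000 µs window (and never drives the servo into its end stops).
    """
    angle = max(-travel_deg, min(travel_deg, angle_deg))
    return center_us + half_span_us * angle / travel_deg


def drive_command(command_us, feedback_us, deadband_us=3.0):
    """Return the H-bridge direction: +1, -1, or 0 inside the deadband.

    command_us is the incoming pulse width; feedback_us is the pot reading
    expressed in the same µs-equivalent units (an assumption of this model).
    """
    error = command_us - feedback_us
    if abs(error) <= deadband_us:
        return 0  # close enough: stop driving so the motor does not hunt
    return 1 if error > 0 else -1


print(angle_to_pulse_us(0.0), angle_to_pulse_us(30.0), angle_to_pulse_us(90.0))
print(drive_command(1500.0, 1498.0), drive_command(1500.0, 1490.0))
```

Widening `deadband_us` toward the 5–10 µs of a cheap analog servo makes the model stop further from the target, which is the "sloppy" behaviour described above; shrinking it reproduces the tighter lock of a digital servo.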
> **Rule:** If your application needs the servo to *hold* against a load (a robot arm fighting gravity, a steering linkage), buy digital. If it just needs to slew to a position occasionally with little holding load, analog is cheaper and cooler. ### Continuous-rotation "servos" Pull out the pot and replace it with a fixed voltage divider, and the servo never reaches its target — so it spins continuously, with pulse width now commanding *speed and direction* instead of position. These "continuous rotation servos" are really just geared motors with a built-in PWM-to-speed driver. Convenient, but you've thrown away the closed loop; they're open-loop on speed. ## Inside an industrial servo system Industrial servos split the system into a **motor** and a **drive (amplifier)**, and that separation is the source of their performance. The drive is a serious piece of power electronics and DSP, not a comparator on a hobby board. ### The motor: AC servo vs brushless DC Most modern industrial servomotors are **permanent-magnet synchronous motors (PMSM)**, marketed as "AC servomotors." They're three-phase, sinusoidally driven, and electrically nearly identical to what the hobby/drone world calls a BLDC motor — the difference is mostly the back-EMF waveform (sinusoidal vs trapezoidal) and the control strategy. For the full treatment of the motor itself, see the [brushless DC motors guide](/posts/brushless-dc-motors-bldc-ultimate-guide/) and the [FOC controllers guide](/posts/motor-controllers-foc-ultimate-guide/). Key point: an "AC servo" and a "brushless DC servo" are siblings. Both are brushless PM machines. "AC servo" usually implies sinusoidal commutation with **field-oriented control (FOC)** and a high-resolution encoder; "brushless DC servo" sometimes implies simpler six-step trapezoidal commutation. Good industrial drives all do FOC now. ### The feedback device This is where industrial servos earn their price. 
Instead of a pot, you get:

- **Incremental encoders** — high resolution (e.g. 2,000–10,000 lines, quadrature-multiplied to 8,000–40,000 counts/rev), but need homing on power-up.
- **Absolute encoders** — know position at power-on without homing. Single-turn (e.g. 17-bit = 131,072 counts/rev) or multi-turn (e.g. 17-bit single + 16-bit turns counter). Modern Yaskawa/Mitsubishi units run 22–24 bit.
- **Resolvers** — rugged analog devices, great for high-temperature/high-vibration environments (motorsport, aerospace), lower resolution but nearly indestructible.

### The cascaded control loops

The drive runs three nested loops, fastest on the inside:

```
     position cmd           velocity cmd         torque (current) cmd
           |                      |                        |
     +-----v----+   error   +-----v----+  error   +--------v-------+
  -->| POSITION |---------->| VELOCITY |--------->| TORQUE/CURRENT |--> motor
     |   loop   |  (P/PI)   |   loop   |   (PI)   | loop (PI, kHz) |
     +-----^----+           +-----^----+          +--------^-------+
           |                      |                        |
      position fb            velocity fb              current fb
       (encoder)           (diff. of pos)           (phase shunts)
```

- **Torque/current loop** runs fastest — often **8–20 kHz** — regulating motor current (hence torque, since torque = Kt × current). This is the foundation; everything above assumes it can deliver commanded torque instantly.
- **Velocity loop** wraps the current loop, regulating shaft speed (PI control), typically at **1–4 kHz**.
- **Position loop** is outermost, often just proportional (P) on position error feeding a velocity command, sometimes with feedforward, at **0.5–2 kHz**.

You tune from the inside out: get the current loop right (usually auto-tuned to the motor's L and R), then velocity, then position. This cascade is what gives industrial servos their bandwidth, stiffness, and disturbance rejection. The same architecture appears, scaled down, inside good smart serial servos.

### Regeneration

Decelerating a high-inertia load, the motor becomes a generator and pumps energy back into the DC bus.
Industrial drives handle this with a **braking resistor** (dump the energy as heat) or **regenerative** circuitry (return it to the mains). Ignore this on a big inertial load and the bus over-voltage fault will trip the drive — or pop it.

## Reading a servo datasheet

Datasheets are where money is won or lost. Vendors lead with the flattering numbers. Here's how to read past them.

| Spec | What it means | The trap |
|---|---|---|
| **Stall torque** | Max torque at zero speed, max voltage, momentarily | Peak, transient. You cannot run here continuously — it's a thermal death sentence. |
| **No-load speed** | Max speed with nothing on the shaft | You never operate here; any torque load drops it. |
| **Rated (continuous) torque** | Torque it can hold indefinitely without overheating | The number that actually sizes your continuous duty. |
| **Rated speed** | Speed at rated torque | The real operating corner of the speed-torque curve. |
| **Peak torque** | Short-burst max (industrial), e.g. 3x rated for a few seconds | Limited by I²t and demag, not by mechanics. |
| **Torque constant Kt** | N·m per amp of motor current | Lets you predict torque from current and vice versa. |
| **Back-EMF constant Ke** | Volts per rad/s | In SI units, Ke (V/(rad/s)) = Kt (N·m/A) numerically. |
| **Rotor inertia Jm** | Inertia of the spinning rotor (kg·m²) | Sets how much load inertia you can match (see sizing). |
| **Rated current / peak current** | Continuous and burst current | Drive must supply peak; supply must not brown out. |
| **Duty cycle / S1–S9** | How long it can run at a given load | S1 = continuous; intermittent ratings allow higher torque for a limited time. |
| **Holding torque** | Torque to hold position statically | Still draws current and makes heat. Often near rated. |

### The speed-torque curve

This is the single most informative graphic in any servo datasheet.
It plots torque (x) vs speed (y), with two regions:

- **Continuous operating region** — the box you live in for repetitive duty, bounded by rated torque and rated speed (thermally limited).
- **Intermittent/peak region** — torque you can pull for short bursts (acceleration), bounded by peak torque, current limits, and demag.

Plot your actual move profile's torque-speed points on this chart. Every point of *continuous* operation must sit inside the continuous box; *transient* peaks may enter the intermittent region. If your acceleration torque pokes outside even the peak region, the servo is too small. Full stop.

### Kt, Ke, and the unit gotcha

Torque is proportional to current: `T = Kt × I`. Speed is set by voltage minus the IR drop: the motor spins until its back-EMF nearly equals the applied voltage. In SI units, **Kt (N·m/A) equals Ke (V·s/rad) numerically** — they're the same physical constant viewed from the torque side and the voltage side. The classic mistake is mixing units: a Kt given in oz-in/A and a Ke in V/kRPM look unrelated until you convert both to SI. Convert everything to N·m, A, V, and rad/s before you trust any back-of-envelope math.

```
Torque from current:   T [N·m]   = Kt [N·m/A] × I [A]
Speed vs voltage:      ω [rad/s] ≈ (V - I·R) / Ke      with Ke = Kt in SI
Electrical power:      P_elec    = V × I
Mechanical power:      P_mech    = T × ω
```

## Smart serial servos for robotics

Smart serial servos are the reason a graduate student can build a 20-DOF humanoid without a cabinet full of industrial drives. They collapse the whole servo system into one networked module.

### What's in the box

Take a Dynamixel XM430-W350 as the canonical example: a coreless or cored brushed/BLDC motor, a metal-gear reduction (e.g. ~353:1), a **contactless 12-bit magnetic encoder** (4096 positions/rev), a current sensor, a temperature sensor, a microcontroller running a cascaded PID, and a half-duplex serial transceiver — all in a roughly 35 × 28 × 46 mm case.
You get back, over the wire: present position, velocity, current, input voltage, and temperature.

### The bus: TTL vs RS-485, and daisy-chaining

Two physical layers dominate:

- **TTL half-duplex** (Dynamixel X-series like XL/XM): a single data line shared by all devices, 3.3 V logic. Cheap, fine for short chains.
- **RS-485 half-duplex** (Dynamixel higher-end and P-series): differential pair, far better noise immunity and longer runs — use it for anything beyond a benchtop.

Devices **daisy-chain**: each has two connectors wired in parallel so you string them in a line. Every device has a unique **ID** (0–252; 254 is broadcast) and a **baud rate** (commonly 57,600 bps, configurable up to 4.5 Mbps on X-series). The host (a U2D2 adapter or an OpenCR/OpenRB board) is the bus master; servos only speak when addressed.

### Protocol 2.0 packet

ROBOTIS Protocol 2.0 is the common language. A simplified instruction packet:

```
Protocol 2.0 instruction packet layout:

 Header(3)   RSRV  ID   LEN(2)  INST  PARAMS...        CRC(2)
 FF FF FD    00    01   07 00   03    74 00 C8 00...   LL HH
    |         |     |     |      |         |             |
 fixed      0x00  ID=1  length  WRITE  addr+data       CRC-16

INST examples: 0x01 PING  0x02 READ  0x03 WRITE  0x83 SYNC WRITE  0x92 BULK READ
```

The **Sync Write** and **Bulk Read** instructions are what make multi-joint robots practical: one packet commands position/velocity on many servos at once, or reads telemetry from many, instead of round-tripping each ID separately. On a fast bus you can update 20+ joints at hundreds of Hz.

### Operating modes and current-based torque control

Modern X/P-series servos expose multiple control modes you switch by writing a register:

- **Position control** — go to an angle (single-turn).
- **Extended position (multi-turn)** — track position across many revolutions.
- **Velocity control** — command a speed (continuous rotation).
- **Current control** — directly command motor current, i.e. **torque**.
This is the big one for robotics: it lets you do compliant, force-controlled motion, gravity compensation, and back-drivable joints. - **Current-based position control** — go to a position but cap the current/torque, so the joint is gentle and won't crush a finger or strip a gear. That current-limited position mode is, honestly, the killer feature. It gives you a poor-man's torque-controlled joint without the cost of a true industrial drive, and it's why these dominate research arms and grippers. > **Rule:** If you need compliant or force-aware joints on a budget, smart serial servos with current control beat both RC servos (no telemetry) and industrial drives (no money) for most sub-10-kg robots. ## Gearing and torque Almost no servo motor is used at the motor shaft. Motors make their power at high speed and low torque; joints want the opposite. The gearbox is the translator, and it dominates the servo's real-world behavior. ### Why reduction is mandatory A small motor might make 0.05 N·m at 8,000 rpm. A robot elbow wants maybe 5 N·m at 60 rpm. A **100:1 reduction** turns that 0.05 N·m into a theoretical 5 N·m (minus efficiency) and drops 8,000 rpm to 80 rpm. Torque multiplies by the ratio; speed divides by it. Reflected inertia, crucially, divides by the ratio *squared* — more on that in sizing. ### Backlash Backlash is the lost motion when you reverse direction — the gear teeth have to take up clearance before torque transmits. It's the enemy of positioning accuracy and the source of "wobble" in cheap servos. - **Spur/planetary gear RC servos:** typically **0.1–0.5°** of backlash. Fine for a camera gimbal, sloppy for a precise end-effector. - **Harmonic (strain-wave) drives:** essentially **zero backlash**, under **1 arc-minute**. The reason every precision robot wrist uses them — at a price. - **Cycloidal drives:** very low backlash, high shock tolerance, used at the heavy base joints of industrial arms. 
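The reduction arithmetic from "Why reduction is mandatory" is easy to check numerically. The sketch below is illustrative only: the function names are made up for this example, and the 0.9 efficiency in the second call is an assumed figure, not a measured one.

```python
def through_gearbox(motor_torque_nm, motor_speed_rpm, ratio, efficiency=1.0):
    """Return (output_torque_nm, output_speed_rpm) after an N:1 reduction."""
    output_torque = motor_torque_nm * ratio * efficiency  # torque multiplies by N
    output_speed = motor_speed_rpm / ratio                # speed divides by N
    return output_torque, output_speed


def reflected_inertia(load_inertia_kgm2, ratio):
    """Load inertia as seen at the motor shaft: divided by N squared."""
    return load_inertia_kgm2 / ratio ** 2


if __name__ == "__main__":
    # The example from the text: 0.05 N·m at 8,000 rpm through 100:1.
    print(through_gearbox(0.05, 8000, 100))
    # Same motor with an assumed 90% efficient gearbox.
    print(through_gearbox(0.05, 8000, 100, efficiency=0.9))
    print(reflected_inertia(0.02, 50))
```

The ideal case reproduces the 5 N·m at 80 rpm quoted above; with losses included the elbow gets about 4.5 N·m, which is why the theoretical figure always needs a margin.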
For the full gearbox treatment — strain-wave, cycloidal, planetary, and how to choose — see the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/). ### Metal vs nylon (Karbonite) gears A perennial hobby-servo question: - **Nylon/Karbonite gears** — quiet, cheap, self-lubricating, and they **strip before they shatter** under shock. That's a feature: the gear is the sacrificial fuse protecting the motor. Good for light loads and crash-prone applications. - **Steel/titanium gears** — high torque capacity, durable under sustained load, but transmit shock straight into the motor and case. If you stall a metal-gear servo against a hard stop, the output shaft or the case mounts fail instead of a cheap gear. > **Rule:** Metal gears for sustained high torque; nylon gears when crash protection and cost matter more than ultimate strength. Don't put metal gears on a hobby airframe and assume you've upgraded — you've just moved the failure point to something more expensive. ## Control: cascaded loops, tuning, and limiting Whether it's a $9 servo or a $1,200 drive, the control law is some flavor of PID, usually cascaded. Knowing how it's structured tells you how to tune it and why it misbehaves. ### The cascade, again, and why order matters As shown earlier, the loop nests current → velocity → position. The reason for the cascade rather than one monster position-PID: each inner loop linearizes and stiffens the plant the outer loop sees. The velocity loop only works well if the current (torque) loop is fast and accurate; the position loop only works well if velocity is well-regulated. **Tune inside-out.** Tuning the position loop while the velocity loop is sloppy is chasing your tail. ### PID terms, in servo language - **P (proportional)** — stiffness. Higher P = stronger correction per unit error = stiffer joint that holds position harder. Too high → oscillation/buzz. - **I (integral)** — kills steady-state error (e.g. droop under constant gravity load). 
Too high or unbounded → overshoot and **integral windup**. - **D (derivative)** — damping. Resists rapid error change, calms overshoot. Too high → amplifies sensor noise into jitter. Many servo position loops are P-only on position with PI on velocity underneath — that combination handles the integral action where it belongs (velocity) and keeps the position loop clean. ### Anti-windup When a servo saturates — it's commanding max current but the load won't move (stall, hard stop, slow ramp) — the integrator keeps accumulating error it can't act on. When the obstruction clears, that stored-up integral term slams the output and you get a violent overshoot. **Anti-windup** clamps or back-calculates the integrator while saturated. Any decent servo firmware has it; if your homebrew loop overshoots wildly after a stall, this is almost always why. ### Current limiting The current limit protects the motor, the drive, and your fingers. It's set below the demagnetization and thermal limits. In smart serial servos it's a register you write (the "Goal Current" / current-limit). In industrial drives it's a torque-limit parameter, often switchable on the fly for force-sensitive operations (e.g. limit torque during a press-fit). Always set it deliberately — the default is often "as much as the hardware survives," which is not what you want crushing into an obstacle. ### Feedforward High-performance drives add **feedforward**: they predict the torque needed for the commanded acceleration (and the velocity needed for the commanded motion) and inject it directly, so the feedback loop only cleans up the residual. This dramatically improves tracking on fast, dynamic moves. It's why a well-tuned industrial servo can follow a complex trajectory with tiny following error, while a pure-feedback loop lags. ## Sizing a servo for your joint This is the section most engineers skip and most regret. 
Sizing on stall torque is how you end up with a servo that's "strong enough" on paper and burns out in a week. Do it properly.

### Step 1 — Reflected inertia

The load inertia, seen through the gearbox, is divided by the gear ratio squared:

```
J_reflected = J_load / N²        (N = gear reduction ratio)

Example: J_load = 0.02 kg·m², N = 50
J_reflected = 0.02 / 2500 = 8.0e-6 kg·m²   (8 µkg·m²)
```

That `N²` is why high-ratio gearboxes make big loads feel tiny to the motor — and why direct-drive (N≈1) servos must be physically huge to move any real inertia.

### Step 2 — The inertia-matching rule

Compare reflected load inertia to the motor's rotor inertia `Jm`:

```
inertia ratio = J_reflected / J_motor
```

- **Ratio ≈ 1:1** — theoretically optimal power transfer, very crisp, but expensive (needs a big motor or high ratio).
- **Ratio 1:1 to ~10:1** — the practical, tunable band. ~5:1 is a common, comfortable target.
- **Ratio > 10:1** — the load dominates; coupling compliance and resonance make the loop hard to tune. You'll have to soften gains and accept lower bandwidth.

If your ratio is 30:1, either increase the gear reduction (which cuts reflected inertia by N²) or pick a motor with higher rotor inertia. This single check prevents most "it oscillates and I can't tune it out" problems.

### Step 3 — Torque budget

Sum the torques the motor must supply, reflected to the motor shaft:

```
T_motor = T_accel + T_friction + T_gravity + T_external,   all referred to motor shaft

T_accel = (J_motor + J_reflected) × α          (α = angular accel, rad/s²)
T_gravity (reflected) = T_gravity_load / (N × η)   (η = gearbox efficiency)
```

Don't forget gearbox efficiency `η` (planetary ~0.9, harmonic ~0.7–0.85, worm much lower) — it makes the load *harder* to drive, so you divide by it when referring load torque back to the motor.

### Step 4 — RMS torque vs rated torque (the one that matters)

A move profile isn't constant torque.
You accelerate (high torque), cruise (low torque), decelerate (torque, possibly negative), and dwell (holding torque). The motor's *thermal* limit responds to the **root-mean-square torque** over the full cycle, including the dwell:

```
T_rms = sqrt( Σ(T_i² × t_i) / Σ t_i )     over accel, cruise, decel, dwell

Requirement:  T_rms       ≤ T_rated       (continuous rating)
              T_peak_move ≤ T_peak_rated  (intermittent rating)
```

> **The sizing rule:** Your **peak** move torque must fit under the **peak/intermittent** rating, and your **RMS** torque over the whole duty cycle must fit under the **continuous (rated)** torque. Stall torque and no-load speed don't enter the calculation at all — they're just the corners of the curve.

Add a margin: target T_rms at **70–80% of rated** and T_peak at **80% of peak** to leave headroom for voltage sag, hot ambient, and friction growth as the joint wears.

## Failure modes and thermal limits

Servos almost always die thermally or from a single overload event. Knowing the modes lets you design them out.

### Stall and I²t

A stalled servo draws stall current — often 5–10x running current — while producing zero mechanical output, so *all* of that electrical power becomes heat in the windings. Heating goes as **I²t** (current squared times time). A brief stall is fine; a sustained one cooks the insulation and demagnetizes the magnets. Good drives and smart servos enforce an **I²t limit**: integrate I² over time and fault out before the windings exceed their thermal class.

> **Rule:** Treat a stall as a fault, not an operating state. If your design ever holds a servo against a hard mechanical stop "to be sure," you're building a heater.

### Gear stripping

Shock loads and stalls strip gear teeth. As noted, nylon gears strip as a sacrificial fuse; metal gears instead pass the shock to bearings, shafts, and mounts. Either way, repeated hard stops or crash impacts are the mechanical killer. Add compliance (a spring, a clutch) or current limiting upstream of a hard stop.
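The I²t limit mentioned under "Stall and I²t" can be sketched as an accumulator that charges whenever current exceeds the continuous rating and bleeds off when it drops below. This is one common simplification, assumed here for illustration; real drives typically use a more detailed thermal model. The class name, the trip level in A²·s, and the example currents are all hypothetical.

```python
class I2tLimiter:
    """Accumulates excess I² over time and reports when a trip level is reached."""

    def __init__(self, rated_current_a, trip_level_a2s):
        self.rated_sq = rated_current_a ** 2
        self.trip_level = trip_level_a2s   # assumed thermal budget, in A²·s
        self.accumulator = 0.0

    def update(self, current_a, dt_s):
        """Advance by dt_s at the given current; return True if the limit has tripped."""
        self.accumulator += (current_a ** 2 - self.rated_sq) * dt_s
        self.accumulator = max(self.accumulator, 0.0)  # cannot cool below ambient
        return self.accumulator >= self.trip_level


def time_to_trip(limiter, current_a, dt_s=0.01, max_time_s=60.0):
    """Hold a constant current and return seconds until trip, or None if it never trips."""
    steps = int(max_time_s / dt_s)
    for n in range(1, steps + 1):
        if limiter.update(current_a, dt_s):
            return n * dt_s
    return None


if __name__ == "__main__":
    # Hypothetical servo: 1 A continuous, 48 A²·s budget.
    print(time_to_trip(I2tLimiter(1.0, 48.0), current_a=5.0))  # stall at 5x rated
    print(time_to_trip(I2tLimiter(1.0, 48.0), current_a=2.0))  # moderate overload
    print(time_to_trip(I2tLimiter(1.0, 48.0), current_a=1.0))  # at rated current
```

Because heating scales with current squared, the 5x stall trips in roughly two seconds while the 2x overload lasts about eight times longer, and operation at rated current never trips.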
### Brownout / under-voltage reset The most common "ghost" failure: a servo's stall inrush sags the shared supply rail, the logic voltage dips below the microcontroller's brownout threshold, and the controller resets mid-motion. Symptoms look random and software-y but are pure power-electronics. Fix: separate rails, bulk capacitance, and adequate supply current (see wiring). ### Demagnetization Permanent magnets lose strength if exposed to a strong opposing field (from over-current) or excessive temperature beyond the magnet's grade rating. Demag is often partial and permanent: the motor's Kt drops, so it makes less torque per amp, runs hotter for the same load, and demags further — a slow death spiral. Current limits and thermal limits exist largely to prevent this. ### Duty cycle and thermal class Continuous (S1) rated torque assumes steady-state thermal equilibrium. Intermittent duty (S3, etc.) allows higher torque because the motor cools during off-time. Respect the duty rating: a servo rated for 25% duty at peak torque that you run at 60% duty will overheat even though no single move exceeds the peak number. The winding insulation class (e.g. Class B ~130 °C, Class F ~155 °C) sets the ceiling; many servos derate hard above ~40 °C ambient. ### Bearing wear and backlash growth The slow, boring failure: bearings and gear faces wear, backlash grows, the loop gets harder to tune, and accuracy drifts. Not catastrophic, but it's why a 5-year-old production line servo positions worse than a new one. Plan maintenance intervals for precision axes. ## Selection guide and comparison table Pick the world first, then the unit. Here's a decision shortcut and a real-product spec table spanning the three classes. ### Decision shortcut - **Prototyping, models, animatronics, <2 kg loads, no telemetry needed** → RC/hobby servo. Buy digital + metal gears if it holds load. 
- **Research robot, humanoid, gripper, arm, need current/torque feedback, 0.1–10 kg per joint** → smart serial servo (Dynamixel X/P, Feetech).
- **Factory automation, CNC, packaging, high duty, high precision, fieldbus to a PLC** → industrial servomotor + drive.
- **High-power, back-drivable, dynamic legged-robot joints** → consider a custom BLDC + [FOC controller](/posts/motor-controllers-foc-ultimate-guide/) (ODrive, Moteus) with an [encoder](/posts/encoders-ultimate-guide/), which is arguably a servo you assemble yourself.

See the [robot actuators guide](/posts/robot-actuators-ultimate-guide/) for the full landscape.

### Real-product spec table

| Product | Class | Stall/rated torque | Speed (no-load) | Feedback | Interface | Voltage | Notes |
|---|---|---|---|---|---|---|---|
| Futaba S3003 | RC analog | ~0.41 N·m stall @ 6 V | ~0.19 s/60° | Pot | PWM 50 Hz | 4.8–6 V | Classic cheap hobby standard |
| Savöx SB-2290SG | RC digital | ~6.9 N·m stall @ 8.4 V | ~0.11 s/60° | Pot | PWM (digital) | 6–8.4 V | Brushless, steel gear, high-torque |
| Dynamixel XL330-M288 | Smart serial | ~0.52 N·m stall @ 5 V | ~104 rpm | 12-bit mag | TTL, Protocol 2.0 | 3.7–6 V | Tiny, low-cost research servo |
| Dynamixel XM430-W350 | Smart serial | ~4.1 N·m stall @ 12 V | ~46 rpm | 12-bit mag | TTL, Protocol 2.0 | 10–14.8 V | Current control; arm/gripper workhorse |
| Dynamixel XH540-W270 | Smart serial | ~11.7 N·m stall @ 14.8 V | ~46 rpm | 12-bit mag | TTL/RS-485 | 10–14.8 V | High-torque robot joints |
| Kollmorgen AKM23 | Industrial AC | ~0.9 N·m rated, ~2.8 N·m peak | ~6,000 rpm | 17–24 bit abs/resolver | EtherCAT/analog (AKD drive) | 120–240 VAC class | Continuous duty, machine axes |
| Teknic ClearPath CPM-SDSK | Integrated industrial | ~0.4–3+ N·m models | up to ~6,000 rpm | Integrated encoder | Step/dir, pulse, serial | 24–75 VDC | Motor+drive+encoder in one, NEMA frames |
| Maxon EC-i 40 + EPOS4 | Modular servo | ~0.1 N·m cont. (motor) | high (motor) | Encoder | CANopen/EtherCAT (EPOS4) | 24–48 VDC | Build-your-own precision servo |

Numbers are representative datasheet figures and vary by exact model/winding/voltage — always pull the current datasheet for the specific part and winding before committing.

## Practical wiring and power notes

More servo projects fail on power integrity than on control theory. The fixes are cheap if you design them in.

### Separate logic and motor power, common ground

Never run motor current through your microcontroller's 5 V regulator. The motor's inrush and stall current will sag the rail and reset the logic. Use **two supplies**: one clean rail for logic, one beefy rail for motor power. Tie their grounds together at a single point (the servo signal is referenced to the motor-power ground, so the logic must share that reference).

```
+5V logic ----[MCU / Pi]----signal------------> servo signal pin
                  |
GND --------------+------------------+--------> servo GND
                                     |
                                 [bulk cap]
                                     |
+7.4V motor -------------------------+--------> servo V+
```

### Size for inrush and stall, not running current

A servo's running current might be 200–500 mA, but its **inrush** (startup) and **stall** current can be several amps each. Multiply by the number of servos that might move or stall simultaneously. A 20-servo robot can pull 20–40 A peak even if it idles at 2 A. Size the supply and wiring for the worst-case simultaneous draw, or stagger startup.

### Bulk capacitance near the drive

Put bulk capacitors (hundreds to thousands of µF, plus ceramics for high-frequency) close to the servos/drive to supply transient current and absorb regenerative spikes. This is the single cheapest fix for brownout resets and bus over-voltage trips on deceleration.

### Common grounds and noise

PWM signal lines pick up motor noise. Keep signal wires short, route them away from motor-power conductors, and on long runs use RS-485 (differential) rather than TTL.
For smart serial buses, a clean common ground across all devices is mandatory — a floating ground on one servo corrupts the whole chain. > **Rule:** If a microcontroller "randomly reboots" when motors move, stop debugging the firmware. It's a brownout. Separate the rails and add capacitance first. ### Connector and current rating Hobby servo pigtails and JST/Molex connectors are rated for modest current. Don't daisy-chain power for a dozen high-torque servos through one thin connector — distribute power with a proper bus bar or power-distribution board rated for the aggregate stall current. Melted connectors are a real and common failure. ## Frequently asked questions **What is the difference between a servo motor and a regular DC motor?** A regular DC motor is open-loop: you apply voltage and it spins, with no idea of its position. A servo motor is that motor (or a brushless/AC one) plus a position sensor and a closed-loop controller that drives the shaft to a commanded position and holds it there, correcting for load and disturbance. The motor is one of three parts; the sensor and controller are what make it a servo. **Is a servo motor AC or DC?** Both exist. Hobby and many smart serial servos use a brushed DC motor; high-end smart servos and most industrial servos use a brushless permanent-magnet machine. Industrial "AC servomotors" are three-phase PMSMs driven sinusoidally — electrically they're close cousins of what the drone world calls a BLDC motor. **How does the PWM signal control an RC servo's position?** The signal is a pulse repeated about every 20 ms (~50 Hz). The *pulse width* encodes position: roughly 1000 µs drives one extreme, 1500 µs is center, and 2000 µs is the other extreme. The servo's control board compares that commanded position against its potentiometer feedback and drives the motor until they match. The duty cycle itself carries no power — it's a position code. 
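As a rough illustration of the pulse-width code described in that answer, the sketch below maps an angle onto the nominal 1000–2000 µs range and shows how small the resulting duty cycle is. The ±45° travel is an assumption for the example (actual travel per microsecond varies by servo), and the function names are made up.

```python
def angle_to_pulse_us(angle_deg, min_angle=-45.0, max_angle=45.0,
                      min_us=1000.0, max_us=2000.0):
    """Map an angle to a pulse width, clamping to the nominal endpoints."""
    angle = max(min_angle, min(max_angle, angle_deg))  # never command past the range
    fraction = (angle - min_angle) / (max_angle - min_angle)
    return min_us + fraction * (max_us - min_us)


def duty_fraction(pulse_us, frame_ms=20.0):
    """Fraction of each frame the signal line is high."""
    return pulse_us / (frame_ms * 1000.0)


if __name__ == "__main__":
    for angle in (-45, 0, 45, 60):
        pulse = angle_to_pulse_us(angle)
        print(angle, pulse, duty_fraction(pulse))
```

Center (0°) comes out at 1500 µs, which is only 7.5% duty in a 20 ms frame. The out-of-range 60° request is clamped to 2000 µs rather than driving the servo toward its mechanical stop.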
**Why does my servo get hot or burn out when holding a load?** Holding a static load still requires torque, which requires current, which makes heat (I²t) even though the shaft isn't moving. If the holding torque is near the servo's rated torque, or it's fighting a hard stop (stall), heat builds until the windings overheat or the magnets demagnetize. Size the servo for the *holding* torque, add a mechanical brake or counterbalance, or set a current limit. **What's the difference between analog and digital RC servos?** The mechanics are often identical; the control board differs. Analog servos drive the motor only once per ~50 Hz input frame, giving softer holding torque and slower response. Digital servos re-drive the H-bridge at 300 Hz–1 kHz+ regardless of input frame rate, giving faster response, a tighter deadband, and much stronger holding torque — at the cost of higher idle current and more heat. **What is stall torque and can I run a servo at it continuously?** Stall torque is the maximum torque a servo produces at zero speed, at max voltage, for an instant. No — you cannot run there continuously; at stall the motor draws maximum current and converts all of it to heat, so it overheats fast. Size continuous operation on the *rated (continuous)* torque, and check your *RMS* torque over the duty cycle against it. **What is the inertia matching rule and why does it matter?** Compare the load inertia reflected through the gearbox (load inertia divided by gear ratio squared) to the motor's rotor inertia. Keep the ratio roughly between 1:1 and 10:1 (about 5:1 is a comfortable target). Outside that band — especially above 10:1 — the load dominates, drive-train compliance causes resonance, and the control loop becomes hard to tune without softening gains and losing bandwidth. **Can I use a Dynamixel servo for torque or force control?** Yes. 
Dynamixel X and P series support current-control mode, where you command motor current directly (current is proportional to torque). They also offer current-based position control, where the servo moves to a target but caps torque. That makes compliant, force-aware, back-drivable joints possible without an expensive industrial drive — the main reason these dominate research arms and grippers. **How do I daisy-chain and address multiple smart servos?** Each servo has two parallel connectors so you wire them in a line on one shared bus (TTL or, better for noise, RS-485). Every servo gets a unique ID (0–252) and a matching baud rate, set once via the bus. A host adapter (e.g. ROBOTIS U2D2 or an OpenRB board) acts as bus master, and Sync Write / Bulk Read packets command or read many servos in a single transaction at high update rates. **Why does my microcontroller reset when the servos move?** Almost certainly a brownout. Servo inrush and stall current sag a shared power rail below the logic's reset threshold. Fix it with separate logic and motor power supplies sharing a common ground, bulk capacitance near the servos, and a supply sized for worst-case simultaneous stall current — not running current. **Metal gears or nylon gears — which should I choose?** Metal (steel/titanium) gears for sustained high torque and durability, but they transmit shock straight into the motor and mounts. Nylon/Karbonite gears are cheaper, quieter, self-lubricating, and strip as a sacrificial fuse on overload — good crash protection for light loads. Pick metal when load is high and steady; pick nylon when impacts are likely and the gear failing first is preferable to the motor or chassis failing. **What does the torque constant Kt tell me?** Kt (N·m/A) is how much torque the motor makes per amp of motor current: torque = Kt × current. In SI units it equals the back-EMF constant Ke (V·s/rad) numerically, so the same constant predicts both torque-from-current and speed-from-voltage. 
It lets you estimate current draw for a required torque and check it against your supply and current limit. ## Changelog - **2026-06-18** — Initial publication. --- # Linear Motion Systems: Rails, Ball Screws & Linear Motors — The Ultimate Guide URL: https://blog.robo2u.com/posts/linear-motion-systems-ultimate-guide/ Published: 2026-06-17 Updated: 2026-06-20 Tags: linear-motion, ball-screw, lead-screw, linear-rails, linear-motors, linear-guides, belt-drive, robotics-hardware, guide Reading time: 37 min > A working engineer's guide to linear motion: profile rails and recirculating-ball guides, ball/lead/roller screws, belt and rack drives, and linear motors — with preload classes, accuracy grades, life and critical-speed math, real parts, and a selection workflow. Almost every motor you bolt into a machine spins, and almost every job you actually want done is straight-line. A gantry slides a tool over a part; a Cartesian pick-and-place drops a chip onto a board; a CNC table feeds stock past a spindle; a humanoid's linear ankle pushes the foot. Somewhere between the rotor and the work, something has to turn rotation into translation — or skip rotation entirely. That something is the linear motion system, and it is where a surprising amount of a machine's real-world accuracy, speed, and stiffness gets decided. This is the long version. We'll separate the problem into its three honest subsystems — the **guide** that constrains the motion to one axis, the **drive** that supplies the force, and the **carriage** that carries the load — because mixing them up is the single most common sizing error. Then we go through each technology family with real numbers and real parts: profile rails from THK, HIWIN, Bosch Rexroth, and NSK; ball, lead, and roller screws; GT2 and HTD belts; rack-and-pinion; and ironcore versus ironless linear motors from Aerotech, Beckhoff, and the like. Numbers with units. Opinions with reasons. 
**The take**: For most machines in 2026, the default linear axis is a pair of profile rails plus a ground ball screw — it's stiff, ~90% efficient, accurate, and the supply chain is deep. Reach for a **belt** when the stroke is long and you care about speed more than micron accuracy; reach for **rack-and-pinion** when the stroke is measured in meters; and reach for a **linear motor** only when you genuinely need the bandwidth, acceleration, and zero-backlash directness that a screw can never give you — and you can pay for the magnets, the encoder, and the heat. Pick by the guide-drive-carriage trio as a system, never by the screw alone. Companion reading: [robot actuators](/posts/robot-actuators-ultimate-guide/), [servo motors](/posts/servo-motors-ultimate-guide/), [gearboxes (harmonic & cycloidal)](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), and [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why linear motion is its own problem](#why) 3. [The three subsystems: guide, drive, carriage](#subsystems) 4. [Profile rails and recirculating-ball linear guides](#rails) 5. [Ball screws vs lead screws vs roller screws](#screws) 6. [Belt and rack-and-pinion drives](#belt-rack) 7. [Linear motors: ironcore vs ironless](#linear-motors) 8. [The precision / speed / force / stroke tradeoff](#tradeoff) 9. [Architectures: Cartesian, gantry, H-bot, CoreXY](#architectures) 10. [Sizing: load, moment, life, critical speed, buckling](#sizing) 11. [Accuracy, repeatability and straightness](#accuracy) 12. [Lubrication, sealing and contamination](#lube) 13. [A selection workflow](#workflow) 14. [Frequently asked questions](#faq) ## Key takeaways - **A linear axis is three subsystems, not one.** The guide constrains motion to a line and carries the moments; the drive supplies thrust; the carriage holds the load. Size each separately — a perfect ball screw bolted to undersized rails still wobbles. 
- **Profile rails (recirculating ball) are the workhorse guide.** Sizes 15–45 mm cover most machines, dynamic load ratings run ~10–100+ kN per block, and they carry roll/pitch/yaw moments that round shafting cannot. THK, HIWIN, Bosch Rexroth, and NSK are the four names you'll keep meeting.
- **Preload class buys stiffness at the cost of friction and life.** Light preload (THK "C0", ~0) rolls free; heavy preload (~8–13% of dynamic load, e.g. THK "C2") removes deflection and adds drag and wear. Most machines want light-to-medium.
- **Ball screws are ~90–95% efficient; lead screws are ~20–50%.** That single number decides motor size, heat, and whether the axis is self-locking. Ball screws also have far less wear and let you preload out backlash; lead screws are cheap and hold position with power off.
- **Roller screws (planetary roller screws) are the heavy/fast option.** Many small contact lines instead of balls give them higher load capacity, higher speed, longer life, and finer leads than ball screws — at several times the price. Think Rollvis, Ewellix (SKF), GSA.
- **DN value caps screw speed.** `DN = screw_diameter_mm × rpm`; stay roughly under ~70,000–100,000 for standard ball screws (recirculation and ball-train dynamics, not just whip). Critical speed and column buckling are *separate* limits you must also check.
- **Belts win at long stroke and high speed; they lose at stiffness and accuracy.** A GT2 or HTD belt axis does 3–10 m/s easily over multi-meter travel, but belt stretch gives you compliance and 50–200 µm of practical positioning error unless you close the loop on the load.
- **Rack-and-pinion is the meters-long answer.** It tiles to any length, handles big thrust, and runs fast, with backlash you fight using a preloaded dual-pinion or a split (master/slave) drive. Standard on gantry overhead axes and large machine tools.
- **Linear motors are direct drive: zero backlash, huge acceleration, and bandwidth a screw can't touch** — but you pay in cost, heat into the machine, full-stroke feedback (a linear encoder), and the lack of any mechanical reduction or self-locking. Aerotech, Beckhoff, ETEL, Kollmorgen. - **Ironcore linear motors make more force and cog; ironless make less force, zero cogging, and zero attraction** to the track. Choose ironcore for thrust and stiffness, ironless for smoothness and constant-velocity scanning. - **Repeatability ≠ accuracy ≠ straightness.** Repeatability (return-to-same-spot) is usually 1–10× better than absolute accuracy; straightness/flatness of the rail set is a third, independent error that no controller fixes. - **Size by the worst point in the duty cycle and by L10 life, not the average.** Thrust, moment loads from offset payloads, critical speed at top rpm, and buckling at full extension are four different limits — the smallest resulting size is rarely the right one. - **Lubrication and sealing decide field life.** A starved or contaminated ball guide fails at a fraction of its catalog L10. Wipers, bellows, positive-pressure purge, and a real relube schedule are not optional on a machine that runs. ## Why linear motion is its own problem Start from the prime mover. A rotary servo or [BLDC motor](/posts/brushless-dc-motors-bldc-ultimate-guide/) makes torque and wants to spin; we cover sizing those in the [servo motors guide](/posts/servo-motors-ultimate-guide/). But a huge fraction of machine work is translation along a straight line, and there are exactly two ways to get there: 1. **Convert** rotary motion to linear with a screw, belt, rack, or cam. The motor still spins; a mechanism does the geometry. 2. **Generate** linear force directly with a linear motor — an "unrolled" rotary motor whose stator is laid flat and whose rotor becomes a moving forcer. 
Both have to solve the same three problems that a rotary joint mostly gets for free from its bearing: - **Constrain** the motion to one degree of freedom and reject the other five (two translations, three rotations). A spinning shaft in a bearing does this naturally; a sliding carriage does not, and the quality of that constraint *is* the linear guide. - **Carry the moment loads.** A payload is almost never on the line of thrust. Offset mass creates roll, pitch, and yaw moments that try to cock the carriage — and moment capacity, not just direct load, is what sizes a real axis. - **Supply thrust** efficiently enough that the motor and its [gearbox](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/) don't have to be absurd. > Rule of thumb: a linear axis is only as good as its weakest of {guide, drive, carriage}. Engineers over-spec the screw and under-spec the rails constantly, then wonder why the tool point shakes. The reason linear motion gets its own discipline — and its own catalogs from THK and Rexroth that are thicker than most engineering textbooks — is that all three problems interact. The guide spacing changes the moment capacity. The screw's end fixity changes its critical speed and its buckling load. The carriage's overhang changes the rail loading. You cannot size them independently and bolt them together; you size them as a system. ## The three subsystems: guide, drive, carriage Decompose every linear axis you ever build into these three parts and most of the confusion evaporates. **The guide** carries the load and constrains the motion. It is the bearing of the linear world. Options, roughly in order of stiffness and cost: - **Profile rail (recirculating ball or roller)** — the default. A hardened, ground rail and a block full of recirculating balls. Carries load and all three moments in one component. 
- **Round shaft + linear ball bushing** (Thomson, Igus) — cheaper, more forgiving of misalignment, but lower moment capacity and stiffness; the shaft sags over span. - **Crossed-roller / box ways** — old-school machine-tool ways and crossed-roller slides; very stiff and damped, but heavy and friction-y. - **Plain bearing / polymer slides** (Igus drylin) — dry-running, light, quiet, corrosion-proof, low cost; lower load and precision, some stick-slip. - **Wheel/cam-roller systems** (Hepco GV3, Bishop-Wisecarver DualVee) — V-guide wheels on a track; fast, debris-tolerant, long travel, lower precision. - **Air bearings** — frictionless, sub-micron straightness, used in metrology and wafer stages; expensive and need clean dry air. **The drive** turns the input (torque or current) into thrust along the axis: - **Ball screw / lead screw / roller screw** — rotary input, threaded conversion, high force, moderate speed. - **Belt (GT2, HTD, AT) / rack-and-pinion** — rotary input, long stroke, high speed, lower stiffness/accuracy. - **Linear motor** — electrical input straight to thrust, no conversion mechanism, highest bandwidth. **The carriage** is the moving structure that bolts to the guide blocks and holds the payload. Its job is stiffness and a sane center of gravity. A carriage that puts the payload far above or ahead of the guide blocks loads them in moment, and moment is what kills L10 life. The cleanest way to think about a machine is: *for each axis, choose a guide, choose a drive, choose how the carriage hangs the load, then size all three against the same duty cycle.* The rest of this guide is the menu for each slot plus the math to size it. ## Profile rails and recirculating-ball linear guides The profile rail linear guide — sometimes called a "linear guideway" or "LM guide" (THK's trademark that became generic) — is the component most machines are built around, so it earns the most ink. 
### How it works A profile rail is a hardened steel beam with precision-ground raceways (usually two pairs, in a "Gothic arch" or circular-arc groove). A block (the carriage, runner block, or "bearing") rides on it with two or four rows of balls that **recirculate**: balls roll along the loaded raceway, get scooped at the end, return through a channel in the block, and re-enter. That recirculation is what gives unlimited travel — unlike a crossed-roller slide whose rollers only roll the length of the cage. The four-row "Gothic arch" geometry is the important bit: each ball contacts the groove at two points, and the four rows are oriented so the block carries load **equally in all four radial directions** (down, up, and both sides) plus all three moments — roll (Mr, about the travel axis), pitch (Mp), and yaw (My). That omnidirectional capacity is exactly what round shafting lacks. ### Sizes and ratings Profile rails come in standard widths, named by rail width in mm: **15, 20, 25, 30, 35, 45, 55, 65**. Rough capacity ladder: | Rail size | Typical dynamic load C per block | Where it lives | |---|---|---| | 15 mm | ~8–14 kN | Small Cartesian, lab automation, 3D printers (linear-rail builds) | | 20–25 mm | ~17–35 kN | Pick-and-place, light gantries, semiconductor handling | | 30–35 mm | ~35–70 kN | Machine-tool sub-axes, robot 7th-axis tracks, mid gantries | | 45 mm | ~70–110 kN | CNC axes, heavy gantries | | 55–65 mm | ~110–250+ kN | Large machine tools, press feeders, heavy structures | Two load numbers matter and they are not the same: - **Dynamic load rating C** — the load at which 90% of a population survives a nominal travel distance (THK and most metric makers use **50 km** of travel as the reference; some legacy/US specs use 100 km, so always read the basis). C drives the L10 life calculation. - **Static load rating C0** — the load that causes a defined permanent indentation (~0.0001× ball diameter total). 
C0 protects against standstill shock, e-stops, and clamping loads, and it sets the static safety factor `fs = C0 / applied load`. > Sizing rule: for a smooth machine use a static safety factor `fs` of about 1.5–3; for machines with vibration, impacts, or e-stops, 3–5. The dynamic rating sets *life*; the static rating sets *survival*. ### Preload classes A block can be assembled with oversized balls so the rows are loaded against each other even with no external force. This **preload** removes internal clearance and increases stiffness, at the cost of rolling friction and accelerated wear. Manufacturers sell discrete classes; THK's nomenclature is typical: | THK class | Preload (≈ % of C) | Use | |---|---|---| | C0 | ~0 (clearance to slight) | Low friction, light load, axes where smoothness beats rigidity | | C1 | ~2–5% of C | General precision machines; the common default | | C2 | ~8–13% of C | High rigidity, heavy cutting, vibration, single-rail/single-block layouts | HIWIN (ZF/Z0/ZA), Rexroth, and NSK have equivalent ladders. The tradeoffs: - **More preload → more stiffness and less deflection under load**, which matters for cutting accuracy and to keep a tool point from drooping under cantilevered mass. - **More preload → more friction and heat**, which matters for low-thrust drives (belts, small linear motors) and for back-driven or hand-loaded axes. - **More preload → shorter life** if combined with high external load, because the *effective* load on the balls is preload + external; the L10 calculation must use the combined value. Most general machines run C1/medium. Go to C2 only when you've justified the stiffness need; go to C0 when friction or smoothness dominates (e.g., a delicate ironless-linear-motor scanning stage). ### Accuracy and precision grades Separate from preload, profile rails ship in **accuracy grades** that bound the running parallelism, height tolerance, and height variation between blocks. 
THK's ladder, roughly: **Normal (no symbol), High (H), Precision (P), Super-precision (SP), Ultra-precision (UP)**. As you climb: - Height tolerance of the block tightens (e.g., from ±0.04 mm Normal toward ±0.005 mm UP). - **Running parallelism** of the raceway against the mounting face tightens — this is the wave you feel as the carriage travels, the source of vertical/horizontal "waviness." - Block-to-block height variation tightens, which lets you run two parallel rails without one fighting the other. > Grade rule: buy accuracy grade to match the *machine's* required straightness, and buy it on **both** rails of a parallel pair. A precision block on a normal rail, or mismatched blocks across a gantry, throws away the money you spent. Roller versions (THK SRG, Rexroth roller rail, HIWIN RG) swap balls for crossed cylindrical rollers: line contact instead of point contact gives substantially higher stiffness and load capacity for the same size, at higher cost and slightly more sensitivity to mounting flatness. Use roller rails for heavy cutting and maximum rigidity; balls for everything else. ### Products and where they show up - **THK** — invented the LM guide; SR/SHS/SSR (ball), SRG/SRS (roller/caged). The reference everyone is benchmarked against. - **HIWIN** — HG/EG/MGN series; MGN9/MGN12 are ubiquitous in hobby and small-machine builds; strong price/performance. - **Bosch Rexroth** — ball and roller rail systems, deep in machine-tool and factory automation; integrates with their actuator modules. - **NSK** — NH/NS series; strong in semiconductor and precision. - **Misumi** — sells THK-compatible and house-brand rails configured online by length; the fast path for one-off machines. - **Igus drylin** — polymer plain-bearing rails (W/T/Q series); dry, light, corrosion-proof, for washdown and low-load axes. ## Ball screws vs lead screws vs roller screws The screw is the most common rotary-to-linear drive, and the three families differ enormously. 
The headline numbers: | Drive | Efficiency | Backlash | Self-locking | Load capacity | Speed | Relative cost | |---|---|---|---|---|---|---| | **Ball screw** | ~90–95% | Near-zero (preloadable) | No | High | High | Medium | | **Lead (ACME/trapezoidal) screw** | ~20–50% | Yes (unless anti-backlash nut) | Often yes | Medium | Low–medium | Low | | **Planetary roller screw** | ~80–90% | Near-zero (preloadable) | No | Very high | Very high | High–very high | ### Lead screws A lead screw is a threaded rod and a nut in **sliding** contact — typically an ACME/trapezoidal thread or a polymer nut on a steel screw. The sliding friction is the whole story: - **Low efficiency (~20–50%)** means a big fraction of motor torque becomes heat, and you size the motor up accordingly. - **Self-locking** is the upside of that friction. If the lead angle is shallow enough (efficiency below ~50% in the back-drive direction), the screw holds position with the motor off — no brake needed. This is why 3D-printer Z axes, jacks, and many vertical hold-position axes use lead screws. - **Backlash** is inherent in a plain nut. Anti-backlash nuts (spring-loaded split nuts, or polymer nuts like Igus drylin) take it out at the cost of wear life and added drag. - **Cheap and quiet.** A stainless lead screw with a Delrin/Igus nut is a few dollars and needs no lubrication. For low-duty, low-load, cost-sensitive axes it's the right answer. Thomson, Nook, Misumi, and Igus all sell lead screws and anti-backlash nuts off the shelf. ### Ball screws A ball screw replaces sliding with **rolling**: hardened balls run in a matched helical groove between screw and nut, recirculating through a return tube or internal deflector, exactly like a profile-rail block wrapped around a screw. Consequences: - **~90–95% efficiency** — rolling friction is tiny, so most motor torque becomes thrust and very little becomes heat. This is the single biggest reason to choose a ball screw. 
- **Not self-locking** — a vertical ball-screw axis will back-drive under gravity if you cut power. Add a motor brake. - **Backlash is preloadable to near-zero.** Use an oversized-ball preload, a double-nut preload, or a lead-offset preload to remove axial play. Preload buys stiffness and zero backlash at the cost of friction and life — the same tradeoff as rail preload. - **Accuracy grades** (per JIS/DIN/ISO): from **C10/C7** (rolled, transport-grade, ±0.05 mm/300 mm class) up through **C5, C3, C1, C0** (ground, precision, down to a few µm/300 mm). Rolled screws are cheap and fine for general motion; ground screws are for positioning accuracy. THK, HIWIN, NSK, Bosch Rexroth, KSS, and Misumi cover the market. Leads (axial travel per revolution) run from ~1 mm (fine, high force, slow) to 25–50 mm (coarse, fast, lower force). ### Roller screws (planetary roller screws) A planetary roller screw replaces the balls with a set of threaded **rollers** that planet around the screw inside the nut. Many lines of contact instead of point contacts at discrete balls give: - **Much higher load capacity** for a given diameter (often 2–3×+ a ball screw), because contact is distributed across many roller threads. - **Much higher speed and acceleration** — no balls to recirculate and slam, so DN-type limits are higher; some run leads down to 1 mm at high rpm. - **Long life** — distributed contact and no recirculation impacts. - **Fine leads available** that ball screws struggle to make (e.g., 1–2 mm at high diameter). - **Efficiency ~80–90%** — a bit below ball screws because of more contact, but far above lead screws. The cost is several times a ball screw. Use roller screws where ball screws run out of headroom: electric press actuators replacing hydraulics, high-cycle servo presses, heavy fast pick-and-place, and aerospace/defense actuation. Makers: **Rollvis, Ewellix (formerly SKF), GSA, Creative Motion**. 
This is the technology quietly enabling the all-electric heavy actuators discussed in the [robot actuators guide](/posts/robot-actuators-ultimate-guide/).

### Speed from rpm and lead

The core conversion is trivial and you should have it memorized:

```
Linear speed v (mm/s)  = (motor rpm / 60) × lead (mm/rev)
Linear travel per rev  = lead (mm)
Thrust F (N)           = (2π × η × T_motor (N·m) × 1000) / lead_mm
  where η = screw efficiency (≈0.9 ball, ≈0.3–0.5 lead)

Example: NSK ground ball screw, lead = 10 mm, motor at 3000 rpm
  v = (3000 / 60) × 10 = 500 mm/s = 0.5 m/s
  With a 1.0 N·m servo and η = 0.9:
  F = (2π × 0.9 × 1.0 × 1000) / 10 ≈ 565 N thrust
```

Notice the lead trades speed for force directly: halve the lead and you double the thrust and halve the speed at the same rpm. That single choice, plus the motor's torque-speed curve, sets the axis envelope. We size the rotary side of this in the [servo motors guide](/posts/servo-motors-ultimate-guide/); the gear-ratio analog of "lead" is covered in the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/).

## Belt and rack-and-pinion drives

Screws are great until the stroke gets long. Screw cost, mass, critical speed, and buckling all scale badly with length, so past roughly 1.5–3 m you switch to a belt or a rack.

### Belt drives

A toothed belt over a driven pulley converts rotation to translation with the carriage clamped to the belt (or the belt fixed and the motor riding the carriage). Belt tooth profiles you'll meet:

- **GT2 / GT3 (2 mm, 3 mm pitch)** — curvilinear tooth, low backlash, the standard for small/medium motion and 3D printers. GT2 is everywhere in light automation.
- **HTD (3M, 5M, 8M)** — deeper curvilinear teeth, more power, used for larger axes.
- **AT (AT5, AT10, AT20)** — trapezoidal, steel- or aramid-corded, very high stiffness and force for industrial linear units; the choice when a belt axis must be reasonably rigid.
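For a toothed belt, the quantity that plays the role of screw lead is the travel per pulley revolution, which is tooth count times tooth pitch. The sketch below illustrates that relation under simple assumptions: the function names, the 20-tooth pulley, and the 600 rpm figure are hypothetical, and belt stretch and drive losses are ignored unless you pass an efficiency.

```python
import math


def belt_travel_per_rev_mm(pulley_teeth: int, pitch_mm: float) -> float:
    """Carriage travel per pulley revolution (the belt analog of screw lead)."""
    return pulley_teeth * pitch_mm


def belt_speed_mm_s(pulley_teeth: int, pitch_mm: float, rpm: float) -> float:
    """Linear speed, same form as the screw relation: (rpm / 60) * travel per rev."""
    return (rpm / 60.0) * belt_travel_per_rev_mm(pulley_teeth, pitch_mm)


def belt_thrust_n(torque_nm: float, pulley_teeth: int, pitch_mm: float,
                  efficiency: float = 1.0) -> float:
    """Ideal thrust from pulley torque: F = 2*pi*eta*T*1000 / travel_per_rev_mm."""
    travel_mm = belt_travel_per_rev_mm(pulley_teeth, pitch_mm)
    return 2.0 * math.pi * efficiency * torque_nm * 1000.0 / travel_mm


if __name__ == "__main__":
    # Hypothetical axis: 20-tooth GT2 pulley (2 mm pitch), 600 rpm, 1.0 N*m.
    print(belt_travel_per_rev_mm(20, 2.0))        # mm per revolution
    print(belt_speed_mm_s(20, 2.0, 600.0))        # mm/s
    print(round(belt_thrust_n(1.0, 20, 2.0), 1))  # N, lossless
```

A 20-tooth GT2 pulley moves 40 mm per revolution, four times the 10 mm lead in the ball-screw example, so the same rpm gives four times the speed and the same torque gives roughly a quarter of the thrust.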
Why belts: - **Speed.** Belt axes routinely run **3–10 m/s** and accelerate hard, because there's no screw whip or DN limit — only pulley rpm and belt dynamics. - **Long stroke, low cost.** Travel is limited only by belt length; a 5 m belt axis is cheap next to a 5 m ground ball screw. - **Low moving mass** if the motor is stationary and only the carriage and a length of belt move. Why not belts: - **Compliance.** A belt is a spring. Belt stretch under load gives the axis a finite stiffness, which shows up as positioning error, settling time, and a resonance you must keep below your control bandwidth. - **Accuracy.** Practical positioning error of a motor-side-encoded belt axis is ~50–200 µm depending on tension, length, and load. To do better, put a linear encoder on the *load* and close the loop there. - **Tension maintenance.** Belts stretch and need re-tensioning; over-tension shortens bearing life, under-tension causes tooth skip and backlash. Bosch Rexroth, Festo, Igus (drylin ZLW), Misumi, and Bishop-Wisecarver sell complete belt-driven linear units. The handheld rule: **belt for speed and reach, screw for force and accuracy.** ### Rack-and-pinion A rack is a straight gear; a pinion on the motor output rolls along it. Racks bolt end-to-end, so the axis can be **arbitrarily long** — tens of meters on a machine-tool gantry or a robot 7th-axis track. - **Unlimited stroke** by tiling rack segments (ground racks have matched ends so the tooth pitch is continuous across joints). - **High thrust and high speed** simultaneously, limited mainly by the pinion and gearbox. - **Stiffness** far better than a belt of equal length — it's a gear mesh, not a spring. - **Backlash** is the catch. Single-pinion rack-and-pinion has gear backlash. Fixes: a **preloaded pinion** against the rack, or a **dual-pinion / electronic-preload** drive where two motors (or one split path) push against each other to take up lash, or a master/slave torque-biased pair on a servo axis. 
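The rack-and-pinion relations can be sketched the same way as a screw or belt: one pinion revolution moves the carriage one pitch circumference, and thrust is pinion torque over pitch radius. This is a minimal sketch, not a vendor sizing method. The function names and example numbers are hypothetical, and the backlash model is an assumption that lumps all gearbox and mesh lash into a single angle at the pinion.

```python
import math


def rack_travel_per_rev_mm(pitch_diameter_mm: float) -> float:
    """Carriage travel per pinion revolution = pitch circumference."""
    return math.pi * pitch_diameter_mm


def rack_speed_m_s(pitch_diameter_mm: float, pinion_rpm: float) -> float:
    """Linear speed in m/s for a given pinion speed."""
    return rack_travel_per_rev_mm(pitch_diameter_mm) * pinion_rpm / 60.0 / 1000.0


def rack_thrust_n(pinion_torque_nm: float, pitch_diameter_mm: float) -> float:
    """Ideal thrust: torque at the pinion divided by pitch radius (losses ignored)."""
    radius_m = pitch_diameter_mm / 2.0 / 1000.0
    return pinion_torque_nm / radius_m


def lost_motion_um(angular_backlash_arcmin: float, pitch_diameter_mm: float) -> float:
    """Linear lost motion from angular lash at the pinion (assumed lumped model)."""
    angle_rad = math.radians(angular_backlash_arcmin / 60.0)
    return angle_rad * (pitch_diameter_mm / 2.0) * 1000.0


if __name__ == "__main__":
    # Hypothetical axis: 60 mm pitch-diameter pinion, 300 rpm, 30 N*m at the pinion.
    print(round(rack_speed_m_s(60.0, 300.0), 3))  # m/s
    print(round(rack_thrust_n(30.0, 60.0), 1))    # N
    print(round(lost_motion_um(3.0, 60.0), 1))    # um of reversal error
```

A larger pinion buys speed per rpm but gives up thrust per N·m and magnifies any angular lash at the rack, which is one reason preloaded or dual-pinion drives matter on positioning axes.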
Helical racks are quieter and stronger than straight; ground racks (Güdel, Atlanta, Wittenstein, Apex) hit DIN quality grades that matter for positioning. Rack-and-pinion is the default for the long overhead axis of large gantries and for the linear track that carries an [industrial robot arm](/posts/industrial-robot-arms-ultimate-guide/) along a production line. ## Linear motors: ironcore vs ironless A linear motor is a rotary [servo/BLDC motor](/posts/brushless-dc-motors-bldc-ultimate-guide/) cut open and laid flat. The stator becomes a **track** of permanent magnets; the rotor becomes a **forcer** (the moving coil assembly) that produces thrust directly when driven with [field-oriented control](/posts/motor-controllers-foc-ultimate-guide/). There is no screw, belt, or gear — the electromagnetic force *is* the thrust. Consequences, good and bad: - **Zero backlash, zero mechanical wear path.** Nothing meshes or threads. The only wear is the guide. - **Huge acceleration and bandwidth.** Direct drive means no reflected screw inertia and no compliant transmission between motor and load. Accelerations of **5–10 g** are routine, and some short-stroke stages exceed 20 g. Settling times and bandwidth crush any screw axis. - **Smoothness** limited by cogging and force ripple, not by a nut or belt. - **No reduction.** A rotary motor + screw gives you a built-in mechanical advantage (the lead acts like a gear ratio); a linear motor has none, so it makes peak force purely from current — and from heat. - **Not self-locking.** Cut power and there's nothing holding position; vertical axes need a counterbalance or a brake. - **Feedback must be a full-stroke linear encoder** (optical or magnetic scale). There's no rotary encoder to count screw turns; commutation and position both come from the linear scale, so encoder quality directly sets your resolution and smoothness. 
- **Heat goes into the machine.** The forcer dissipates I²R losses right at the work zone; high-duty linear-motor stages often need liquid cooling to keep thermal growth from eating accuracy. ### Ironcore vs ironless The big architectural fork: | | Ironcore (iron-core) | Ironless (air-core / U-channel) | |---|---|---| | Coil structure | Coils wound on a laminated iron core | Coils in epoxy, no iron, between two magnet rows | | Force density | High (iron concentrates flux) | Lower for same size | | Cogging / force ripple | Present (iron teeth attract magnets) | Essentially zero | | Magnetic attraction to track | Large normal force (often > thrust) loads the guide | Zero net attraction | | Stiffness / thrust | Best | Moderate | | Best for | High-thrust, stiff, machine-tool and press axes | Smooth constant-velocity scanning, metrology, light stages | **Ironcore** linear motors make the most force per size because the iron concentrates magnetic flux — but that same iron is strongly attracted to the magnet track (a normal force that can exceed the thrust), which preloads the guide bearings and can cause cogging. Use ironcore when you need thrust and stiffness and can carry the attraction load. **Ironless** motors put the coils in epoxy with no iron, sandwiched in a U-channel of magnets. No iron means **no cogging, no force ripple, and zero net attraction** to the track — the smoothest possible motion and no extra bearing load. The price is lower force density. Use ironless for constant-velocity scanning (wafer inspection, laser machining, metrology) where smoothness beats raw force. Players: **Aerotech** (precision stages, ironless and ironcore), **Beckhoff** (AX5000/linear, and the XTS/XPlanar transport systems), **ETEL** (high-end direct drive), **Kollmorgen** (IC/ICD ironcore), **Tecnotion, LinMot** (tubular linear motors — a moving magnet rod through a stator, a clean form factor for press/insertion). 
Tubular linear motors deserve a note: the coil wraps fully around the magnet rod, so flux is used efficiently and there's no net side load — a nice middle ground for short-stroke, high-force insertion and pressing.

> When to actually choose a linear motor: when you need acceleration or bandwidth a screw can't give, *and* the stroke is short-to-medium, *and* you can pay for the magnets, the linear encoder, the controller, and the thermal management. Otherwise a ball screw is cheaper, self-contained, and has a built-in reduction.

## The precision / speed / force / stroke tradeoff

Every drive technology is a different bet on four conflicting axes: precision, speed, force, and stroke length. No technology wins all four, and the honest comparison is the most useful table in this guide:

| Drive | Precision (positioning) | Top speed | Force/thrust | Practical stroke | Backlash | Efficiency | Self-locking |
|---|---|---|---|---|---|---|---|
| **Lead screw** | Low–medium (10–50 µm) | Low (≤0.3 m/s) | Medium | ≤1 m | Yes (or anti-backlash) | 20–50% | Often yes |
| **Ball screw** | High (1–20 µm) | Medium (0.5–2 m/s) | High | ≤~3 m | Near-zero (preload) | 90–95% | No |
| **Roller screw** | High (1–10 µm) | High (to ~2+ m/s) | Very high | ≤~3 m | Near-zero (preload) | 80–90% | No |
| **Belt** | Low–medium (50–200 µm) | Very high (3–10 m/s) | Medium | 10+ m | Low (toothed) | ~90% | No |
| **Rack-and-pinion** | Medium (20–100 µm) | High (to ~5 m/s) | Very high | Unlimited | Yes (dual-pinion preload) | ~90% | No |
| **Linear motor** | Very high (<1 µm possible) | Very high (3–10 m/s) | Medium–high | Short–medium (encoder-limited) | Zero | N/A (direct) | No |

Read it as a decision aid, not gospel — every cell depends on size, grade, and how you close the loop. But the shape is real:

- **Want micron accuracy and high force in a compact axis?** Ground ball screw on profile rails. The default.
- **Want speed and long reach?** Belt (medium reach) or rack-and-pinion (any reach).
- **Want the highest dynamics and zero backlash and you'll pay for it?** Linear motor. - **Want it cheap, low-duty, and self-holding?** Lead screw. - **Want to push very hard, very fast, for millions of cycles?** Roller screw. ## Architectures: Cartesian, gantry, H-bot, CoreXY How you stack axes matters as much as which drive you pick. The common multi-axis arrangements: **Stacked Cartesian (serial XY/XYZ).** Each axis carries the next: X rides on Y rides on Z (or some order). Simple, intuitive, and every axis is independent — but the lower axes carry the **mass of all the axes above them**, including their motors. Moving mass grows fast, so dynamics suffer for the proximal axes. Standard for machine tools, dispensing, and most pick-and-place where the payload is modest. **Gantry (bridge).** A bridge spans the work and moves over it, often driven by **two parallel motors** (one per side) on the long axis. Stiff, large work envelope, and the long axis is usually rack-and-pinion or dual ball screws. The catch is **gantry skew** — the two sides must stay synchronized or the bridge racks (twists); this needs either a mechanical cross-shaft or a tuned **dual-drive gantry control** with encoders on both sides and a controller that fights yaw. Standard for large routers, laser cutters, and gantry robots. **H-bot.** A single belt routed in an "H" so that **two stationary motors** drive both X and Y; the tool head carries no motor mass. Moving X = both motors same direction; moving Y = both motors opposite. Brilliant low-moving-mass idea, but the H routing applies a **racking moment** to the gantry that the frame must resist, which limits stiffness and accuracy at speed. **CoreXY.** A refinement of H-bot with two belts crossed symmetrically so the racking moment cancels. Same benefit (two stationary motors, light head) without the H-bot's twisting load. Dominant in fast 3D printers and light gantries. 
The cost is belt routing complexity and the compliance of long belt loops. | Architecture | Moving mass | Stiffness | Drive typically | Best for | |---|---|---|---|---| | Stacked Cartesian | High (axes stack) | High | Ball screw / belt | Machine tools, dispensing, general | | Gantry (dual-drive) | Medium | Very high | Rack-and-pinion / dual screw | Large routers, laser, gantry robots | | H-bot | Low (head only) | Low–medium | Single belt | Fast light heads (budget) | | CoreXY | Low (head only) | Medium | Two belts | Fast 3D printers, light gantries | > Architecture rule: minimize moving mass on the fast axes and put stiffness where the tool point is. A light CoreXY head accelerates beautifully but flexes under cutting load; a stacked ball-screw machine is rigid but slow to move its proximal axes. Match the architecture to whether your job is fast-and-light or slow-and-stiff. The kinematic mapping from motor coordinates to tool coordinates (especially for H-bot/CoreXY, where motion is a linear combination of both motors) is exactly the kind of transform handled in the [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/). ## Sizing: load, moment, life, critical speed, buckling This is where axes are won or lost. Five checks, each a separate limit, and the smallest resulting size is rarely the right one. ### 1. Load and moment on the guide Resolve the payload (including its offset from the carriage center, and dynamic forces from acceleration) into a load on each guide block: a vertical/horizontal force **plus** the three moments — roll (Mr), pitch (Mp), yaw (My). An offset or cantilevered payload dumps its weight into moment, and moment loads divide unevenly across the blocks (a two-block carriage sees one block loaded more under a pitching moment). 
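The uneven split under a pitching moment can be sketched with simple statics. The function below is an illustration, not a catalog method, and every name in it is hypothetical. It assumes a carriage with one front and one rear bearing station spaced along the travel axis, a laterally centered payload, negligible carriage mass, and drive thrust applied at rail height.

```python
G = 9.81  # m/s^2


def block_loads_n(mass_kg: float, offset_ahead_mm: float, cg_height_mm: float,
                  accel_m_s2: float, block_spacing_mm: float) -> tuple[float, float]:
    """Return (front, rear) vertical block loads in newtons.

    offset_ahead_mm: payload CG ahead of the carriage center (+ toward travel).
    cg_height_mm:    payload CG height above the rail plane.
    accel_m_s2:      carriage acceleration (+ forward, negative when braking).
    A negative result means that block is being pulled upward (lift-off load).
    """
    weight_n = mass_kg * G
    # Weight ahead of center loads the front block; forward acceleration of a
    # high CG pitches the carriage back and loads the rear block instead.
    pitch_moment_nmm = weight_n * offset_ahead_mm - mass_kg * accel_m_s2 * cg_height_mm
    front = weight_n / 2.0 + pitch_moment_nmm / block_spacing_mm
    rear = weight_n / 2.0 - pitch_moment_nmm / block_spacing_mm
    return front, rear


if __name__ == "__main__":
    # Hypothetical carriage: 20 kg payload, 150 mm ahead, 100 mm high, 200 mm spacing.
    print(block_loads_n(20.0, 150.0, 100.0, 0.0, 200.0))    # at rest
    print(block_loads_n(20.0, 150.0, 100.0, -10.0, 200.0))  # braking at 10 m/s^2
```

At rest the front block carries about 245 N and the rear block sees about 49 N of lift-off load, even though the payload weighs only about 196 N. Braking at 10 m/s² pushes the front block to about 345 N, which is why the worst point in the duty cycle, not the static case, sizes the block.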
Check the **combined load factor** the catalog specifies:

```
Load factor = P/C + Mr/Mr_rated + Mp/Mp_rated + My/My_rated ≤ 1
  (where P is equivalent direct load, C the dynamic rating;
   must be ≤ 1, with margin, for the chosen block)
```

Then apply the static safety factor `fs = C0 / P_max` (1.5–3 smooth, 3–5 with shock).

### 2. L10 bearing life

The fatigue life of a ball guide or ball screw follows the standard rolling-bearing power law. For ball elements the exponent is **3** (cube); for roller elements it's **10/3**:

```
Linear guide L10 (km) = (C / P_equiv)^3 × 50     [THK basis, 50 km reference]
Ball screw L10 (rev)  = (Ca / Fa_equiv)^3 × 1e6

  C / Ca   = dynamic load rating
  P_equiv  = cube-mean equivalent load over the duty cycle
  Fa_equiv = cube-mean equivalent axial screw load

Example: rail block C = 30 kN, equivalent load P = 6 kN
  L10 = (30/6)^3 × 50 = 125 × 50 = 6250 km of travel
  At 0.5 m/s and a 50% duty over 16 h/day:
  daily travel ≈ 0.5 × 3600 × 16 × 0.5 / 1000 ≈ 14.4 km/day
  L10 ≈ 6250 / 14.4 ≈ 434 days → ~1.2 years before 10% fail
```

Use the **cube-mean** load over the real duty cycle, not the peak and not the simple average — the cube weighting means high-load segments dominate. A factor-of-two load error becomes an 8× life error.

### 3. Critical speed (screw whip)

A rotating screw is a shaft that whirls when its rotational speed approaches its first bending natural frequency — "whip." It depends on diameter, **unsupported length**, and end fixity (the support condition multiplier):

```
n_critical (rpm) ≈ K × f × (d_root_mm / L_mm²) × 1e7

  d_root = screw root diameter (mm)
  L      = unsupported length between bearings (mm)
  f      = end-fixity factor (fixed-free ~0.36, supported-supported ~1.0,
           fixed-supported ~1.47, fixed-fixed ~2.23; values per maker)
  K      = material constant for steel (~10 in this normalized form)

Operate at ≤ 0.8 × n_critical.

Example: d_root = 18 mm, L = 1500 mm, fixed-supported (f ≈ 1.47)
  n_crit ≈ 10 × 1.47 × 18 / 1500² × 1e7 ≈ 1180 rpm, so run at ≤ ~940 rpm
  n_crit ∝ 18 / 1500²  → critical speed drops with the SQUARE of length.
  Doubling the length quarters the safe rpm.
```

The square-of-length dependence is the reason long ball screws hit a wall: a 3 m screw may be limited to a few hundred rpm before whip, capping your speed far below the motor's capability. The fixes are larger diameter (critical speed rises only linearly with root diameter, and mass and DN rise with it), better end fixity, or — the usual answer past ~2–3 m — switch to a belt or rack.

### 4. DN value (ball recirculation limit)

Independent of whip, the balls themselves have a speed limit set by recirculation dynamics:

```
DN = screw_nominal_diameter_mm × rpm
Standard ball screws: keep DN ≤ ~70,000 (internal return)
                      to ~100,000+ (end-cap, high-speed nuts)
```

Exceed DN and the balls jam or wear at the return path even if you're below critical speed. High-lead and high-speed nut designs raise the limit; roller screws, which have no recirculating balls, run to considerably higher limits.

### 5. Column buckling

A screw in compression (pushing a load away from the fixed bearing) can buckle like a column. The critical buckling load follows Euler, again with end fixity:

```
F_buckling (N) ≈ m × (d_root_mm)^4 / (L_mm)² × constant
               ∝ d_root^4 / L²   (Euler column, end-fixity dependent)
Operate at ≤ 0.5 × F_buckling (safety factor ~2).
```

Buckling matters at full extension on a long, slender, vertically-loaded or heavily-thrusting screw. Like critical speed, it punishes length (1/L²) and rewards diameter (here d⁴, even more strongly). If the screw must push hard at full reach, size for buckling first.

> Sizing summary: run all five checks against the worst point in the duty cycle. The binding constraint moves with the job — short heavy axes are limited by load/life and buckling; long fast axes by critical speed and DN; cantilevered payloads by moment capacity. Never stop at "the thrust is enough."

## Accuracy, repeatability and straightness

Three different numbers that get conflated constantly, and a controller fixes only some of them.
- **Repeatability** — return to the *same* commanded position from the same direction, measured as the scatter band. Usually the best number on the spec sheet (1–10 µm for a good screw axis, sub-µm for a linear-motor stage). It's what matters for pick-and-place: hit the same spot every time. - **Accuracy (positioning accuracy)** — how close the *actual* position is to the *commanded absolute* position over the full stroke. Worse than repeatability, because it includes screw lead error, thermal growth, and Abbe error. Improved by error mapping/compensation in the controller and by closing the loop on a linear encoder. - **Bidirectional repeatability** — repeatability including *both* approach directions. This exposes **backlash and reversal error** (the lost motion when you reverse direction). A unidirectional spec hides backlash; always read whether a number is uni- or bi-directional. - **Straightness and flatness** — how much the carriage deviates from a perfect line vertically and horizontally as it travels. This comes from the **rail set and its mounting**, not the drive, and **no amount of axis control fixes it** unless you have multi-axis compensation. It's set by rail accuracy grade, mounting surface flatness, and how carefully you align the parallel rails. Two error sources worth naming: - **Abbe error** — angular error of the carriage multiplied by the offset between the measurement scale and the actual tool point. A 10 µrad pitch with a 100 mm tool offset is 1 µm of position error. Keep the feedback scale close to the work, and keep the carriage angularly stiff. - **Thermal growth** — a steel screw grows ~11 µm per meter per °C. A 1 m screw warming 5 °C from its own friction grows ~55 µm — larger than the screw's grade error. 
Ground-screw machines that need µm accuracy either control temperature, use a cooled hollow screw, or compensate, and many high-end machines move the feedback off the screw and onto a glass/steel linear scale precisely to dodge thermal screw growth. > Spec-reading rule: demand *bidirectional* repeatability and *full-travel* accuracy, ask what reference temperature they're at, and treat straightness as a separate line item set by the rails. A screw's "C3 grade" tells you about lead error, not about whether your gantry tracks straight. ## Lubrication, sealing and contamination The fastest way to turn a 10-year L10 axis into a one-year axis is to starve or contaminate it. This section is where field reliability actually lives. **Lubrication.** Recirculating-ball guides and ball screws need a lubricant film between ball and raceway — grease (NLGI 0–2, lithium or urea base) for most, oil for high-speed or high-temperature. Consequences of getting it wrong: - **Starvation** breaks the elastohydrodynamic film; metal-to-metal contact spalls the raceway and L10 collapses. Catalog L10 *assumes* adequate lubrication. - **Relube intervals** are specified in travel distance or hours; many blocks have grease nipples or accept an auto-luber. Honor the schedule — "lubed for life" blocks have a finite life and that life is shorter than the metal's. - **Speed/temperature** push you from grease to oil. High-speed linear-motor stages and fast ball screws often use oil-air or circulating oil. **Sealing and wipers.** Every block ships with end seals and often side/under seals; you can add **double seals, scrapers, and metal scrapers** for hard chips. Seals add friction (relevant for low-thrust belt/ironless axes) but multiply life in dirty environments. A ball screw exposed to swarf without a wiper or bellows is a wear experiment with a known bad ending. 
**Contamination control, by environment:** - **Machine-tool / cutting** — bellows or telescoping covers over the rails and screw; metal scrapers; positive coolant management. Chips and grit are the enemy. - **Washdown / food** — stainless or coated rails, stainless screws, food-grade grease, or go to **Igus drylin** polymer guides that run dry and tolerate water. - **Cleanroom / semiconductor** — low-particulate grease, special seals, sometimes **positive-pressure purge** (clean dry air into the carriage) to keep particles out, and ironless linear motors to avoid debris-attracting fields. - **Vacuum** — special low-outgassing lubricants and materials; this is a specialist sub-field. > Field rule: the catalog L10 is a clean-and-lubricated number. In a real dirty machine, the actual life is the catalog L10 multiplied by how seriously you took sealing and relube. Most "premature bearing failures" are lubrication or contamination failures wearing a fatigue costume. ## A selection workflow Put it together into a repeatable procedure. Work top-down; don't start by picking a screw. 1. **Define the duty cycle.** Stroke, payload (and its center-of-gravity offset), required move time / speed / acceleration, cycles per day, environment (clean, chips, washdown, vacuum), and required accuracy/repeatability. Everything downstream is sized against the *worst* point of this, not the average. 2. **Pick the architecture.** Stacked Cartesian for general work, gantry for large stiff envelopes, CoreXY/H-bot for fast light heads. This sets which axes carry which masses and where you need stiffness vs. dynamics. 3. **Choose the drive per axis** from the [tradeoff table](#tradeoff): - Short, accurate, forceful → **ball screw** (or roller screw if very high force/cycles). - Long stroke, speed matters more than microns → **belt** (to ~10 m) or **rack-and-pinion** (any length). - Highest dynamics, zero backlash, budget allows → **linear motor** (ironcore for force, ironless for smoothness). 
- Low-duty, cheap, self-holding vertical → **lead screw**. 4. **Choose the guide.** Profile rail (ball) is the default; roller rail for maximum stiffness/cutting; round shaft (Thomson) for misalignment tolerance; Igus drylin for dry/washdown/light; cam-roller (Hepco/Bishop-Wisecarver) for fast debris-tolerant long travel; air bearing for metrology. 5. **Size the guide:** combined load factor ≤ 1 with margin, static safety factor `fs` per environment, then **L10** in km against the cube-mean load. Pick rail size and preload class (light-to-medium default; C2 only if stiffness justified) and accuracy grade to match required straightness — **on both rails of a pair**. 6. **Size the screw (if used):** lead from the speed/force tradeoff (`v = rpm/60 × lead`), accuracy grade from required positioning, then verify **critical speed** (≤0.8× n_crit), **DN** (≤~70k–100k), **buckling** (≤0.5× F_buckling), and screw **L10** in revolutions. If any fails on a long axis, go bigger diameter, better end fixity, or switch to belt/rack. 7. **Size the motor and reduction** against the reflected inertia and the torque-speed curve — see the [servo motors guide](/posts/servo-motors-ultimate-guide/) and, if you're adding a gearhead, the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/). Check the inertia ratio (load reflected to motor / rotor inertia) lands in a controllable range. 8. **Decide feedback.** Motor-side encoder is cheapest and fine when the transmission is stiff (ball screw); put a **linear encoder on the load** when the transmission is compliant (belt) or when you need accuracy beyond the screw's lead error and thermal growth. 9. **Specify sealing, lubrication, and covers** for the environment, and write the relube schedule into the maintenance plan. This is the difference between catalog L10 and field L10. 10. **Prototype and measure** bidirectional repeatability, full-travel accuracy, and straightness on the real machine. 
The spec sheet is a starting point; the assembled, mounted, loaded axis is the truth. Follow that order and you'll avoid the classic failures: the over-spec'd screw on under-spec'd rails, the belt axis that can't hold position, the long ball screw that whips at half its target speed, and the beautiful linear-motor stage that cooks itself because nobody planned the cooling. ## Frequently asked questions **When should I use a ball screw versus a linear motor?** Default to a ball screw: it's cheaper, self-contained, has a built-in mechanical reduction (the lead), and a single rotary encoder closes the loop. Reach for a linear motor only when you need acceleration or bandwidth the screw can't deliver, the stroke is short-to-medium, backlash must be truly zero, and you can pay for the magnet track, the full-stroke linear encoder, the drive, and the thermal management. Most machines never cross that threshold. **Why are ball screws so much more efficient than lead screws?** Rolling versus sliding. A ball screw's load rides on recirculating balls (rolling friction, ~90–95% efficient); a lead screw's nut slides directly on the thread (sliding friction, ~20–50%). The flip side is that lead-screw friction makes the screw self-locking, so it holds a vertical load with power off — which a ball screw won't do without a brake. **What is preload and why does it matter on both rails and screws?** Preload is built-in internal load (oversized balls, or a double nut loaded against itself) that removes clearance so the element is stiff and backlash-free even at zero external load. The cost is friction, heat, and shorter life, because the balls see preload *plus* external load. Use light-to-medium preload by default; go heavy only when you've justified the stiffness, and use light/zero preload for low-friction or smooth-scanning axes. 
**What does the DN value limit, and how is it different from critical speed?** DN (`diameter_mm × rpm`) limits the *balls'* recirculation dynamics — exceed it and balls jam or wear at the return path. Critical speed is a *shaft* phenomenon — the screw whirls when its rpm nears its bending natural frequency, which scales with 1/length². They're independent: a short fat screw can be DN-limited while a long thin one is critical-speed-limited. Check both, plus buckling. **How long should a profile rail or ball screw last?** It's an L10 fatigue number: 90% of a population survive the calculated travel (rails, in km against a 50 km basis) or revolutions (screws). Computed from the cube-mean load over your real duty cycle, good axes reach years of operation. But the catalog L10 assumes clean and lubricated — starvation or contamination can cut field life to a fraction, so most "early failures" are really lube/seal failures. **Belt or ball screw for a long horizontal axis?** If the stroke is past roughly 1.5–3 m and you care more about speed than microns, use a belt — it avoids the screw's critical-speed and buckling penalties (both ~1/length²) and runs 3–10 m/s cheaply. If you need micron positioning and high stiffness over that length, a belt won't give it; either accept a large ball screw with good end fixity or close the loop on a load-side linear encoder. Past a few meters, rack-and-pinion beats both. **What's the difference between ironcore and ironless linear motors?** Ironcore coils are wound on iron, giving high force density but cogging and a strong magnetic attraction to the track that preloads the guide. Ironless coils sit in epoxy with no iron — no cogging, no force ripple, zero net attraction, the smoothest motion possible — but lower force density. Choose ironcore for thrust and stiffness, ironless for smooth constant-velocity scanning and metrology. 
**Why does my machine hit the right position repeatably but the wrong absolute coordinate?** That's the difference between repeatability and accuracy. Repeatability (returning to the same spot) is set by the mechanics' consistency; absolute accuracy adds screw lead error, thermal growth (~11 µm/m/°C for steel), and Abbe error. Fix accuracy with controller error mapping or by moving feedback to a load-side linear scale. Repeatability you mostly buy in the hardware. **Do I need a linear encoder, or is the motor encoder enough?** A motor-side encoder is fine when the transmission between motor and load is stiff and low-backlash — a ground ball screw qualifies. Put a linear encoder on the load when the transmission is compliant (belts stretch, long screws wind up) or when you need accuracy beyond the screw's lead error and thermal growth. The encoder also dodges thermal screw growth by measuring the actual carriage, not the screw turns. **What causes gantry skew and how do I prevent it?** On a dual-driven gantry, the two sides driving the long axis can get out of sync and twist the bridge (racking it about the vertical axis). Prevent it with a mechanical cross-shaft tying both sides, or — more common now — a dual-drive servo control with an encoder on each side and a controller term that actively cancels yaw. Without one of those, the bridge binds and the position error grows with how far the sides drift. **When is rack-and-pinion the right call over a screw or belt?** When the stroke is measured in meters and you need both speed and high thrust with more stiffness than a belt — large gantries, machine-tool long axes, and the linear track that carries a robot arm down a line. Racks tile end-to-end for unlimited length. Fight the gear backlash with a preloaded pinion or a dual-pinion (electronic-preload) drive. **Can polymer plain bearings (Igus drylin) replace ball rails?** For the right job, yes. 
Drylin runs dry (no lube), is light, quiet, corrosion-proof, and cheap, and it shrugs off washdown and dust that destroy ball guides. The tradeoffs are lower load capacity and stiffness, some stick-slip, and a wear allowance instead of a fatigue life. Use it for light, low-precision, dirty, or wet axes; keep ball rails for load, stiffness, and µm accuracy. ## Changelog - **2026-06-17** — Initial publication. --- # Brushless DC Motors (BLDC) for Robotics: The Ultimate Guide URL: https://blog.robo2u.com/posts/brushless-dc-motors-bldc-ultimate-guide/ Published: 2026-06-16 Updated: 2026-06-20 Tags: bldc, brushless-motors, pmsm, motors, kv-rating, esc, foc, drone-motors, robotics-hardware, guide Reading time: 34 min > A robotics engineer's deep dive into brushless DC motors: Kv vs Kt, trapezoidal vs FOC commutation, sensored vs sensorless, gimbal/QDD actuators, datasheet math, and how to size a BLDC for a robot joint or drone. A brushless DC motor is the part of your robot that turns electrons into torque. Everything upstream of it — the battery, the ESC, the FOC controller, the encoder — exists to feed it correctly. Everything downstream — the gearbox, the linkage, the wheel or the leg — exists because the raw motor by itself almost never matches the load. Get the motor wrong and no amount of clever control firmware saves you. Brushless DC (BLDC) motors are now the default for almost everything that moves under power in modern robotics: drone props, quadruped legs, robot-arm joints, e-bike hubs, gimbals, and the direct-drive wheels on warehouse AMRs. The reason is simple — you removed the one part of a brushed motor that wears out (the commutator and brushes) and moved commutation into silicon, which gets cheaper and smarter every year. 
**The take**: The two numbers that decide whether a BLDC fits your robot are its Kv rating (RPM per volt, which is just the inverse of its torque constant Kt) and its continuous thermal limit (how much current you can push before the windings cook). Everything else — pole count, sensored vs sensorless, six-step vs FOC, inrunner vs outrunner — is a consequence of those two constraints and the load you're driving. Pick a low-Kv motor when you want torque at low speed (robot joints, legs), a high-Kv motor when you want speed (props, wheels), and let the controller and gearbox close the gap. If you remember nothing else: low Kv = high torque per amp, and the continuous current rating is a thermal number, not a magnetic one. Companion reading: [servo motors](/posts/servo-motors-ultimate-guide/), [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), [encoders](/posts/encoders-ultimate-guide/), and [robot actuators](/posts/robot-actuators-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What a BLDC is and why brushless won](#what-is-bldc) 3. [BLDC vs PMSM vs brushed DC vs stepper](#motor-types) 4. [Motor anatomy: stator, rotor, poles and slots](#anatomy) 5. [The Kv rating, decoded](#kv-rating) 6. [Electronic commutation: six-step vs FOC](#commutation) 7. [Rotor position sensing: Halls, encoders, sensorless](#position-sensing) 8. [Reading a BLDC datasheet](#datasheet) 9. [Torque, speed, power and the motor curve](#motor-curve) 10. [Gimbal motors, direct-drive and QDD actuators](#gimbal-qdd) 11. [Drone propulsion BLDCs vs robot-joint BLDCs](#drone-vs-joint) 12. [Cooling, thermal management and duty cycle](#thermal) 13. [Selecting a BLDC for a robot](#selection) 14. [Frequently asked questions](#faq) ## Key takeaways - A BLDC replaces the mechanical commutator of a brushed motor with electronic commutation in an ESC or FOC controller. 
No brushes means no brush wear, no sparking, less EMI, and lifetimes set by bearings, not by a wearing carbon block. - **Kv (RPM/V) is the inverse of the torque constant Kt (N·m/A).** A high-Kv motor spins fast but makes little torque per amp; a low-Kv motor spins slow but makes lots of torque per amp. They are the same number wearing different units. - Typical BLDC electrical efficiency is 80–90% at the design point; large industrial servomotors hit 90–94%, tiny drone motors at full throttle often drop into the 70s. - "BLDC" and "PMSM" describe nearly the same hardware. The honest distinction is the back-EMF waveform (trapezoidal vs sinusoidal) and how you commutate it (six-step vs FOC), not two different motor species. - The continuous current/torque rating is a **thermal** limit, not a magnetic one. Peak ratings (often 2–4× continuous) are valid only for seconds before the windings overheat. - Electrical speed = mechanical speed × pole pairs. A 14-pole (7 pole-pair) drone motor turns its field 7× faster than the shaft, which is why high-pole-count motors stress ESC commutation timing. - FOC (field-oriented control) gives smooth torque, full torque at zero speed, and quiet operation. Six-step/trapezoidal is simpler and fine for props that always spin fast. Use FOC for joints, six-step is acceptable for propulsion. - Sensored control (Hall sensors or an encoder) is mandatory for smooth low-speed and zero-speed torque. Sensorless back-EMF estimation is cheap and fine above a few hundred RPM but cannot hold a joint still under load. - Outrunners (rotating can) give high torque and low Kv in a short package — ideal for props and gimbals. Inrunners (rotating inner shaft) give high speed and low rotor inertia — ideal for geared joints and tools. - Quasi-direct-drive (QDD) actuators — a low-Kv gimbal-style motor plus a 6:1–10:1 single-stage planetary and FOC — are why agile legged robots exist. 
They give torque density with backdrivability and torque sensing without a load cell. - Real parts worth knowing: Maxon EC/ECX (industrial), T-Motor and iPower (drone/gimbal), KDE Direct (heavy-lift props), ODrive and mjbots moteus (open FOC drives for robot joints), Hobbywing (drone/RC ESCs). - Size the motor from the load's torque-speed point plus a thermal margin, pick voltage to land Kv·V near your top speed, then choose the sensor based on how slow you need usable torque. ## What a BLDC is and why brushless won A brushed DC motor puts the magnets on the outside (stator) and the windings on the spinning rotor. To keep torque pointing the right way as the rotor turns, it uses a mechanical commutator: a segmented copper ring on the shaft, wiped by spring-loaded carbon brushes that physically switch which coil is energized. Elegant, self-contained, and the source of every problem brushed motors have. A BLDC flips the topology. The permanent magnets go on the rotor, the windings go on the stationary stator, and there is no commutator at all. Instead, an external controller — an ESC or a FOC drive — energizes the stator coils in sequence, electronically, by reading or estimating where the rotor is. The motor is "brushless" because the commutation moved out of the motor and into silicon. That single change buys a lot: - **No brush wear.** Brushed motors die when the brushes wear down — typically a few hundred to a couple thousand hours of continuous duty. A BLDC's lifetime is set by its bearings, which can run tens of thousands of hours. - **No sparking.** Brush commutation arcs. That arcing is electrical noise (EMI), a fire risk in dusty or flammable environments, and a hard no for vacuum or explosive atmospheres. BLDCs don't arc. - **Better power density.** Putting the windings on the stationary outer body means you can conduct heat out of the windings directly into the housing instead of trapping it in a spinning rotor. 
So you can push more current through a smaller motor. - **Higher efficiency.** No brush friction, no commutator IR losses. A good BLDC runs 80–90% efficient; the brushed equivalent loses several points to brush drag and contact resistance. - **Cleaner control.** Because commutation is electronic, you can do field-oriented control, regenerative braking, precise torque control, and silent operation — none of which a brushed commutator can do well. The cost is that a BLDC is useless without its controller. A brushed motor runs off a battery and a switch. A BLDC needs three half-bridges, gate drivers, current sensing, and firmware that knows the rotor angle. That complexity used to be expensive; in 2026 a capable FOC drive costs less than the motor it controls, which is exactly why brushless won. > Rule: if a motor will run more than a few hundred hours, or needs precise torque, or runs near anything flammable, it should be brushless. The only reason to still spec a brushed motor in 2026 is cost on a throwaway toy. ## BLDC vs PMSM vs brushed DC vs stepper Engineers argue about "BLDC vs PMSM" more than the distinction deserves. Physically they are almost the same machine: three-phase stator windings, permanent-magnet rotor, electronic commutation. The real difference is two things — the shape of the back-EMF waveform, and how you choose to drive it. **Back-EMF** is the voltage a spinning motor generates on its own terminals. Its waveform shape is set by how the windings and magnets are arranged: - **Trapezoidal back-EMF** → conventionally called **BLDC**. The waveform has flat tops. It's a natural fit for six-step (trapezoidal) commutation, where you energize two of three phases at a time. Concentrated windings produce this. - **Sinusoidal back-EMF** → conventionally called **PMSM** (permanent magnet synchronous motor). Distributed windings and shaped magnets produce a clean sine. This is what FOC wants. In practice the line is blurry. 
Most "drone BLDC" motors have a back-EMF that's neither perfectly trapezoidal nor perfectly sinusoidal, and modern FOC controllers drive them sinusoidally regardless. So when someone runs a "BLDC" motor under FOC, they are operating it as a PMSM. The marketing label on the box rarely matches the control strategy.

Here's how the four common DC-ish motor types compare for robotics:

| Property | Brushed DC | Stepper | BLDC (trapezoidal) | PMSM (sinusoidal) |
|---|---|---|---|---|
| Commutation | Mechanical | Open-loop step sequence | Electronic, 6-step | Electronic, FOC |
| Controller needed | Switch / H-bridge | Step driver | ESC | FOC drive |
| Position feedback | None required | None (open-loop) | Halls or sensorless | Encoder (usually) |
| Torque ripple | Moderate | High (cogging + steps) | Moderate (commutation notches) | Low (smooth) |
| Pole count | Low | Very high (50–200) | Low–moderate (4–28) | Low–moderate |
| Peak efficiency | 70–80% | 50–70% | 80–90% | 85–94% |
| Torque at zero speed | Yes (stalls hot) | Yes (holding torque) | Only if sensored | Yes (full torque) |
| Best robotics use | Toys, cheap drives | Cheap precise positioning (3D printers) | Props, wheels, fans | Joints, legs, servos, gimbals |

A stepper is technically a multi-pole brushless machine too, but it's driven open-loop by stepping through known positions. It gives you cheap precise positioning without an encoder (hence 3D printers), at the cost of efficiency, noise, and the ever-present risk of losing steps under load. For dynamic robotics you almost always want a true BLDC/PMSM with feedback instead.

> Rule of thumb: if your control strategy is FOC, call it a PMSM in your head and stop worrying about whether the datasheet says "BLDC." Spec the back-EMF constant (Ke) and the resistance/inductance; the marketing label doesn't change the math.

## Motor anatomy: stator, rotor, poles and slots

### Stator and windings

The stator is the stationary iron core carrying the copper windings.
The iron is built from thin (typically 0.2–0.5 mm) laminations of silicon steel, stacked and insulated from each other. Lamination is not optional — a solid iron core would let eddy currents circulate and turn your motor into a space heater. Thinner laminations mean lower eddy losses and matter more at high electrical frequency (high-speed or high-pole-count motors).

The windings are wound around stator teeth (the "slots"). More copper, thicker wire, and a higher fill factor mean lower phase resistance and less I²R loss. This is why a "premium" motor that looks identical to a cheap one can run cooler at the same load: the winding is just better packed.

### Rotor and magnets

The rotor carries the permanent magnets. Almost all serious BLDCs use sintered neodymium-iron-boron (NdFeB) magnets for their high energy density. The magnet grade and temperature rating matter: cheap N35 magnets start losing flux (and your motor loses torque) above ~80 °C, while high-temp grades (N42SH, N45UH) hold up past 150 °C. A drone motor that "loses power when hot" is often demagnetizing its rotor, and that damage is permanent.

### Poles and slots

Pole count = number of magnetic poles on the rotor (always even). Slot count = number of stator teeth. They're written together, e.g. **12N14P** (12 stator slots, 14 rotor poles) — a common drone-motor layout.

Pole pairs = poles ÷ 2. This number is the conversion factor between mechanical and electrical speed:

```
electrical_frequency_Hz = (mechanical_RPM / 60) * pole_pairs
electrical_speed        = mechanical_speed * pole_pairs
```

A 14-pole (7 pole-pair) outrunner spinning at 6,000 RPM mechanical is generating electrical fundamentals at (6000/60)·7 = 700 Hz. The ESC has to commutate at that rate — that's why high-pole-count motors stress cheap ESCs and why drone ESCs advertise high "eRPM" limits.
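The pole-pair conversion above is easy to sanity-check in a few lines. This is a minimal sketch, not a controller routine: the function names `electrical_frequency_hz` and `erpm` are illustrative, and it assumes the "eRPM" figure on an ESC datasheet means mechanical RPM times pole pairs, which is the common convention.

```python
def electrical_frequency_hz(mechanical_rpm: float, rotor_poles: int) -> float:
    """Electrical fundamental (Hz) for a given shaft speed and rotor pole count."""
    if rotor_poles <= 0 or rotor_poles % 2 != 0:
        raise ValueError("rotor pole count must be a positive even number")
    pole_pairs = rotor_poles // 2
    return (mechanical_rpm / 60.0) * pole_pairs


def erpm(mechanical_rpm: float, rotor_poles: int) -> float:
    """Electrical RPM: the speed the ESC actually has to commutate at."""
    if rotor_poles <= 0 or rotor_poles % 2 != 0:
        raise ValueError("rotor pole count must be a positive even number")
    return mechanical_rpm * (rotor_poles // 2)


if __name__ == "__main__":
    # The 12N14P example from the text: 14 poles, 6,000 RPM at the shaft.
    print(electrical_frequency_hz(6000, 14))  # 700.0 Hz
    print(erpm(6000, 14))                     # 42000 eRPM
```

Comparing the `erpm` result against the limit an ESC advertises is a quick first filter before looking at commutation timing in detail.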

**Why high pole count?** More poles → more torque per amp at low speed (lower Kv) and smoother running, but a higher electrical frequency for a given shaft speed, which raises iron losses and commutation demands. Gimbal and direct-drive joint motors lean into high pole counts (often 14–28 poles) for exactly this reason — they want torque, not top speed.

### Inrunner vs outrunner

- **Inrunner**: magnets on an inner rotor, windings on the outer stator, shaft spins fast. Low rotor inertia, high Kv, high speed. Used for tools, geared joints, EDF fans, and RC car motors. The outer can is the heatsink, so they cool well.
- **Outrunner**: the outer "can" rotates and carries the magnets; windings are on a fixed inner stator. High torque, low Kv in a short, fat package. Larger air-gap radius means more torque per volume. Used for direct-drive props, gimbals, and QDD joints. The downside is the spinning can traps heat and has high inertia.

> Rule: outrunner for direct-drive torque (props, gimbals, QDD legs), inrunner for high-speed-then-gear-it-down (tools, EDFs, some industrial servos). The air gap — the tiny radial clearance between rotor and stator, often 0.3–1 mm — should be as small as the bearings and tolerances allow; every extra 0.1 mm of air gap costs you flux and torque.

## The Kv rating, decoded

Kv is the single most misunderstood spec on a BLDC. It is **not** a quality rating and it is **not** kilovolts. Kv is the motor velocity constant, in **RPM per volt**, measured at no load:

```
no_load_RPM ≈ Kv * V_applied      (no load, ignoring losses)
```

A 900 Kv motor on a 4S LiPo (≈14.8 V nominal) spins roughly 900 × 14.8 ≈ 13,300 RPM unloaded. Under load it spins slower, because current through the winding resistance drops voltage and the motor needs back-EMF headroom to push current.

### Kv is the inverse of the torque constant

Here's the relationship every robotics engineer should have memorized.
The torque constant Kt (N·m per amp) and the back-EMF constant Ke (V per rad/s) are numerically equal in SI units, and both are tied to Kv:

```
Kt [N·m/A] = 60 / (2 * pi * Kv)     # when Kv is in RPM/V
Kt [N·m/A] ≈ 9.549 / Kv
Kt [N·m/A] = Ke [V·s/rad]           # SI: torque const = back-EMF const
```

So a **900 Kv** motor has Kt ≈ 9.549 / 900 ≈ **0.0106 N·m/A**. Push 20 A through it and you get roughly 0.21 N·m (minus losses). A **90 Kv** motor — ten times lower — has Kt ≈ 0.106 N·m/A, ten times the torque per amp, at one tenth the speed per volt.

That's the whole story of why **low Kv = high torque**: it isn't two separate properties, it's one constant viewed two ways. A motor that spins slowly per volt necessarily produces more torque per amp, because the same back-EMF that limits speed is the same physics that converts current to torque.

### Why this matters for picking a motor

- **Drone props** want speed → high Kv (typically 900–2700 Kv for 5-inch quads on 4S–6S).
- **Heavy-lift / large props** want low Kv to swing big slow props → 100–400 Kv.
- **Robot joints / legs** want torque at low speed → very low Kv (50–200 Kv gimbal-style), then a small gear reduction.
- **Battery voltage and Kv trade off.** You can get the same top speed from a high-Kv motor on a low-voltage pack or a low-Kv motor on a high-voltage pack. Higher voltage means lower current for the same power, which means thinner wires and lower I²R losses — one reason robot drives are creeping from 24 V to 48 V.

> Rule: choose Kv so that Kv × (pack voltage) lands ~10–20% above your required top speed, leaving headroom for the voltage lost across winding resistance under load. Then check that the current needed for your torque (I = τ / Kt) stays under the motor's continuous rating.

## Electronic commutation: six-step vs FOC

Commutation is the act of switching which stator phases are energized so the magnetic field stays ahead of the rotor and keeps pulling it around. There are two dominant strategies.
(For the full controller-side treatment, see the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/).) ### Six-step / trapezoidal commutation The classic, simple method. The three phases are switched through six discrete states per electrical cycle; at any instant two phases conduct and one floats. You only need to know which 60° sector the rotor is in — six coarse positions, which Hall sensors or back-EMF zero-crossings provide directly. - **Pros**: dead simple, cheap, robust, low compute. Most hobby drone ESCs do exactly this (often with BLHeli or AM32 firmware). - **Cons**: torque ripple at the commutation steps (you feel six "notches" per electrical revolution), audible whine, and poor smoothness at low speed. Fine when the motor always spins fast (props), bad when you need a clean hold or slow precise motion. ### Field-oriented control (FOC) / sinusoidal FOC continuously computes the rotor angle and drives all three phases with smoothly varying sinusoidal currents, using the Clarke and Park transforms to decompose phase currents into a torque-producing component (Iq) and a flux component (Id). You command torque directly by commanding Iq, and the controller keeps Id ≈ 0 (or negative for field weakening at high speed). - **Pros**: smooth torque with minimal ripple, full torque at zero speed, quiet, efficient, enables torque control and regenerative braking. This is what robot joints need. - **Cons**: needs accurate rotor angle (encoder or good sensorless estimator), more compute, and current sensing on at least two phases. 
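The Clarke and Park steps described above can be sketched in a few lines. This is a minimal illustration, not any particular drive's implementation: it assumes the amplitude-invariant Clarke form, a d-axis aligned with phase A at zero electrical angle, and balanced phase currents. The function names are hypothetical.

```python
import math

def clarke(i_a, i_b, i_c):
    """Amplitude-invariant Clarke transform: three phase currents -> (alpha, beta)."""
    i_alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    i_beta = (2.0 / 3.0) * (math.sqrt(3.0) / 2.0) * (i_b - i_c)
    return i_alpha, i_beta

def park(i_alpha, i_beta, theta_e):
    """Rotate the stationary (alpha, beta) frame into the rotor (d, q) frame."""
    c, s = math.cos(theta_e), math.sin(theta_e)
    i_d = c * i_alpha + s * i_beta
    i_q = -s * i_alpha + c * i_beta
    return i_d, i_q

def phase_currents_from_dq(i_d, i_q, theta_e):
    """Inverse Park then inverse Clarke: (d, q) command -> ideal phase currents."""
    c, s = math.cos(theta_e), math.sin(theta_e)
    i_alpha = c * i_d - s * i_q
    i_beta = s * i_d + c * i_q
    i_a = i_alpha
    i_b = -0.5 * i_alpha + (math.sqrt(3.0) / 2.0) * i_beta
    i_c = -0.5 * i_alpha - (math.sqrt(3.0) / 2.0) * i_beta
    return i_a, i_b, i_c

if __name__ == "__main__":
    # Command pure torque current (Id = 0, Iq = 5 A) at an arbitrary electrical angle,
    # then confirm the forward transforms recover the same (Id, Iq).
    theta = math.radians(40.0)
    ia, ib, ic = phase_currents_from_dq(0.0, 5.0, theta)
    print(park(*clarke(ia, ib, ic), theta))
```

The point of the round trip is that the sinusoidal phase currents collapse to two near-constant numbers in the rotor frame, which is why a simple PI loop on Iq can command torque directly.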
| | Six-step / trapezoidal | FOC / sinusoidal | |---|---|---| | Position resolution needed | Coarse (60° sectors) | Fine (continuous angle) | | Torque smoothness | Notchy, ~6 ripples/cycle | Smooth | | Torque at zero speed | Poor | Full | | Compute | Low (8-bit MCU fine) | Moderate (needs FPU / fast MCU) | | Audible noise | Whine | Quiet | | Typical use | Drone/RC props, fans | Robot joints, gimbals, servos, EVs | | Example controllers | Hobbywing, BLHeli/AM32 ESCs | ODrive, mjbots moteus, Maxon EPOS, VESC | > Rule: props and wheels that live above a few hundred RPM are fine on six-step. Anything that must hold position, move slowly, or deliver clean torque (joints, legs, gimbals, steering) needs FOC. In 2026 there's little reason not to use FOC except cost and compute on the very smallest drives. ## Rotor position sensing: Halls, encoders, sensorless Commutation needs to know where the rotor is. There are three ways to find out, and the choice drives your low-speed performance and your BOM cost. For the full treatment of feedback devices, see the [encoders guide](/posts/encoders-ultimate-guide/). ### Hall-effect sensors Three Hall sensors spaced 120° (electrical) report which 60° sector the rotor is in. Cheap, robust, and good enough for six-step commutation and FOC startup. - **Pros**: works from zero speed, cheap (~cents each), tolerant of dirt and temperature. - **Cons**: only 6 states per electrical cycle — too coarse for smooth FOC by themselves, so they're often used only to bootstrap, then handed off to sensorless or an encoder. Hall misalignment causes commutation timing errors. ### Encoders (absolute / incremental) A magnetic (e.g. AS5047, AS5048) or optical encoder gives continuous high-resolution angle — 12 to 14+ bits (4096–16384 counts/rev). This is what good FOC drives use. mjbots moteus and ODrive both rely on magnetic absolute encoders mounted on the rotor. 
- **Pros**: continuous angle for smooth FOC, full torque at zero speed, accurate position control, enables torque estimation. Absolute encoders know position at power-on without homing. - **Cons**: cost, the need for precise mounting and electrical-angle calibration, and a magnetic encoder needs a diametric magnet on the shaft end. ### Sensorless (back-EMF estimation) The controller infers rotor angle from the motor's own back-EMF — either by watching the floating phase's zero crossing (six-step) or by running a flux/angle observer (FOC). No sensor hardware at all. - **Pros**: zero added cost and wiring, no sensor to fail, smaller motor. Standard on drone ESCs. - **Cons**: back-EMF is proportional to speed, so it **vanishes near zero speed**. Sensorless motors must be "kicked" through an open-loop startup ramp, and they cannot hold position or deliver smooth torque at standstill under load. Useless for a robot joint that must hold against gravity; perfect for a prop that's always spinning. | Sensing | Zero-speed torque | Cost | FOC smoothness | Typical use | |---|---|---|---|---| | Hall sensors | Yes (coarse) | $ | OK for startup | Industrial six-step, FOC bootstrap | | Encoder (magnetic/optical) | Yes (full) | $$–$$$ | Excellent | Robot joints, servos, QDD | | Sensorless back-EMF | No | Free | Good above ~5–10% speed | Drone props, fans, pumps | > Rule: if the motor must produce torque at or near zero speed (any joint, any leg, any steering), you need an encoder (or at minimum Halls). If it always spins fast and free (props, fans), go sensorless and save the part. ## Reading a BLDC datasheet Half of motor selection is just reading the datasheet correctly. Hobby motors give you Kv, weight, and a thrust table. Industrial motors (Maxon, Faulhaber, Nanotec) give you the real electrical and thermal parameters. Here's the glossary that matters. 
| Spec | Symbol / units | What it means | Why you care | |---|---|---|---| | Velocity constant | Kv [RPM/V] | No-load speed per volt | Sets top speed; inverse of Kt | | Torque constant | Kt [N·m/A] (or mN·m/A) | Torque per amp | τ = Kt · I; sets current for your load | | Back-EMF constant | Ke [V/(rad/s)] or [V/kRPM] | Generated voltage per speed | Numerically = Kt in SI; sets voltage headroom | | Rated (nominal) voltage | V [V] | Design voltage | Pairs with Kv for expected speed | | Phase resistance | R [Ω or mΩ] | Winding resistance (often phase-to-phase) | I²R loss and heat; voltage drop under load | | Phase inductance | L [µH or mH] | Winding inductance | Sets current ripple, needed PWM frequency, FOC tuning | | Continuous current | I_cont [A] | Max current you can run indefinitely | **Thermal limit** — the real working number | | Peak current | I_peak [A] | Max current for seconds | 2–4× continuous; valid only briefly | | Continuous torque | τ_cont [N·m] | = Kt · I_cont | Your real usable torque | | Peak / stall torque | τ_peak [N·m] | Short-burst torque | For acceleration, not steady state | | No-load current | I_0 [A] | Current to spin the motor unloaded | Bearing + iron + windage losses | | Thermal resistance | R_th [K/W] | Temp rise per watt of loss | How fast it heats up; sets duty cycle | | Max winding temp | T_max [°C] | Insulation / magnet limit | Often 100–155 °C; exceed it and you demagnetize | | Pole count / pole pairs | — | Magnetic poles | Sets electrical frequency vs RPM | ### The traps - **Resistance is often quoted phase-to-phase**, which for a wye (star) winding is 2× the per-phase value. Get this wrong and your loss math is off by 2×. - **Peak ratings are marketing-adjacent.** A drone motor rated "60 A peak" may only sustain 25 A continuous before the windings exceed 100 °C. The peak number is for the few seconds of a punch-out, not for a hover. 
- **Continuous current is a thermal number tied to cooling assumptions.** The same motor mounted on a big aluminum plate with airflow can run far more continuous current than one wrapped in a 3D-printed bracket. The datasheet figure assumes a specific heatsink; your install may be worse.
- **Kv tolerance is ±5–10%** on hobby motors. Two "900 Kv" motors from the same batch can differ enough to matter for a multirotor needing matched thrust.

> Rule: design to the continuous rating, treat peak as a transient acceleration budget you can spend for a few seconds, and always derate the datasheet's continuous current for your actual (usually worse) cooling.

## Torque, speed, power and the motor curve

A DC motor's behavior is captured by a torque-speed curve. For an idealized BLDC at fixed voltage, working in SI units (ω in rad/s, so Kv expressed in rad/s per volt equals 1/Kt):

```
speed:   ω = V / Kt - (R / Kt^2) * τ    # speed droops linearly with torque
torque:  τ = Kt * I                     # torque is proportional to current
power:   P_mech = τ * ω                 # peaks near the middle of the curve
```

At no load, the motor spins at ≈ V/Kt (the same Kv·V as before, once Kv is converted from RPM/V to rad/s per volt) and draws only I_0. As you load it, speed droops linearly and current rises. At stall, speed is zero, torque is maximum, and current is V/R, which is huge and will cook a small motor within seconds. **Mechanical power output peaks somewhere in the middle**, at roughly half the no-load speed and half the stall torque.

But the motor curve is the *electromagnetic* capability. It is not your operating envelope. Your operating envelope is set by **heat**.

### The thermal limit is the real constraint

Copper loss is I²R. Double the current and you quadruple the heat. The continuous current rating is simply the current at which steady-state winding temperature settles at the insulation limit (often 100–155 °C) given the motor's thermal resistance R_th and ambient.

```
T_winding ≈ T_ambient + P_loss * R_th
P_loss    ≈ I^2 * R   (+ iron and friction losses)
```

So the **continuous operating point** lives well below the stall and even below the peak-power point.
Operating above continuous is allowed only for short bursts, governed by the motor's thermal time constant (seconds for a tiny drone motor, minutes for a big servomotor with iron mass). This is the whole game in robotics actuator sizing: a motor that can momentarily deliver 5 N·m of peak torque to absorb an impact might only sustain 1.5 N·m continuously. If your robot leg needs 2 N·m continuously, that motor is too small even though it "hits 5 N·m." > Rule: size for the continuous (RMS over the duty cycle) torque, then verify the peak is covered for the worst transient. Heat is the limit, not the torque-speed curve. ## Gimbal motors, direct-drive and QDD actuators This is the section that explains modern legged robots, and it's worth understanding deeply. See also the [robot actuators guide](/posts/robot-actuators-ultimate-guide/) and the [legged/quadruped hardware guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/). ### Gimbal motors A gimbal motor is a low-Kv outrunner (often 50–200 Kv) originally designed to slowly and smoothly stabilize a camera. Low Kv means high Kt — lots of torque per amp at low speed — and the high pole count gives smooth, fine motion. iPower and T-Motor sell these by the hundreds. The robotics community noticed something: a gimbal motor driven by FOC is a near-ideal direct-drive torque source. It makes meaningful torque at zero speed, it's smooth, it's backdrivable, and — critically — because torque ≈ Kt · Iq, you can **estimate output torque from current** without a torque sensor. ### Direct drive vs quasi-direct drive (QDD) **Direct drive** means the motor connects to the load with no gearbox. Maximum backdrivability, zero gear lash, transparent force control, no gear noise — but you need a big, heavy motor to get useful torque, because the motor alone makes modest torque. Used in some haptics and a few specialized joints. 
**Quasi-direct drive (QDD)** is the compromise that changed legged robotics: a low-Kv high-torque motor plus a **small, single-stage planetary gearbox (typically 6:1 to 10:1)** and FOC. The low gear ratio multiplies torque ~6–10× while keeping the system **backdrivable** and preserving torque transparency — you can still sense and command torque accurately through the gearbox, because a 6:1 single stage has low friction and low reflected inertia compared to a 100:1 harmonic drive. This combination — pioneered visibly by the MIT Cheetah work and now standard in Unitree, mjbots, and most agile quadrupeds — gives you: - High torque density in a compact package. - Backdrivability for safe, compliant interaction and shock absorption (the leg "gives" on impact instead of shattering a gear). - Proprioceptive torque sensing from motor current — no separate torque sensor. - High control bandwidth for dynamic gaits. Contrast that with the **traditional servo approach**: a high-Kv motor and a 100:1+ harmonic/strain-wave gearbox (think industrial robot arms, Maxon EC + Harmonic Drive). That gives huge torque and stiffness and precision, but it is **not backdrivable**, has lash/elasticity, and hides the motor's torque behind gear friction. Great for a welding arm, wrong for a galloping leg. > Rule: for legged robots and force-controlled limbs, QDD (low-Kv outrunner + 6:1–10:1 planetary + FOC + encoder) is the default. For high-precision positioning arms where backdrivability doesn't matter, a high-ratio strain-wave gearbox on a smaller motor wins on torque density and stiffness. The open-source drives that made this accessible: **ODrive** (dual-axis FOC, popular for direct-drive and QDD builds) and **mjbots moteus** (compact integrated FOC controller designed expressly for quadruped actuators, CAN-FD, on-board magnetic encoder). 
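A short sketch of the two QDD arguments above: estimating joint torque from motor current, and comparing how much rotor inertia the output "sees" through a low versus high gear ratio. The gearbox efficiency, rotor inertia, and example currents are illustrative assumptions, not figures from any datasheet, and the torque estimate ignores friction and cogging.

```python
import math

def kt_from_kv(kv_rpm_per_volt):
    """Torque constant in N·m/A from Kv in RPM/V (Kt = 60 / (2*pi*Kv))."""
    return 60.0 / (2.0 * math.pi * kv_rpm_per_volt)

def joint_torque_estimate(i_q_amps, kv_rpm_per_volt, gear_ratio, gear_efficiency=0.9):
    """Proprioceptive torque estimate at the joint output.

    gear_efficiency=0.9 is an assumed placeholder for a single-stage planetary.
    """
    motor_torque = kt_from_kv(kv_rpm_per_volt) * i_q_amps
    return motor_torque * gear_ratio * gear_efficiency

def reflected_rotor_inertia(j_rotor_kg_m2, gear_ratio):
    """Rotor inertia as felt at the output: scales with the ratio squared."""
    return j_rotor_kg_m2 * gear_ratio ** 2

if __name__ == "__main__":
    # Hypothetical 100 Kv motor behind a 6:1 planetary, drawing 15 A of Iq.
    print(joint_torque_estimate(15.0, 100.0, 6.0))
    # Same (assumed) rotor inertia behind 6:1 versus 100:1.
    j_rotor = 1.0e-4  # kg·m², illustrative only
    print(reflected_rotor_inertia(j_rotor, 6.0), reflected_rotor_inertia(j_rotor, 100.0))
```

The ratio-squared term is the reason a 6:1 stage stays backdrivable while a 100:1 stage does not: the same rotor looks 36× heavier through one and 10,000× heavier through the other.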
## Drone propulsion BLDCs vs robot-joint BLDCs

Both are "BLDC motors," but they're optimized for opposite ends of the torque-speed plane, and confusing them is a common rookie error.

### Drone / propulsion motors

The job is to spin a propeller fast and efficiently, always in one direction, always above idle speed.

- **High Kv** (900–2700 for 5-inch quads; 100–400 for heavy-lift big props): speed matters.
- **Outrunner**, optimized for thrust-per-watt and weight, not torque at standstill.
- **Sensorless six-step** commutation, typically on BLHeli_32- or AM32-class firmware (sensorless FOC is offered on some higher-end ESC lines, including certain Hobbywing models). The prop never needs zero-speed torque, so no Hall sensors or encoder.
- **Aggressive peak ratings**, light construction, minimal heatsinking: a 5-inch quad motor weighs ~30–50 g and relies on prop wash for cooling.
- Examples: T-Motor F-series and iFlight/iPower for racing/freestyle; KDE Direct and T-Motor U/MN-series for heavy-lift; matched with Hobbywing or T-Motor ESCs.

### Robot-joint / actuator motors

The job is to produce controllable torque across a range that includes zero speed, often bidirectionally, often holding against a load.

- **Low Kv** (50–300): torque matters, top speed doesn't.
- **Outrunner** (for QDD) or **inrunner + high-ratio gearbox** (for stiff arms).
- **Encoder-based FOC**: must have full torque at zero speed and torque sensing.
- **Conservative continuous ratings**, robust thermal path to the joint structure, designed for thousands of hours.
- Examples: Maxon EC/ECX + EPOS or gearhead for industrial; T-Motor/iPower gimbal motors + ODrive/moteus for robotics; integrated actuators like mjbots, Unitree, and CubeMars/AK-series.
| Priority | Drone propulsion motor | Robot-joint motor | |---|---|---| | Kv | High (speed) | Low (torque) | | Direction | Unidirectional | Bidirectional | | Zero-speed torque | Not needed | Required | | Commutation | Six-step / sensorless FOC | Sensored FOC | | Feedback | Sensorless | Encoder | | Cooling | Prop wash, lightweight | Conduction into structure | | Lifetime target | 10s–100s of hours | 1000s+ of hours | | Failure mode of concern | Demag at full throttle | Thermal at sustained torque | ## Cooling, thermal management and duty cycle Because the continuous rating is a thermal limit, cooling is not an afterthought — it directly sets how much usable torque you get. The same motor can deliver 1.5× the continuous current with good thermal design. ### Where the heat goes Heat is generated mostly in the windings (I²R copper loss) and the iron (eddy and hysteresis loss, which rise with electrical frequency). It must travel: winding → stator iron → housing → ambient. Each interface has a thermal resistance; the sum is your R_th (K/W). - In an **inrunner**, the stator is the outer body, so heat conducts straight into the housing and out — good cooling. - In an **outrunner**, the windings are on the inner stator and the spinning can is on the outside; heat has to cross the air gap or go out the mounting face. Outrunners cool worse, which is why direct-drive joint motors often bolt the stator to a big aluminum structure that acts as a heatsink. ### Levers you control - **Mount to a heatsink.** Bolting the motor to the robot's aluminum chassis can drop R_th dramatically. A 3D-printed PLA bracket is a thermal blanket — it insulates. - **Airflow.** Forced convection (prop wash, a fan, or just an open chassis) can double the continuous rating versus a sealed enclosure. - **Higher voltage, lower current.** Same power at higher voltage means lower current means less I²R loss. 
Moving a 24 V drive to 48 V halves the current for the same power and cuts copper loss 4× — a big reason robot drivetrains are going to 48 V. - **Better winding (higher copper fill).** You can't change this after purchase, but it's why premium motors run cooler. ### Duty cycle and thermal time constant A motor has a thermal time constant — how long it takes to heat up. Small drone motors heat in seconds; big servomotors take minutes. This lets you exceed continuous current for short bursts as long as the **RMS current over your duty cycle** stays within the continuous rating. ``` I_rms = sqrt( mean( I(t)^2 ) ) over the motion cycle # keep I_rms <= I_continuous, even if peaks go higher briefly ``` A pick-and-place arm that accelerates hard (high peak current) then sits idle has a low RMS current and can use a smaller motor than its peak suggests. A motor holding a leg against gravity all day has its hold current as a continuous load — no duty-cycle relief. > Rule: compute RMS current over the actual motion profile, not the peak. Then check the peak fits within the seconds your thermal time constant allows. A motor with a fat thermal mass forgives spiky loads; a tiny one does not. ## Selecting a BLDC for a robot Here's the actual workflow for sizing a BLDC for a robot joint or drive. Do it in this order. ### 1. Define the load's torque-speed point(s) Work out the worst-case continuous torque and the worst-case speed at the **output** (after the gearbox). For a leg, that's the torque to hold/move the robot's mass through its gait; for a drive wheel, the torque to climb the worst grade at the target speed; for an arm, the torque at full extension plus dynamics. ### 2. Pick a gear ratio (if any) QDD legs: 6:1–10:1 single-stage planetary. Precision arms: strain-wave 50:1–160:1. Wheels: often direct or a low single stage. The ratio multiplies torque and divides speed, and it divides reflected inertia by the ratio squared. 
Reflect the load back to the motor: τ_motor = τ_output / (ratio · efficiency), ω_motor = ω_output · ratio. ### 3. Choose voltage Higher voltage = lower current for the same power = thinner wires, less loss, but more expensive electronics and tighter safety rules. Common robotics buses: 24 V (small), 36–48 V (mid), 48 V+ (high-power, the 2026 sweet spot for legged/AMR). Match your battery chemistry: a 6S LiPo is ~22–25 V, a 12S is ~44–50 V. ### 4. Pick Kv Choose Kv so Kv × V_pack lands ~10–20% above your required motor RPM (after reflecting through the gearbox). Then compute the current your torque demands: I = τ_motor / Kt, where Kt = 9.549 / Kv. Verify that current is below the motor's continuous rating with margin. ### 5. Choose the sensor and controller - Needs torque at zero speed (any joint/leg) → encoder + FOC drive (ODrive, moteus, Maxon EPOS). - Always spinning fast (prop, fan, free wheel) → sensorless six-step or sensorless FOC ESC (Hobbywing, BLHeli/AM32). - In between → Halls + FOC. ### 6. Check thermal margin Compute RMS current over the duty cycle; confirm it's under continuous with the cooling you'll actually have (derate for bad brackets). Confirm peak current is covered for the worst transient within the thermal time constant. 
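The six steps above reduce to a handful of arithmetic checks. The sketch below strings them together under stated assumptions: the joint requirement, gearbox efficiency, bus voltage, Kv, and current profile are all made-up example numbers, and the helper names are hypothetical. It uses only the formulas given in this guide (Kt = 60 / (2π·Kv), τ_motor = τ_output / (ratio · efficiency), I = τ / Kt, and RMS current over the duty cycle).

```python
import math

def kt_from_kv(kv_rpm_per_volt):
    """Torque constant in N·m/A from Kv in RPM/V."""
    return 60.0 / (2.0 * math.pi * kv_rpm_per_volt)

def reflect_to_motor(tau_out_nm, rpm_out, ratio, efficiency):
    """Step 2: reflect output torque and speed back through the gearbox."""
    return tau_out_nm / (ratio * efficiency), rpm_out * ratio

def speed_headroom(kv_rpm_per_volt, v_bus, rpm_motor):
    """Step 4: fractional margin of no-load speed over required motor speed."""
    return kv_rpm_per_volt * v_bus / rpm_motor - 1.0

def rms_current(profile):
    """Step 6: RMS current over a duty cycle given (current_A, duration_s) segments."""
    total_time = sum(t for _, t in profile)
    return math.sqrt(sum(i * i * t for i, t in profile) / total_time)

if __name__ == "__main__":
    # Assumed example: 8 N·m continuous at 300 RPM output, 8:1 planetary at 90%,
    # 48 V bus, candidate 58 Kv motor.
    tau_m, rpm_m = reflect_to_motor(8.0, 300.0, 8.0, 0.9)
    headroom = speed_headroom(58.0, 48.0, rpm_m)      # target roughly 0.10 to 0.20
    i_cont = tau_m / kt_from_kv(58.0)                 # compare with I_cont on the datasheet
    # Assumed motion cycle: 0.5 s at 20 A, 1.5 s at 5 A, 2.0 s idle.
    i_rms = rms_current([(20.0, 0.5), (5.0, 1.5), (0.0, 2.0)])
    print(tau_m, rpm_m, headroom, i_cont, i_rms)
```

If the headroom falls outside the 10–20% band, adjust Kv or bus voltage before looking at current; if `i_rms` or `i_cont` exceeds the derated continuous rating, the motor is too small regardless of its peak figure.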
### Worked comparison table A rough guide to real parts across the robotics spectrum (specs approximate; always check the live datasheet): | Use case | Example part | Type | Kv | Voltage | Continuous | Sensor / control | |---|---|---|---|---|---|---| | 5-inch racing quad | T-Motor F40 Pro | Outrunner | ~1950 Kv | 4S–6S | ~35 A | Sensorless six-step ESC | | Heavy-lift prop | KDE Direct 4014XF | Outrunner | ~380 Kv | 6S–8S | ~40 A | Sensorless ESC | | Camera gimbal / light joint | iPower GM4108 | Outrunner | ~24–170 Kv | 12–24 V | a few A | FOC + encoder | | Quadruped leg (QDD) | mjbots / CubeMars AK80 | Outrunner + 6:1–9:1 | ~100 Kv class | 24–48 V | ~10–20 A | FOC + magnetic encoder | | Robot drive wheel | ODrive + hub/inrunner | Inrunner/outrunner | 150–300 Kv | 24–56 V | 20–60 A | FOC + encoder/Halls | | Precision arm joint | Maxon ECX + gearhead | Inrunner + strain-wave | (geared) | 24–48 V | per frame size | FOC (EPOS) + encoder | > Rule: never spec a motor from the peak/burst number on the box. Start from the continuous torque your load needs, reflect it through your gearbox to motor current via Kt, and leave 20–30% thermal headroom. The motor that "just barely fits" on paper runs hot and dies early. ## Frequently asked questions **Is a higher Kv motor more powerful?** No. Kv tells you speed per volt, not power. A high-Kv motor spins faster but makes less torque per amp; a low-Kv motor is the reverse. Power capability is set by current (heat), voltage, and the motor's physical size — not by Kv. Two motors of identical size with different Kv have nearly identical power capability; they just package it as different speed/torque combinations. **What's the real difference between a BLDC and a PMSM?** Physically, very little — both are three-phase permanent-magnet machines with electronic commutation. The conventional distinction is the back-EMF waveform: trapezoidal (called BLDC, suited to six-step commutation) vs sinusoidal (called PMSM, suited to FOC). 
In practice, modern FOC controllers drive both sinusoidally, so a "BLDC" run under FOC is operating as a PMSM. Spec the electrical constants and ignore the label. **Why do robot legs use low-Kv gimbal motors instead of geared servos?** A low-Kv motor makes high torque per amp and, under FOC with a small (6:1–10:1) planetary gear, stays backdrivable and lets you estimate torque from current — no torque sensor needed. That gives compliant, dynamic, force-controlled legs. A high-ratio geared servo gives more torque and stiffness but isn't backdrivable and hides torque behind gear friction, which is wrong for dynamic locomotion. **Can I run a BLDC without an encoder or Hall sensors?** Yes, sensorless, by estimating rotor angle from back-EMF. But back-EMF disappears near zero speed, so sensorless motors need an open-loop startup ramp and can't hold position or deliver smooth torque at standstill. That's fine for props, fans, and free-spinning wheels, and unacceptable for any joint that must hold a load. **What does the continuous current rating actually limit?** Heat. Continuous current is the steady current at which the winding temperature settles at the insulation limit (often 100–155 °C) for the motor's thermal resistance and assumed cooling. It is not a magnetic or torque ceiling — the motor can briefly produce far more torque (peak rating) until the windings overheat. Always design to continuous and derate for your real cooling. **Why are robot drivetrains moving from 24 V to 48 V?** Power is voltage times current, and losses are current squared times resistance. At double the voltage you halve the current for the same power, cutting copper (I²R) loss by 4×. That means cooler motors, thinner wires, smaller connectors, and higher continuous torque from the same hardware. The tradeoff is more expensive electronics and stricter safety handling. **How do I convert Kv to torque constant Kt?** Kt [N·m/A] ≈ 9.549 / Kv (with Kv in RPM/V). 
So a 900 Kv motor has Kt ≈ 0.0106 N·m/A. In SI units the back-EMF constant Ke (V per rad/s) equals Kt numerically. This is the single most useful conversion in BLDC selection: it turns the speed spec into a torque-per-amp number you can size current against. **Inrunner or outrunner — which should I pick?** Outrunner for direct-drive torque at low speed in a short package: props, gimbals, QDD legs. Inrunner for high speed and low rotor inertia that you then gear down: tools, EDF fans, many industrial servos. Outrunners cool worse (windings trapped inside the spinning can), so direct-drive joint motors lean on the mounting structure as a heatsink. **What's the typical efficiency of a BLDC?** 80–90% at the design point for a well-matched motor; large industrial servomotors reach 90–94%, while tiny drone motors at full throttle can drop into the 70s because of high current density and limited cooling. Efficiency is highest near the rated operating point and falls off badly at very low load (dominated by iron/friction losses) and near stall (dominated by I²R). **Why do high-pole-count motors stress ESCs?** Electrical frequency = mechanical RPM/60 × pole pairs. A 14-pole (7 pole-pair) motor at 6,000 RPM runs its field at 700 Hz; the ESC must commutate at that rate. High-pole-count motors (gimbal, direct-drive) demand fast commutation and high eRPM-capable controllers, and the higher electrical frequency also raises iron losses. **Do BLDC motors have cogging torque, and does it matter?** Yes — the rotor magnets prefer to align with stator teeth, producing small detent (cogging) torque even unpowered. It's worst in motors with certain slot/pole combinations and matters for smooth low-speed motion, haptics, and precise positioning. Skewed slots/magnets and good slot/pole pairings (like 12N14P) reduce it; FOC can partly compensate the residual. 
**What kills a BLDC motor in practice?** Three things: bearings wearing out (the usual end-of-life), permanent demagnetization of the rotor magnets from overheating (running peak current too long, or low-grade magnets above ~80 °C), and winding insulation failure from sustained over-temperature. All three trace back to heat — which is why thermal design and honest continuous-current sizing are the whole game. ## Changelog - **2026-06-16** — Initial publication. --- # Robot Wiring, Connectors & Slip Rings: The Ultimate Guide URL: https://blog.robo2u.com/posts/robot-wiring-cables-connectors-ultimate-guide/ Published: 2026-06-15 Updated: 2026-06-20 Tags: robot-wiring, cables, connectors, slip-rings, cable-management, continuous-flex, m12, emi-shielding, guide Reading time: 38 min > A practical engineering guide to robot wiring: wire gauge & ampacity, continuous-flex cable (Igus chainflex) vs standard, e-chains, M8/M12 connectors, EMI shielding & grounding, slip rings, and the flex-fatigue failures that quietly kill robots. Pull apart any robot that died in the field and there is a depressingly common autopsy result: nothing in the BOM failed. The motor was fine. The drive was fine. The controller was fine. What failed was a conductor that flexed three million times and finally cracked a strand, or a connector that fretted its way to intermittence, or a shield that was grounded at both ends and turned a chassis into an antenna. Wiring is the part of the machine that everyone treats as plumbing and that fails more often than anything you actually spec'd. This guide treats wiring as a first-class mechanical and electrical subsystem, because it is one. 
We will cover how to size a conductor from current and voltage drop, why a moving joint needs a completely different cable than a static panel, how drag chains and dress packs keep cable alive through millions of cycles, how to choose connectors that survive vibration and washdown, how to keep power noise out of your encoder feedback, and how to pass power, signal, and even fluid across a joint that rotates forever. Numbers carry units; opinions carry reasons. **The take**: wiring and flex-fatigue quietly kill robots, and they do it precisely because nobody owns them. The conductor on a moving axis is a *mechanical fatigue component* with a finite cycle life, exactly like a bearing — and like a bearing, it must be specified, rated, routed within a minimum bend radius, and replaced on a schedule. Treat continuous-flex cable, the dress pack, and the connector interface as primary design elements sized from the *motion profile* and the *current path* together, and your robot's MTBF is bounded by its silicon. Treat them as an afterthought and you will chase intermittent faults for the life of the machine. Companion reading: [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/), [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), [encoders](/posts/encoders-ultimate-guide/), [real-time control systems](/posts/real-time-control-systems-ultimate-guide/), [industrial automation: PLC/SCADA/fieldbus](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/), and [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why wiring is a first-class design problem](#first-class) 3. [Wire gauge, ampacity & voltage drop](#gauge-ampacity) 4. [Continuous-flex cable vs standard cable](#continuous-flex) 5. [The dress pack & cable management on moving arms](#dress-pack) 6. [Drag chains, e-chains & cable carriers](#drag-chains) 7. 
[Bend radius & strain relief rules](#bend-radius)
8. [Connectors: coding, families & IP ratings](#connectors)
9. [Industrial-network cabling](#network-cabling)
10. [EMI/EMC, shielding & grounding](#emi)
11. [Slip rings: continuous-rotation joints](#slip-rings)
12. [Labeling, harness build & service](#harness-build)
13. [Failure modes & preventive maintenance](#failure-modes)
14. [Frequently asked questions](#faq)

## Key takeaways

- **A cable on a moving axis is a fatigue component, not a wire.** It has a bend-cycle rating the way a bearing has an L10 life. Spec it from the motion profile (bend radius, travel, acceleration, cycles/day), not from "whatever fits the gland."
- **Flex fatigue is the #1 mechanical field failure of articulated and gantry robots.** Strands work-harden and crack; the failure is intermittent first (a flickering encoder, a dropping fieldbus node) and open-circuit later. It is invisible to a power-on bench test.
- **Standard cable and continuous-flex cable are different products.** Continuous-flex uses fine high-strand-count copper, short-lay bundle stranding around a central core, and a low-friction jacket. A Lapp Ölflex Classic panel cable in an e-chain will fail in weeks; an Igus chainflex cable rated for 10+ million cycles will not.
- **Size conductors for two limits at once**: ampacity (the thermal limit, so insulation doesn't cook) and voltage drop (the functional limit, so the motor or logic rail actually gets its volts). On long DC runs, voltage drop usually wins and forces a larger gauge than ampacity alone; see [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/).
- **AWG and mm² are not the same scale.** Memorize a few anchors: AWG 18 ≈ 0.82 mm², AWG 14 ≈ 2.08 mm², AWG 10 ≈ 5.26 mm². Three AWG numbers down doubles the area.
- **The e-chain (drag chain / cable carrier) is sized by fill, bend radius, and separation**, not just by "will the bundle fit."
Round cables want ~10–20% clearance, no stacking without separators, and a bend radius that respects each cable's own minimum.
- **Bend radius is the master rule.** Continuous-flex cable typically needs a dynamic bend radius of 7.5–10× outer diameter (×d); fixed installation tolerates 4–5×d. Violate it and rated cycle life evaporates.
- **M12 connectors are the industrial default for the moving end of a robot.** Learn the coding: A-code for sensors/DC, B-code for legacy fieldbus, D-code for 100 Mbit Ethernet, X-code for Gigabit, and L/T/S/K power codes. IP65/IP67/IP69K define what survives washdown.
- **Keep power and signal physically separated**, route them in different e-chain compartments or different chains, and shield the signal. A VFD or servo drive cable run next to an encoder cable is the classic source of phantom faults (see [encoders](/posts/encoders-ultimate-guide/)).
- **Shield grounding is a decision, not a default.** Ground the shield at one end for low-frequency signal cables to avoid ground loops; ground both ends (360° to the connector backshell) for high-frequency and drive cables. Get this wrong and you inject noise instead of rejecting it.
- **Fieldbus cabling has hard rules**: shielded Cat5e/Cat6 for EtherCAT/PROFINET, twisted pairs, 100 m max copper segment, and connector pinouts that must match the coding. See [industrial automation](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/) and [real-time control](/posts/real-time-control-systems-ultimate-guide/).
- **A slip ring is how you cross a continuously rotating joint** (turret, pan axis, rotary table) with power, signal, and sometimes fluid/pneumatics. Brush rings are cheap and lossy; fiber-brush gold-on-gold (Moog) and capsule rings (Servotecnica) carry clean signal and Ethernet across an infinite rotation.
- **Label everything and document the harness.** A wire that isn't labeled and a harness that isn't drawn cost you an hour per fault, forever.
Service time is a design output. ## Why wiring is a first-class design problem Here is the mental shift that separates robots that run for years from robots that generate service tickets: on a moving machine, the cable is a moving part. It is subjected to bending, torsion, tension, acceleration, abrasion, and temperature cycling, millions of times. A six-axis arm doing a pick-and-place at 30 cycles per minute, two shifts a day, racks up roughly **5.6 million bend cycles per year per flexing point**. That is squarely in fatigue territory for copper. Copper doesn't care that it's carrying your control loop. It work-hardens when you bend it repeatedly. Each bend cycle plastically deforms the outer strands of a conductor; the strands accumulate damage and eventually crack. When a few strands in a bundle break, the conductor's resistance rises and its current capacity drops — but it still passes a continuity test. That is the cruelest part of flex fatigue: **the cable that's about to fail looks perfect on a meter.** It only misbehaves at the specific bend angle and the specific load where the cracked strands lose contact, which is exactly the operating condition, not the bench condition. > **Rule:** Any conductor that moves with the robot is a finite-life fatigue component. Give it a cycle rating, a minimum bend radius, a service interval, and a place in the maintenance log — the same as a bearing or a belt. The field data backs this up. Across industrial automation, the single most common cause of unplanned robot-cell downtime that isn't a process fault is a cable or connector in the dress pack — a cracked conductor in a continuous-flex cable that was under-rated or over-bent, a connector that fretted loose under vibration, or a shield that broke at a strain point and let noise in. These are not exotic failures. They are the default outcome of treating wiring as plumbing. So we design wiring the way we design any other fatigue-loaded subsystem. 
We separate the static plant wiring (inside the cabinet, in cable tray, behind panels) from the dynamic wiring (anything that flexes with motion). The static stuff is easy and forgiving; standard panel cable, generous routing, screw terminals. The dynamic stuff is where the engineering lives: continuous-flex cable, e-chains, dress packs, slip rings, and connectors chosen for vibration and cycle life. Get the dynamic third of the wiring right and the machine lasts.

## Wire gauge, ampacity & voltage drop

A conductor has two independent sizing constraints, and you must satisfy both.

**Ampacity** is the thermal limit: how much current the conductor can carry continuously before its insulation overheats. Push too much current and the I²R loss in the copper raises the conductor temperature past the insulation rating (typically 80 °C, 90 °C, or 105 °C), degrading it. Ampacity depends on conductor area, insulation temperature rating, and — critically — the cooling environment. A wire bundled in an e-chain with twenty others, surrounded by jacket and chain, runs much hotter than the same wire in free air. That's **derating**, and it's where most people get burned (sometimes literally).

**Voltage drop** is the functional limit: how much of your supply voltage gets eaten by the resistance of the run before it reaches the load. On a robot's low-voltage DC bus, this matters enormously. The resistance of copper at 20 °C is:

```
ρ_copper = 1.72e-8 Ω·m   (resistivity at 20 °C)
R = ρ · L / A            where L = total conductor length (m), A = cross-section (m²)

For a round-trip DC run, use L = 2 × (one-way distance),
because the current returns on the negative conductor.
```

Worked example — a 30 A motor feed on a 24 V bus, 4 m one way (8 m round trip), 2.5 mm² copper:

```
A = 2.5 mm² = 2.5e-6 m²
L = 8 m (round trip)
R = 1.72e-8 × 8 / 2.5e-6 = 0.055 Ω
V_drop = I × R = 30 A × 0.055 Ω = 1.65 V
% drop = 1.65 / 24 = 6.9%
P_loss = I² × R = 30² × 0.055 = 49.5 W burned in the cable
```

Nearly 7% drop and 50 W of heat dumped into the cable on a single feed — that is a problem. Upsize to 6 mm²:

```
R = 1.72e-8 × 8 / 6e-6 = 0.023 Ω
V_drop = 30 × 0.023 = 0.69 V  →  2.9%
P_loss = 30² × 0.023 = 20.7 W
```

> **Rule:** Budget total DC voltage drop to ≤3% on power feeds and ≤1% on sensitive logic/sensor rails. On long low-voltage runs, voltage drop — not ampacity — usually sets the gauge.

See [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/) for how bus-voltage choice (24 V vs 48 V) changes all of this: doubling the bus voltage quarters the I²R loss for the same power.

Now the derating. Published ampacity tables assume a single conductor in open air at a reference ambient (often 30 °C). Inside an e-chain bundle you apply two corrections — a bundle factor and an ambient factor — and the deratings multiply:

```
I_allowed = I_table × k_bundle × k_ambient

Typical bundle factor (k_bundle), conductors carrying current:
  3 conductors:    ~0.70
  6 conductors:    ~0.55
  10+ conductors:  ~0.40–0.50

Ambient factor (k_ambient) for 90 °C insulation:
  30 °C: 1.00
  40 °C: 0.91
  50 °C: 0.82
  60 °C: 0.71
```

A wire rated 25 A in free air, bundled with ten others at 50 °C ambient, might be good for `25 × 0.45 × 0.82 ≈ 9 A`. People who skip this step build harnesses that run hot, age the insulation, and create the exact thermal-cycling that accelerates flex fatigue.

Here is a practical reference table. Ampacity values are conservative single-conductor figures for chassis/power wiring; derate for bundling as above.
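The two checks above, voltage drop and derated ampacity, can be sketched as a small helper. This is a minimal illustration and not a sizing tool: the function names and the list of candidate metric cross-sections are assumptions made for the example, and the resistivity is the 20 °C value used in the worked example.

```python
RHO_CU_20C = 1.72e-8  # Ω·m, copper resistivity at 20 °C (value from the text)

# Candidate cross-sections in mm² (assumed list of common metric sizes).
CANDIDATE_AREAS_MM2 = (0.5, 0.75, 1.0, 1.5, 2.5, 4.0, 6.0, 10.0, 16.0, 25.0)


def voltage_drop(current_a, one_way_m, area_mm2, bus_v):
    """Round-trip DC voltage drop and I²R loss for one feed."""
    length_m = 2.0 * one_way_m          # current returns on the negative conductor
    area_m2 = area_mm2 * 1e-6
    r_ohm = RHO_CU_20C * length_m / area_m2
    v_drop = current_a * r_ohm
    return {
        "r_ohm": r_ohm,
        "v_drop": v_drop,
        "pct": 100.0 * v_drop / bus_v,
        "p_loss_w": current_a ** 2 * r_ohm,
    }


def smallest_area_for_drop(current_a, one_way_m, bus_v, max_pct,
                           areas_mm2=CANDIDATE_AREAS_MM2):
    """First candidate area whose drop fits the budget, or None if none does."""
    for area in sorted(areas_mm2):
        if voltage_drop(current_a, one_way_m, area, bus_v)["pct"] <= max_pct:
            return area
    return None


def derated_ampacity(i_table_a, k_bundle, k_ambient):
    """Table ampacity multiplied by the bundle and ambient factors."""
    return i_table_a * k_bundle * k_ambient


if __name__ == "__main__":
    print(voltage_drop(30, 4, 2.5, 24))
    print(smallest_area_for_drop(30, 4, 24, max_pct=3.0))
    print(derated_ampacity(25, 0.45, 0.82))
```

Run against the worked example (30 A, 4 m one way, 24 V), the helper reproduces the roughly 6.9% drop at 2.5 mm² and picks 6 mm² as the first size inside a 3% budget. The table that follows gives the per-gauge areas and free-air ampacities such a check would start from.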
| AWG | Area (mm²) | Ω/km (20 °C) | ~Ampacity, free air (A) | Typical robot use |
|---|---|---|---|---|
| 22 | 0.33 | 52.7 | 3–5 | Low-current signal, encoder pairs |
| 20 | 0.52 | 33.3 | 5–8 | Sensor power, small signals |
| 18 | 0.82 | 20.9 | 10–16 | Logic feeds, small actuators, M8/M12 sensor leads |
| 16 | 1.31 | 13.2 | 13–22 | Small motor feeds, brakes, fans |
| 14 | 2.08 | 8.3 | 20–32 | Servo phase leads (small), 24 V distribution |
| 12 | 3.31 | 5.2 | 28–41 | Motor feeds, main DC branches |
| 10 | 5.26 | 3.3 | 40–55 | Drive-to-motor, high-current branches |
| 8 | 8.37 | 2.1 | 55–75 | Main bus, battery feeds |
| 6 | 13.3 | 1.3 | 75–101 | Battery main, inverter feeds |
| 4 | 21.2 | 0.82 | 100–135 | High-power packs, big drives |

> **Rule of thumb worth memorizing:** every 3 AWG steps down roughly doubles the cross-sectional area (and halves the resistance). AWG 10 has ~2× the copper of AWG 13, ~4× of AWG 16. And resistivity climbs with temperature at about +0.39%/°C, so a conductor at 70 °C has ~20% more resistance than the 20 °C table value — fold that into voltage-drop budgets on hot runs.

For drive-to-motor wiring specifically, follow the drive manufacturer's gauge table, because PWM current has an RMS value higher than the DC-equivalent and the cable is part of the EMC system. See [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/).

## Continuous-flex cable vs standard cable

This is the single most important material choice in robot wiring, and the one most often gotten wrong by people building their first machine. Standard cable and continuous-flex (high-flex) cable look identical from the outside. They behave completely differently when bent millions of times.
**Standard cable** — your everyday panel wire, Lapp Ölflex Classic 110/100, building wire, generic hookup wire — uses relatively coarse copper strands (think 7 or 19 strands for a given gauge), often stranded in simple concentric layers, with a jacket optimized for cost and chemical resistance rather than flex life. It's perfectly good for fixed installation: in a cabinet, in tray, behind a panel, anywhere it doesn't move. Put it in an e-chain and it dies — the coarse strands work-harden fast, the layers slide and abrade against each other, and the jacket cracks. Failure in weeks to months under continuous flex. **Continuous-flex cable** (Igus chainflex, Lapp Ölflex FD/Chain series, Helukabel, TKD) is engineered for the e-chain. The defining features: - **Fine, high strand-count conductors.** Many thin strands (e.g. 0.05–0.1 mm each) instead of a few thick ones. Strain per strand drops, so each strand survives more bend cycles. This is the single biggest contributor to flex life. - **Short-lay bundle stranding around a central core.** The conductors are stranded with a short, tight pitch in bundles laid helically around a central tension-bearing element. As the cable bends, conductors can shift along the helix instead of stretching, distributing strain. Chainflex literature calls this the "bundle stranding with optimized lay length." - **Gusset-filled, pressure-extruded jackets.** The jacket fills the spaces between bundles (gusset fill) so conductors can't migrate, and it's extruded under pressure to grip the core as a unit. Low-friction, abrasion-resistant TPE or PUR outer jacket so the cable slides cleanly through the chain. - **Tight, controlled dimensions** so it sits predictably in the e-chain and respects fill rules. 
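Why finer strands help can be seen from a first-order estimate: for a single round strand bent to a radius R, the strain at its outer surface is roughly (d/2)/R. The sketch below uses only that approximation. It ignores the helical lay and the strand sliding that real flex cable relies on, so the numbers are useful as a relative comparison, not as a life prediction. The function name and the example strand diameters are illustrative assumptions.

```python
def outer_fiber_strain(strand_diameter_mm: float, bend_radius_mm: float) -> float:
    """Approximate outer-surface bending strain of one round strand.

    Simple beam-bending estimate: strain = (d / 2) / R, with R taken to the
    strand centerline. Real stranded cable behaves better than this because
    the strands can shift along the helix.
    """
    if bend_radius_mm <= 0:
        raise ValueError("bend radius must be positive")
    return (strand_diameter_mm / 2.0) / bend_radius_mm


if __name__ == "__main__":
    radius_mm = 50.0                                 # same bend for both strands
    coarse = outer_fiber_strain(0.25, radius_mm)     # assumed coarse strand
    fine = outer_fiber_strain(0.08, radius_mm)       # assumed fine strand
    print(f"coarse: {coarse:.4%}  fine: {fine:.4%}  ratio: {coarse / fine:.2f}")
```

In this approximation the strain scales linearly with strand diameter, so at the same bend radius a strand one third the diameter sees about one third the strain.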
Igus markets chainflex with a specific reliability model worth understanding: they publish a **guaranteed bend-cycle / service-life** figure for each cable at a stated bend radius (in multiples of outer diameter, ×d), and back it with a 36-month guarantee. The model is essentially "this cable will achieve X million double-strokes at a bend radius of Y×d." A bus cable might be rated for 5 million cycles at 10×d; a premium servo cable for 50+ million at 7.5×d. The relationship is steep: enlarge the bend radius and life climbs; tighten it below spec and life collapses non-linearly.

| Property | Standard cable (e.g. Ölflex Classic 110) | Continuous-flex (e.g. Igus chainflex) |
|---|---|---|
| Strand construction | Coarse, few strands (7/19) | Fine, high strand count, bundle-stranded |
| Central core | Usually none | Tension-bearing central element |
| Jacket | Cost-optimized PVC | Low-friction PUR/TPE, gusset-filled |
| Dynamic bend radius | Not rated for continuous flex | 7.5–12.5×d (rated) |
| Bend-cycle life | Unrated; fails in weeks in e-chain | 5–50+ million cycles (guaranteed) |
| Torsion capability | None | Torsion-rated variants for robot arms |
| Cost (relative) | 1× | 2–5× |
| Use | Static: cabinet, tray, fixed runs | Dynamic: e-chains, dress packs, arms |

For a robot **arm** specifically, you need more than e-chain (linear-flex) cable — you need **torsion-rated** cable, because arm joints twist the cable about its own axis, not just bend it. Igus chainflex has dedicated robot/torsion variants (the CFROBOT series) rated in degrees of twist per meter over millions of cycles. Standard e-chain cable subjected to torsion fails fast because its strand geometry is optimized for bending, not twisting.

> **Rule:** Never put standard panel cable in a moving application. If it flexes with the machine, it must be a rated continuous-flex cable (linear-flex for e-chains, torsion-rated for arm joints). The cost premium is 2–5×; the failure-rate difference is 100×.
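A rated cycle figure becomes useful once it is set against the duty of the axis. The sketch below converts a motion profile into cycles per year and divides a rated life by it. The duty numbers are assumptions for illustration (12 productive hours across two shifts and 260 working days is one combination that lands near the roughly 5.6 million cycles per year quoted earlier), and a rated figure only applies at the bend radius it was published for.

```python
def cycles_per_year(cycles_per_min: float, hours_per_day: float,
                    days_per_year: float) -> float:
    """Bend cycles accumulated per year at one flexing point."""
    return cycles_per_min * 60.0 * hours_per_day * days_per_year


def years_to_rated_life(rated_cycles: float, annual_cycles: float) -> float:
    """Years until the rated cycle count is reached at the given duty."""
    if annual_cycles <= 0:
        raise ValueError("annual cycle count must be positive")
    return rated_cycles / annual_cycles


if __name__ == "__main__":
    # Assumed duty: 30 cycles/min, 12 productive h/day, 260 days/year.
    annual = cycles_per_year(30, 12, 260)
    for label, rated in (("bus cable, 5 M cycles", 5e6),
                         ("servo cable, 50 M cycles", 50e6)):
        print(f"{label}: {years_to_rated_life(rated, annual):.1f} years")
```

Under those assumptions a 5-million-cycle bus cable reaches its rated life in under a year while a 50-million-cycle servo cable lasts several years, which is the kind of gap that should set the service interval.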
A practical note on procurement: continuous-flex cable is sold by both the meter and as pre-assembled "readycable" / readychain harnesses (Igus, Lapp). For low volumes, buying pre-assembled and pre-tested harnesses is often cheaper than the labor and tooling to build and verify your own, and it comes with the same cycle guarantee. ## The dress pack & cable management on moving arms On an articulated robot — a six-axis arm, a [collaborative robot](/posts/collaborative-robots-cobots-ultimate-guide/), a humanoid limb — the bundle of cables and hoses that runs from the base to the tool is called the **dress pack** (also "dressing" or "umbilical"). It carries motor power, encoder feedback, brake supply, tool I/O, pneumatics, fluids, and sometimes vision/network. It is the single most failure-prone subsystem on a working arm, because it has to follow the most complex motion in the machine. The core problem: as the arm articulates, the dress pack must extend, retract, bend, and twist, all while staying out of the work envelope, off the part, and clear of pinch points. Do it badly and you get cables snagging, abrading on the structure, kinking at a joint, or — most commonly — accumulating torsion at axis 4 and axis 6 (the wrist roll axes) until a conductor cracks. The dressing strategies, roughly in order of sophistication: - **External dress pack with retraction.** The classic: a corrugated hose or sleeve carrying the bundle runs along the outside of the arm, managed by spring-return retraction units, swivels, and clamps (Leoni, Murrplastik, Igus triflex R). The triflex R is purpose-built for arms — a 3D-articulating cable carrier that bends and twists with the wrist while enforcing a minimum bend radius and limiting torsion. - **Through-arm / internal routing.** High-end arms route cables internally through hollow joints. Cleaner and protected, but tighter bend radii and harder to service. 
Whoever designs the joint must reserve the internal cable channel and respect the bend radius through every axis. - **Hybrid.** Internal through the lower axes, external dress pack from axis 3 to the tool, where the motion is most complex and serviceability matters most. The killer on arms is **torsion at the wrist.** Axis 6 (and often 4) rotates continuously or near-continuously over a wide range. A cable clamped on both sides of that joint sees the full twist concentrated in a short length — degrees-of-twist-per-meter shoots up and the conductor fails. The fixes: use torsion-rated cable (CFROBOT), allow a generous free length of cable across the joint so the twist is distributed over more length, use swivels that let the dress pack rotate with the axis instead of fighting it, and — past a certain duty — give up on cable entirely and use a **slip ring** at the rotating joint (covered later). > **Rule:** On an arm, design the dressing for *torsion first, bending second.* Reserve free cable length across rotary joints so twist is distributed; clamp the dress pack at the joints, not in the middle of a flex zone, so motion happens where the cable is rated for it. Worth saying plainly: this is mechanical design, not electrical. The cable engineer and the mechanical designer have to sit together while the arm is still in CAD. The number of robots whose dress pack was "figured out later" and now eats a service visit every few months is enormous. ## Drag chains, e-chains & cable carriers The **energy chain** — e-chain, drag chain, cable carrier, cable track — is the articulated plastic (or steel) chain that guides and protects cables along a linear axis: gantries, linear actuators, the X/Y/Z of a CNC or 3D printer, the travel of an AMR's docking arm, the long axis of a SCARA's traverse. Igus is the dominant name (the term "e-chain" is theirs); Kabelschlepp (Tsubaki), Murrplastik, and Brevetti are the other major suppliers. 
The e-chain does three jobs: it enforces a **minimum bend radius** (the cables can never bend tighter than the chain's radius), it **separates and guides** cables so they don't tangle or abrade, and it **protects** them from the environment and from being snagged. The cable still has to be continuous-flex — the chain just guarantees it bends within spec. ### Fill rules How you pack the e-chain is most of the game. The cardinal rules: - **Clearance.** Round cables need radial clearance to move within the chain. Igus recommends roughly **10% diameter clearance** for cables that should lie freely and up to **20%** for cables that need to move axially within the chain (which long e-chains require). Pack them tight and they bind, abrade, and corkscrew. - **No uncontrolled stacking.** Cables should lie side by side in a single layer where possible. If you must stack, use horizontal **shelf dividers** so the upper layer can't crush or abrade the lower one. Cables lying loose on top of each other in a long-travel chain will migrate, twist, and fail. - **Separate by type and size.** Use vertical dividers to give each cable (or small group) its own compartment. Crucially, keep **power away from signal** (EMC) and **keep large heavy cables separate from small light ones** so the heavy ones don't crush the light ones at the bend. - **Weight balance.** Distribute cables so the chain's weight is symmetric about its center; an unbalanced chain tilts and wears one side. - **Fill fraction.** As a working limit, keep the filled cross-section under ~60–80% of the chain's usable interior so cables can move. > **Rule:** In an e-chain, place the heaviest cables at the outside, lightest in the middle, give every cable its own compartment via dividers, and keep at least 10% diameter clearance. Power and signal go in separate compartments, ideally with a grounded divider or separate chains. 
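The clearance and fill rules above lend themselves to a quick layout check. The following is a minimal sketch for one single-layer compartment, assuming round cables lying side by side; the function name, the way clearance is applied to each cable's diameter, and the default limits (10% clearance, 80% fill) are assumptions taken from the ranges quoted above, not a supplier's configurator logic.

```python
import math


def check_compartment(width_mm, height_mm, cable_ods_mm,
                      clearance=0.10, max_fill=0.80):
    """Return a list of rule violations for one single-layer compartment.

    width_mm / height_mm: usable interior of the compartment.
    cable_ods_mm: outer diameters of the cables lying side by side in it.
    """
    if not cable_ods_mm:
        return []

    problems = []

    # Each cable needs its own diameter plus the clearance allowance.
    required_width = sum(od * (1.0 + clearance) for od in cable_ods_mm)
    if required_width > width_mm:
        problems.append(
            f"width: need {required_width:.1f} mm, have {width_mm:.1f} mm")

    # The tallest cable sets the required height.
    required_height = max(cable_ods_mm) * (1.0 + clearance)
    if required_height > height_mm:
        problems.append(
            f"height: need {required_height:.1f} mm, have {height_mm:.1f} mm")

    # Filled cross-section as a fraction of the usable interior.
    cable_area = sum(math.pi * (od / 2.0) ** 2 for od in cable_ods_mm)
    fill = cable_area / (width_mm * height_mm)
    if fill > max_fill:
        problems.append(f"fill: {fill:.0%} exceeds {max_fill:.0%}")

    return problems


if __name__ == "__main__":
    print(check_compartment(30.0, 14.0, [12.0, 12.0]))
    print(check_compartment(25.0, 14.0, [12.0, 12.0]))
```

For cables that must move axially in a long chain, the same check can be run with `clearance=0.20`. It does not cover weight balance or power/signal separation, which still need a look at the layout drawing.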
### Bend radius and the chain itself

Every e-chain has a **bend radius (KR)** — the radius it forms at the curve. This must be **larger than or equal to the largest cable's minimum dynamic bend radius.** If your biggest cable needs 10×d and that works out to 90 mm, the chain's KR must be ≥90 mm. Choosing a chain with too small a KR to save space is a classic way to kill the cables it's supposed to protect.

Other chain sizing parameters:

- **Travel length and unsupported length.** Short chains run **unsupported** (self-supporting: the upper run spans freely above the lower run without touching it). Beyond an unsupported limit (depends on chain size and load), the upper run sags and you need a **gliding** configuration where the upper run rides on the lower run in a guide trough. Long-travel gantries (many meters) are always gliding.
- **Speed and acceleration.** E-chains have max speed (often up to 10 m/s for unsupported, less for gliding) and acceleration ratings. High dynamics drive you to lighter chains and tighter fill control.
- **Inner height/width.** Pick from the fill once you've laid out compartments and clearances.

For a typical robot linear axis: pick the chain KR from your largest cable's bend radius, lay out the cables with dividers (power separated from signal), keep 10% clearance, verify the fill fraction, and confirm the travel is within the unsupported limit or specify a guide trough. Igus and Kabelschlepp both have online configurators that do this sizing if you feed them the cable list.

## Bend radius & strain relief rules

Bend radius is the master constraint of robot wiring. Get it wrong and nothing else matters — your perfectly chosen continuous-flex cable will fail at a fraction of its rated life because you bent it too tight somewhere.

The convention is **multiples of outer diameter (×d).** A cable with 12 mm OD bent at 8×d has a 96 mm bend radius.
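The ×d convention and the chain KR rule from the previous subsection reduce to a few lines of arithmetic. The sketch below is illustrative only: the cable list and the available KR values are hypothetical, and real chains come in the KR steps their supplier publishes.

```python
def min_bend_radius_mm(od_mm: float, factor_xd: float) -> float:
    """Minimum bend radius from outer diameter and its ×d multiple."""
    return od_mm * factor_xd


def governing_cable(cables):
    """Return (name, radius_mm) of the cable needing the largest radius.

    cables: iterable of (name, od_mm, factor_xd) tuples.
    """
    radii = [(name, min_bend_radius_mm(od, factor)) for name, od, factor in cables]
    return max(radii, key=lambda item: item[1])


def pick_chain_kr(required_mm, available_kr_mm):
    """Smallest available KR that is >= the required radius, else None."""
    fitting = [kr for kr in sorted(available_kr_mm) if kr >= required_mm]
    return fitting[0] if fitting else None


if __name__ == "__main__":
    # Hypothetical bundle: (name, outer diameter in mm, dynamic ×d factor).
    bundle = [("servo", 9.0, 10.0), ("bus", 7.0, 12.5), ("sensor", 5.0, 7.5)]
    name, required = governing_cable(bundle)
    # Hypothetical KR steps offered for one chain series.
    kr = pick_chain_kr(required, [63, 75, 100, 125])
    print(f"governing cable: {name}, needs {required:.1f} mm, chain KR {kr} mm")
```

Selecting the next KR step above the governing cable's radius is the code form of rounding up: the whole bundle inherits the radius of its most demanding member.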
The numbers split by application:

| Application | Typical minimum bend radius |
|---|---|
| Fixed installation (no movement) | 4–5×d |
| Occasional flex (e.g. service loops) | 7.5×d |
| Continuous flex in e-chain (linear) | 7.5–12.5×d |
| Torsion (robot arm joints) | 10–15×d (per cable spec) |
| Bus/Ethernet data cable, dynamic | 10×d (often stricter) |

> **Rule:** Use the *largest* required bend radius among all cables in a bundle as the design radius for the whole bundle, and round up. It costs almost nothing to give a cable a bigger radius; it costs a field failure to give it a smaller one.

Data and coax cables are often stricter than power cables because tight bends change impedance and degrade signal — a Cat6 cable bent below its minimum radius can fail certification even if it's mechanically fine. Always check the data cable's spec separately.

### Strain relief

Strain relief keeps mechanical load — tension, weight, vibration — off the *electrical termination.* The conductor-to-terminal joint (crimp, solder, IDC) is the weakest point in any harness; if the cable can pull or wiggle at that joint, it will fatigue and fail there. Rules:

- **Anchor the cable, not the conductor.** Clamp the jacket near every connector and at intervals along the run. The connector's strain-relief gland or backshell grips the jacket; the conductors inside should have a tiny bit of slack so they're never in tension.
- **Service loop.** Leave a service loop (a deliberate slack length, often a gentle loop one bend-radius wide) at each connector so you can re-terminate after a failure without re-pulling the whole run, and so thermal expansion and vibration don't load the joint.
- **No flex at the termination.** Connectors and terminations belong in static zones. The flexing must happen in the middle of a rated cable, never at the connector. Clamp on both sides of any flex zone so the motion is contained where the cable is rated for it.
- **Respect the gland.** Cable glands (PG/metric) and connector backshells are rated for a cable OD range and an IP rating only when tightened on the right OD. A gland on too-thin a cable doesn't seal or grip.

A huge fraction of "the connector failed" tickets are actually strain-relief failures: the cable flexed at the connector, fatigued the conductor right at the crimp, and went open. Fix the mechanics and the connector is fine.

## Connectors: coding, families & IP ratings

Connectors are where electrical and mechanical reliability meet, and where vibration goes to do its damage. A connector has to make a low-resistance, stable contact through thousands of mating cycles and millions of vibration cycles, often through dust, coolant, and washdown. Choosing the right family and rating is half of robot wiring reliability.

### Circular connectors: M8 and M12

The **M12 circular connector** (12 mm threaded coupling) is the workhorse of the moving end of industrial robots and automation. **M8** is its smaller sibling for tighter spaces and lower current. They're rugged, vibration-tolerant (screw-locked), available sealed to IP67/IP69K, and — critically — **coded** so you physically can't plug a power cable into an Ethernet port. Learn the coding, because it's the whole point:

| Code | Typical use | Pins | Notes |
|---|---|---|---|
| **A-code** M12/M8 | Sensors, actuators, DC power, DeviceNet, CANopen | 3/4/5/8 | The default. Sensor leads, valve manifolds, general I/O |
| **B-code** M12 | PROFIBUS, legacy fieldbus | 5 | Older fieldbus; declining |
| **C-code** M12 | AC sensors/actuators | 4/5 | Less common |
| **D-code** M12 | Fast Ethernet (100 Mbit/s), PROFINET, EtherCAT | 4 | The classic industrial-Ethernet connector |
| **X-code** M12 | Gigabit Ethernet (1/10 Gbit/s) | 8 | Shielded, 4 pairs; modern data standard |
| **K-code** M12 | AC power | 4+PE | |
| **L-code** M12 | DC power (PROFINET 24 V supply, drives) | 4+FE | Common for 24 V power distribution |
| **S/T-code** M12 | AC / DC power (higher current) | 3+PE / 4 | S-code for AC; T-code for 24 V DC up to ~12 A |

> **Rule:** Match the connector code to the signal, every time. A D-code is 100 Mbit Ethernet; if you need Gigabit (for a 3D camera or a high-rate fieldbus), you need X-code. Specifying the wrong code is a redesign, not a field fix.

M12 connectors come field-wireable (terminate in the field — screw, IDC, or push-in) or pre-molded onto cable (factory-sealed, more reliable IP rating, better flex life). For moving applications, **pre-molded (over-molded) leads on continuous-flex cable** beat field-wired every time on both IP integrity and flex life. Major suppliers: Phoenix Contact, Harting, TE Connectivity, Binder, Lumberg, Murrelektronik, Turck.

### IP ratings

The **IP (Ingress Protection) code** (IEC 60529) is two digits: first = solids/dust, second = water.

- **IP65** — dust-tight, protected against low-pressure water jets. Fine for general factory environments.
- **IP67** — dust-tight, protected against temporary immersion (1 m, 30 min). The common robot default.
- **IP69K** — dust-tight, protected against high-pressure, high-temperature washdown (80 °C, 80–100 bar). Required for food, pharma, and anywhere that gets pressure-washed.

A connector only achieves its IP rating **when mated and torqued**, and an unmated port needs a sealing cap to maintain it.
The cable, gland, and backshell all have to meet the rating too — the chain is only as sealed as its weakest link. ### Heavy-duty rectangular: Harting For multi-circuit, high-power, or mixed power+signal connections — control cabinet to machine, drive to motor, modular tooling — the **Harting Han** series (and competitors: TE HDC, Amphenol, Weidmüller) is the standard. A rectangular metal or plastic hood houses interchangeable insert modules: power contacts, signal contacts, pneumatic, even fiber, in one connector with a lever-lock hood rated to IP65/IP66/IP68. The **Han-Modular** system lets you build exactly the contact mix you need. This is how you make a robot tool or a machine module quick-change. ### D-sub The **D-subminiature** (DB9, DB15, DB25, high-density variants) persists in robotics for encoder feedback, serial, and legacy drive I/O. It's cheap, available, and reliable in static low-vibration use — but the standard latching (jackscrews) is mediocre against vibration unless you actually screw it down, and it's not sealed without a hood. Fine inside a cabinet; questionable on a moving axis. Many servo drives still use D-sub for encoder and command I/O — see [encoders](/posts/encoders-ultimate-guide/). ### Power connectors For DC power distribution and battery connections, the dominant families: - **Anderson Powerpole** — genderless, modular, hot-pluggable, color-coded, 15–45 A in the common PP15/30/45 housings. Ubiquitous in mobile robots, amateur and prototype power. The genderless design means one part number for both ends, and you can gang them into custom arrangements. - **Anderson SB series** (SB50, SB120, SB175, SB350) — high-current battery and charging connectors, 50–350 A, color-keyed by voltage so you can't cross-connect a 24 V and a 48 V charger. The standard for AMR/AGV battery and charge interfaces. - **Molex** (Mega-Fit, Mini-Fit Jr., Micro-Fit) — board-to-wire and wire-to-wire power, a few amps to ~20 A per circuit, dense and cheap. 
The backbone of internal robot power distribution. - **Phoenix Contact / Wago** push-in and spring-cage terminal blocks — the cabinet standard. Spring-cage (push-in) terminals are vibration-proof in a way screw terminals are not; they don't loosen. For anything that vibrates, prefer spring-cage over screw terminals. - **TE / Molex board-to-board** — mezzanine and backplane connectors for stacking PCBs inside compute and drive enclosures. > **Rule:** For battery and charge connections, use mechanically keyed, current-rated, color-coded connectors (Anderson SB) so a 24 V and a 48 V interface are physically impossible to cross-connect. The cost of the mistake is a fire. See [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/). A word on contacts: connector reliability is mostly about the contact interface. Gold plating resists corrosion and fretting and is worth it on signal contacts; tin is cheaper and fine for power where contact pressure is high. **Fretting corrosion** — micro-motion under vibration that wears through plating and builds insulating oxide — is the silent connector killer, and it's why screw-locked, gas-tight, vibration-rated connectors matter on a robot. ## Industrial-network cabling Modern robots are networked machines: the drives talk EtherCAT, the safety PLC talks PROFINET or PROFIsafe, the vision system streams over GigE, and the [PLC/SCADA layer](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/) ties it together. Network cabling has its own rules, and they're stricter than power wiring because the failure mode is data corruption, not just resistance. **Industrial Ethernet** (EtherCAT, PROFINET, EtherNet/IP) runs on shielded twisted-pair copper. The essentials: - **Use shielded cable (S/FTP or SF/UTP)**, not the unshielded Cat5e you'd run in an office. Industrial environments are electrically hostile; the shield is mandatory. 
Two-pair Cat5/Cat5e is the norm for 100 Mbit links (the D-code M12 standard); Gigabit needs all four pairs, and although four-pair Cat5e can carry it, Cat6/Cat6a is the usual choice with X-code M12.
- **Twisted pairs reject common-mode noise.** The twist is the whole reason Ethernet survives near drives — differential signaling on twisted pairs cancels induced noise. Don't untwist more than ~13 mm at a termination.
- **100 m maximum copper segment.** This is the hard physical limit for Ethernet over copper (including patch leads). Beyond it, fiber. EtherCAT and PROFINET inherit this 100 m node-to-node limit.
- **Bend radius for data cable is strict** — typically 8–10×d static and larger under dynamic flex, and a tight bend changes impedance and can fail the link.
- **For moving applications, use continuous-flex Ethernet cable** (Igus chainflex CFBUS, Lapp Ethernet FD). Standard Cat6 in an e-chain fails like any standard cable. Chainflex bus cables are rated for the same millions-of-cycles model as their power cables.

> **Rule:** Real-time fieldbus is intolerant of cabling sloppiness. EtherCAT's distributed-clock sync and the determinism of [real-time control](/posts/real-time-control-systems-ultimate-guide/) assume a clean physical layer. A marginal cable that "mostly works" produces dropped frames, re-transmits, and jitter that show up as intermittent motion faults — not as a clean network error.

For the older **fieldbuses** — CANopen, DeviceNet, PROFIBUS — the rules are similar but the limits differ: CAN bus needs a 120 Ω termination resistor at each end of the trunk and a maximum length that drops as bit rate rises (e.g. ~40 m at 1 Mbit/s, ~500 m at 125 kbit/s). PROFIBUS DP wants the specific purple shielded cable and matched terminators. Get the termination wrong on a CAN bus and you get reflections, errors, and a node that drops off under load. The fieldbus details live in the [industrial automation guide](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/).
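The length and termination limits above are easy to encode as a pre-commissioning sanity check. The sketch below is a rough illustration: the 1 Mbit/s and 125 kbit/s CAN limits are the figures quoted above, while the 500 kbit/s and 250 kbit/s entries are commonly quoted rule-of-thumb values added here as an assumption, and the function names are hypothetical. Actual limits depend on cable, transceivers, and node count.

```python
ETHERNET_MAX_SEGMENT_M = 100.0  # copper segment limit, patch leads included

# Approximate CAN trunk length limits by bit rate (bit/s -> metres).
# 1 Mbit/s and 125 kbit/s come from the text; the others are assumed
# rule-of-thumb values.
CAN_MAX_LENGTH_M = {
    1_000_000: 40.0,
    500_000: 100.0,
    250_000: 250.0,
    125_000: 500.0,
}


def ethernet_segment_ok(length_m: float) -> bool:
    """True if a copper Ethernet segment is within the 100 m limit."""
    return length_m <= ETHERNET_MAX_SEGMENT_M


def can_trunk_ok(bit_rate: int, length_m: float) -> bool:
    """True if the trunk length fits the rule-of-thumb limit for this rate."""
    if bit_rate not in CAN_MAX_LENGTH_M:
        raise ValueError(f"no rule-of-thumb limit stored for {bit_rate} bit/s")
    return length_m <= CAN_MAX_LENGTH_M[bit_rate]


def expected_can_resistance_ohms(terminators: int, r_each: float = 120.0) -> float:
    """Resistance seen between CAN_H and CAN_L on an unpowered bus.

    Terminators sit in parallel across the pair, so n equal resistors
    measure r_each / n.
    """
    if terminators <= 0:
        return float("inf")
    return r_each / terminators


if __name__ == "__main__":
    print(ethernet_segment_ok(85.0))
    print(can_trunk_ok(1_000_000, 60.0))
    print(expected_can_resistance_ohms(2))
```

With both terminators fitted, a multimeter across the pair on an unpowered bus should read about 60 Ω; about 120 Ω suggests one is missing, and about 40 Ω suggests a third has been added somewhere.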
## EMI/EMC, shielding & grounding A robot is an electromagnetic nightmare by construction: PWM drives switching tens of amps at tens of kilohertz with nanosecond edges, sitting centimeters from millivolt encoder signals and megahertz fieldbus data. Electromagnetic compatibility — keeping the noisy parts from corrupting the quiet parts — is a design discipline, not a fix you add when the encoder glitches. The three mechanisms of coupling, and the defenses: - **Capacitive (electric-field) coupling** — fast voltage edges couple through stray capacitance. Defense: shielding (the shield intercepts the field), and physical separation. - **Inductive (magnetic-field) coupling** — changing currents induce voltage in nearby loops. Defense: minimize loop area (twisted pairs, tight power+return), separation, and keep aggressor and victim cables crossing at 90°, never running parallel. - **Conducted coupling** — noise riding on shared conductors and ground returns. Defense: separate returns, single-point grounding for sensitive circuits, and filtering (ferrites, common-mode chokes). ### The separation rule The cheapest, most effective EMC measure is physical separation. Power and motor cables are aggressors; signal, encoder, and data cables are victims. > **Rule:** Keep motor/drive power cables and signal/data cables in separate routes — separate e-chain compartments, separate trays, separate conduits — with as much air between them as you can afford. If they must cross, cross at 90°. Never run a servo cable parallel and adjacent to an encoder cable for any distance. A rough working guide from automation practice: maintain ≥100–200 mm separation between power and signal cables running in parallel, more for long parallel runs, and use a grounded steel divider in shared trays. ### Shield grounding — the decision that trips everyone Shielding only works if the shield is grounded correctly, and "correctly" depends on frequency. 
This is the single most misunderstood topic in robot wiring. - **Single-end grounding (one end only):** ground the shield at one end (usually the source/cabinet end) for **low-frequency analog signals** (thermocouples, slow analog sensors, audio). This prevents a **ground loop** — if you ground both ends and the two grounds are at different potentials, current flows through the shield and injects noise. Single-end grounding gives the shield a drain without a loop. - **Both-end grounding (360° at each end):** ground the shield at **both ends**, connected 360° around to the connector backshell or an EMC gland, for **high-frequency signals, data cables, and drive cables.** At high frequency the shield must be grounded both ends to be effective against the dominant coupling, and the small ground-loop current is the lesser evil. The 360° termination is critical — a "pigtail" (twisting the shield into a wire and landing it on a pin) ruins high-frequency shield performance because it adds inductance. Use EMC cable glands and backshells that clamp the shield all the way around. > **Rule:** Low-frequency analog → ground the shield at one end. High-frequency / data / drive cables → ground both ends with a 360° backshell or EMC gland. Never pigtail a shield on a high-frequency cable. For the motor cable specifically (the worst aggressor), follow the drive maker's EMC guide to the letter: shielded motor cable, shield bonded 360° to the drive's EMC plate at the drive end and to the motor housing at the motor end, with the shortest possible pigtail-free connection. This is non-negotiable for passing CE/EMC and for not corrupting your own feedback. See [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/). ### Ferrites and filters Snap-on **ferrite cores** add common-mode impedance at high frequency — cheap insurance on signal and data cables where you've got a residual noise problem. **Common-mode chokes** do the same, designed in. 
They're a complement to, not a substitute for, proper shielding and separation. If you find yourself adding ferrites to fix a problem, first check that your separation and shield grounding are right — ferrites are a patch, not a foundation. ### Grounding architecture Establish a **single-point (star) ground** reference for sensitive electronics so all signal returns reference one node and you don't build ground loops through the chassis. Keep the **power ground** (motor returns, high current) separate from the **signal ground** (logic, sensors) and tie them at one carefully chosen point. The protective earth (PE) bonds the chassis for safety and is its own conductor. Conflating these three grounds is how motor current ends up flowing through your encoder return and your control loop starts seeing phantom position errors. ## Slip rings: continuous-rotation joints Sometimes a joint rotates not just back and forth but **continuously, forever** — a radar turret, a camera pan axis with unlimited rotation, a rotary indexing table, a wind-turbine pitch system, a cable reel. You cannot run a cable across that joint; it would wind up and snap in a few revolutions. The device that solves this is the **slip ring** (also rotary electrical joint, rotary union for fluids). A slip ring transfers power and/or signal between a stationary part (stator) and a rotating part (rotor) through a sliding electrical contact: conductive rings on the rotor, with **brushes** (or fiber brushes, or liquid metal) riding on them from the stator. The rotor turns indefinitely; the contacts maintain electrical connection through the rotation. ### Contact technologies - **Composite/metal brush rings** — the classic. Carbon or metal-graphite brushes on metal rings. Cheap, high current capability, but electrically noisy (variable contact resistance), wear over time (brush dust, finite brush life), and not great for clean signal. Fine for power; poor for sensitive data. 
- **Precious-metal (gold-on-gold) fiber-brush rings** — multiple fine gold-alloy wire brushes contacting a gold ring. Many parallel contact points mean low, stable contact resistance and very low electrical noise — good enough for encoder signals, low-level analog, and bus data. **Moog** is the reference here; their fiber-brush technology is the standard for clean signal transfer across rotation. Longer life, lower maintenance, higher cost. - **Capsule slip rings** — small-diameter, pre-packaged units (often gold-on-gold) for low-current signal and power in compact rotary joints. **Servotecnica** (and Moog, Stemmann, JINPAT) make capsule rings rated to carry Ethernet, USB, video, and bus protocols across rotation. - **Liquid-metal / mercury-wetted** — extremely low, stable contact resistance and low noise, used for high-fidelity signal, but with handling/safety constraints (mercury) that have pushed most applications to fiber-brush. ### What you can pass A modern slip ring is a hybrid module. A single unit can carry, in concentric ring groups: - **Power** — from a few amps to hundreds of amps per ring; high-current rings for the motor/drive bus. - **Signal/data** — encoder, analog, and increasingly **Ethernet (including Gigabit), EtherCAT, PROFINET, CAN, USB, and HDMI/video** through dedicated high-bandwidth channels (sometimes capacitive or contactless rotary couplers for the highest data rates). - **Fluid and pneumatics** — combine the slip ring with a **rotary union** (a coaxial fluid joint) through a hollow-bore (through-bore) slip ring, so hydraulics, coolant, vacuum, or compressed air cross the same rotating axis as the electrical signals. This is how a rotary table or a turret gets power, data, and pneumatics across one continuous-rotation joint. > **Rule:** Use a cable across a joint that oscillates within a bounded angle (use torsion-rated cable and a service loop). 
Use a slip ring when the joint must rotate continuously or through many turns — anything past a few hundred degrees of cumulative rotation is slip-ring territory. ### Selecting a slip ring Key parameters: number and type of circuits (power vs signal, and the protocol for data channels), current per ring and voltage rating, rotational speed (RPM) and whether continuous or intermittent, bore size (through-bore for fluid/shaft pass-through), IP rating, expected life (revolutions), and electrical noise spec for the signal rings. For a robot that just needs clean encoder + Ethernet + 24 V across a pan axis, a compact gold-on-gold capsule ring (Servotecnica, Moog) is the right answer. For a high-current turret with hydraulics, a large through-bore hybrid ring with a rotary union. Slip rings are wear parts. Brush rings especially have a finite revolution life and need brush inspection/replacement; fiber-brush and capsule rings last far longer but still age. Put them in the maintenance schedule. ## Labeling, harness build & service The difference between a robot you can service in ten minutes and one that eats an afternoon is documentation and labeling — decisions made at build time that pay back for the life of the machine. **Label every conductor and every connector, at both ends.** Use printed heat-shrink labels (not handwritten tape) with a scheme that matches the schematic: wire number, function, or both. When a fault hits at 2 a.m., the tech traces a labeled wire to a drawing in minutes; an unlabeled harness is a multi-hour continuity-buzzing exercise. Common schemes: number wires per the wiring diagram (W1, W2...), or function-code them (MOT1-U, ENC3-A+). Pick one and be consistent. **Build to a documented harness drawing.** A harness drawing specifies every wire's gauge, color, route, length, termination, and label. It's the build instruction and the service reference. 
For repeated builds, a **formboard** (a 1:1 layout board with pegs) makes harness assembly repeatable and fast. **Crimp, don't solder, for flex and vibration.** A proper crimp (with the right tool and die) makes a gas-tight cold weld that's mechanically robust and vibration-proof. A soldered joint creates a rigid section where the wire flexes right at the edge of the solder wick — a stress concentrator that fatigues and cracks. Crimp terminations on flexing and vibrating harnesses; solder only in static, supported locations. Verify crimps with a pull test against the spec. **Color conventions** help: follow regional standards for power (and your own consistent convention for signal). The point is consistency — a tech who knows your blue-is-always-24V convention works faster and makes fewer mistakes. > **Rule:** Labeling and harness documentation are design outputs, not paperwork. Budget time for them. The robot that's documented and labeled has a service time a fraction of the one that isn't, for its entire life. **Connectorize for service.** Break the harness into segments at connectors so a failed segment swaps without re-pulling the whole machine. The e-chain cable that fails should be a replaceable assembly with connectors at both ends, not a soldered-in run. This is where pre-assembled, connectorized continuous-flex harnesses (Igus readychain, Lapp) pay off twice: cycle life and serviceability. ## Failure modes & preventive maintenance Knowing how robot wiring fails tells you what to inspect and when. The dominant modes, roughly in order of frequency on a working machine: - **Flex fatigue / conductor cracking.** The #1 mode on moving axes. Strands work-harden and crack from repeated bending or torsion. Symptom: intermittent faults — flickering encoder, dropping fieldbus node, motor fault under specific arm poses — that come and go with position. Cause: under-rated cable, bend radius too tight, torsion at a wrist, or simply reaching end of cycle life. 
**Prevention:** rated continuous-flex/torsion cable, correct bend radius, scheduled replacement before cycle life is reached. - **Jacket abrasion / chafe-through.** Cable rubbing on structure, edges, or other cables wears through the jacket and then the insulation, eventually shorting. Symptom: insulation fault, intermittent short, sometimes a tripped GFCI/RCD. **Prevention:** proper routing, e-chain dividers, edge protection, clearance. - **Connector fretting / loosening.** Vibration micro-motion wears contacts and backs off un-locked connectors. Symptom: rising contact resistance, intermittent open, heat at the connector. **Prevention:** screw-locked vibration-rated connectors, spring-cage terminals over screw terminals, gold on signal contacts, torque to spec. - **Strain-relief failure at terminations.** Cable flexes at the connector instead of in the rated zone; conductor fatigues at the crimp. Symptom: open or intermittent right at a connector. **Prevention:** clamp the jacket, service loops, no flex at terminations. - **Shield/ground degradation.** A broken shield bond or a pigtail that fatigues lets noise in. Symptom: EMC problems that appear over time — encoder noise, comms errors. **Prevention:** 360° terminations, inspect shield bonds. - **Thermal aging.** Overloaded or over-bundled cable runs hot, ages insulation, and accelerates every other mode. **Prevention:** correct ampacity derating for bundling and ambient. - **Fluid/chemical attack.** Coolant, oil, or cleaning chemicals attack the wrong jacket material. Symptom: jacket swelling, cracking, embrittlement. **Prevention:** chemical-compatible jacket (PUR for oil/abrasion; specific grades for aggressive media), correct IP rating. ### Preventive maintenance > **Rule:** Treat dynamic cables and slip rings as wear parts with a replacement schedule, the same as bearings and belts. The cheapest failure is the one you replaced before it happened. 
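The rule above amounts to a small budget calculation: a rated cycle life, a fraction of that life at which you swap the part, and an estimate of how fast the machine consumes cycles. The sketch below is a minimal illustration, not a maintenance system. The `WearPart` class, its field names, the example numbers, and the default replacement fraction of 0.75 are all assumptions made for the example; take the real rated life from the cable or slip-ring datasheet.

```python
from dataclasses import dataclass


@dataclass
class WearPart:
    """A dynamic cable or slip ring tracked as a wear part (hypothetical model)."""
    name: str
    rated_cycles: int       # rated life from the datasheet, in flex cycles or revolutions
    cycles_used: int        # cycles logged so far (e.g. estimated from controller motion logs)
    cycles_per_day: float   # estimated consumption rate at the current duty


def replace_at(part: WearPart, fraction: float = 0.75) -> int:
    """Cycle count at which the part should be swapped (fraction is an assumed policy)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(part.rated_cycles * fraction)


def days_until_replacement(part: WearPart, fraction: float = 0.75) -> float:
    """Days left before the scheduled swap; 0.0 means the part is already due."""
    remaining = replace_at(part, fraction) - part.cycles_used
    return max(remaining, 0) / part.cycles_per_day


# Example with made-up numbers: a 5-million-cycle e-chain cable on a busy axis.
cable = WearPart("axis-2 e-chain harness", rated_cycles=5_000_000,
                 cycles_used=3_000_000, cycles_per_day=10_000)
print(replace_at(cable), days_until_replacement(cable))
```

Clamping the remaining count at zero means an overdue part reports 0 days rather than a negative number, which is easier to alarm on.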
A practical PM program: - **Visual inspection** of dress packs and e-chains on a schedule (monthly for high-duty machines): look for jacket damage, kinks, cables migrating out of compartments, chain link wear, abrasion marks, corkscrewing. - **Track cycle counts** against the cable's rated life and schedule replacement at a fraction (e.g. 70–80%) of rated cycles, before the failure window. The robot controller often logs joint motion — use it to estimate flex cycles. - **Thermal check** under load with a thermal camera or spot probe: hot connectors mean rising contact resistance (fretting); hot cable runs mean overload or over-bundling. - **Connector inspection and re-torque** at intervals; verify locking, look for corrosion, re-seat washdown caps on unmated ports. - **Slip ring service** per the manufacturer: brush inspection/replacement for brush rings, contact-resistance and noise checks for signal rings. - **Keep spares of the dynamic assemblies** — the e-chain cable harness, the dress pack, the slip ring brushes. Pre-assembled connectorized harnesses turn a multi-hour failure into a ten-minute swap. The whole philosophy: the static wiring you build once and forget; the dynamic wiring you design as a fatigue component, route within its bend radius, document, label, and replace on a schedule. Do that and wiring stops being your top field-failure mode and goes back to being plumbing — the way it should have been all along. ## Frequently asked questions **Can I use ordinary stranded hookup wire in a drag chain?** No. Ordinary (coarse-strand) hookup or panel wire work-hardens and cracks within weeks to months under continuous flex. Use rated continuous-flex cable (Igus chainflex, Lapp Ölflex FD/Chain) for e-chains, and torsion-rated cable (chainflex CFROBOT) for robot arm joints. The 2–5× cost premium buys roughly 100× the cycle life. **What's the difference between continuous-flex and high-flex cable?** They're the same idea, marketed under different names. 
"Continuous-flex," "high-flex," "flexible," and "chain-suitable" all mean cable engineered with fine high-strand-count conductors, bundle stranding, and a low-friction jacket for repeated bending. The thing to check is the *rated bend-cycle life at a stated bend radius* — that number, not the adjective, tells you what you're buying. **How do I pick a wire gauge for a motor feed?** Satisfy two limits. First ampacity (with bundle and ambient derating) so the cable doesn't overheat. Then voltage drop — compute `V_drop = I × ρ × L_roundtrip / A` and keep it under ~3% of the bus voltage. On low-voltage DC robots, voltage drop usually forces a bigger gauge than ampacity alone. For drive-to-motor specifically, follow the drive maker's table since PWM RMS current and EMC both factor in. **A-code, D-code, X-code — what do M12 codes mean?** The code is a mechanical keying that matches the connector to its signal type: A-code for sensors/DC power and CAN/DeviceNet, B-code for old PROFIBUS, D-code for 100 Mbit Ethernet (PROFINET/EtherCAT), X-code for Gigabit Ethernet, and L/T/S/K codes for various AC/DC power. The keying physically prevents plugging a power lead into a data port. **Should I ground a cable shield at one end or both?** Frequency-dependent. Ground at *one end* for low-frequency analog signals to avoid a ground loop. Ground at *both ends* with a 360° backshell/EMC-gland termination for high-frequency signals, data cables, and motor/drive cables, where both-end grounding is more effective against the dominant coupling. Never use a pigtail on a high-frequency cable — it ruins shield performance. **Why does my encoder glitch only when the motor runs hard?** Almost always EMI from the motor/drive cable coupling into the encoder cable. Check: are power and signal cables running parallel and close? Separate them. Is the motor cable shielded and bonded 360° at both ends? Is the encoder shield grounded correctly? 
Is the encoder cable continuous-flex and intact (not a cracked-strand intermittent)? See [encoders](/posts/encoders-ultimate-guide/) and [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/). **When do I need a slip ring instead of a cable?** When the joint rotates continuously or through many turns. A cable (torsion-rated, with a service loop) handles bounded oscillation — typically up to a few hundred degrees. Past that, the cable winds up and fails, and you need a slip ring. Turrets, unlimited pan axes, rotary tables, and cable reels are slip-ring applications. **Can a slip ring carry Ethernet?** Yes. Modern gold-on-gold fiber-brush and capsule slip rings (Moog, Servotecnica) carry Gigabit Ethernet, EtherCAT, PROFINET, USB, CAN, and video across a rotating joint, alongside power rings and — with a through-bore and rotary union — fluid/pneumatic lines. Specify the data protocol and rate explicitly; the highest rates may use dedicated contactless rotary couplers. **How tight can I bend a continuous-flex cable?** Down to the cable's rated dynamic bend radius, typically 7.5–12.5× the outer diameter (×d) for e-chain use, more for torsion. Fixed installation tolerates 4–5×d. Data cables are often stricter. Always use the largest required radius in a bundle as the design radius, and choose the e-chain's bend radius to be ≥ the largest cable's minimum. **Screw terminals or push-in (spring-cage) terminals?** Spring-cage / push-in (Wago, Phoenix Contact) for anything that vibrates — they're gas-tight and don't loosen. Screw terminals loosen under vibration and thermal cycling and need re-torquing. On a robot, prefer spring-cage; if you must use screws, schedule re-torque inspections. **How do I size an e-chain?** Pick the chain's bend radius (KR) to be ≥ the largest cable's minimum dynamic bend radius. 
Lay the cables out with dividers — heaviest outside, lightest inside, power separated from signal — with ~10% diameter clearance (20% for long travel needing axial movement). Keep fill under ~60–80% of the interior. Check travel against the unsupported limit; add a guide trough for long gliding runs. Igus and Kabelschlepp have online configurators. **Should I build harnesses or buy pre-assembled?** For low to medium volumes, pre-assembled connectorized continuous-flex harnesses (Igus readycable/readychain, Lapp) usually win on total cost: no tooling, factory-tested IP and continuity, and the same cycle-life guarantee as the raw cable. They also turn a field failure into a fast connectorized swap. Build your own when volume justifies the formboard and tooling, or when the geometry is too custom for catalog assemblies. ## Changelog - **2026-06-15** — Initial publication. --- # Robot Safety & Functional Safety: The Ultimate Guide URL: https://blog.robo2u.com/posts/robot-safety-functional-safety-ultimate-guide/ Published: 2026-06-14 Updated: 2026-06-20 Tags: robot-safety, functional-safety, iso-10218, iso-13849, risk-assessment, emergency-stop, sil-pl, safety-rated, guide Reading time: 39 min > An engineer's deep dive into robot functional safety: ISO 12100, ISO 10218-1/-2 (2025), ISO/TS 15066, ISO 13849 PL & Categories, IEC 62061 SIL, stop categories, STO/SS1/SLS, ISO 13855 distances, safe fieldbuses, and validation. There is a comfortable lie in this industry that safety is a paperwork problem — that you buy a CE-marked robot, hire someone to fill in a risk-assessment template, glue a yellow fence around the cell, and the auditor goes away happy. That lie kills people. Not often, because the standards are good, but it kills people. The standards work precisely because somebody, somewhere, treated them as engineering — as a set of quantitative requirements about how reliably a stop function will execute when a hand is where it shouldn't be. 
This guide is the long version for the people who actually own the risk: the controls engineers, the integrators, the machine builders, and the safety engineers who sign the Declaration of Conformity. We will walk the full stack — from why functional safety exists, through the standards map (ISO 12100 down to ISO/TS 15066), through risk assessment, the safety functions themselves (E-stop, protective stop, STO/SS1/SS2/SLS), guarding hardware, and then the quantitative core: Performance Level under ISO 13849-1 and SIL under IEC 62061, with worked examples. Real numbers with units, opinions with the reasons attached. **The take**: Functional safety is engineering, not paperwork. The document trail is the *evidence* that the engineering happened — it is not the engineering. A safety function has a measurable probability of dangerous failure per hour, a measurable response time, and a measurable stopping distance, and if you cannot put numbers with units on all three, you have not designed a safety function — you have decorated a machine with safety-coloured components and hoped. Buy the architecture first (the redundancy, the diagnostics, the rated components), then let the paperwork record what you built. Companion reading: [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/), [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/), [industrial automation: PLC, SCADA & fieldbus](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/), and [mobile robots: AMR & AGV](/posts/mobile-robots-amr-agv-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why functional safety exists](#why-functional-safety) 3. [The standards map (Type A/B/C)](#standards-map) 4. [The risk assessment process](#risk-assessment) 5. [Safety functions: E-stop, protective stop, STO/SS1/SS2/SLS/SOS](#safety-functions) 6. [Guarding & safeguards](#guarding) 7. [Performance Level (ISO 13849-1)](#performance-level) 8. 
[SIL (IEC 62061 / IEC 61508) and PL↔SIL mapping](#sil) 9. [Safety PLCs, safe I/O & safety fieldbuses](#safety-controls) 10. [Minimum distance & guard placement (ISO 13855)](#minimum-distance) 11. [Cobots & collaborative safety vs traditional guarding](#cobots) 12. [AMR / mobile machine safety (ISO 3691-4, R15.08)](#mobile) 13. [Validation, documentation & CE compliance](#validation) 14. [Frequently asked questions](#faq) ## Key takeaways - **Functional safety is a probability argument.** A safety function is not "safe" or "unsafe" — it has a *probability of dangerous failure per hour* (PFHD) or an *average probability of dangerous failure on demand*, and the whole discipline is about driving that number low enough for the risk you face. ISO 13849-1 calls the result a Performance Level (PL a–e); IEC 62061/61508 call it a Safety Integrity Level (SIL 1–4). - **Three standard types, in order.** ISO 12100 (Type A, principles) sits on top; ISO 13849-1 and IEC 62061 (Type B, generic functional safety) sit in the middle; ISO 10218-1/-2 and ISO 3691-4 (Type C, machine-specific) sit at the bottom and **take precedence** where they deviate. - **ISO 10218 was revised in 2025.** The new ISO 10218-1:2025 (robot) and ISO 10218-2:2025 (robot system/cell integration) folded the bulk of ISO/TS 15066's collaborative content into the normative standards and tightened requirements on safety-rated functions. They replaced the 2011 editions. - **Risk assessment drives everything (ISO 12100).** Identify hazards, estimate risk from **severity × frequency/duration of exposure × probability × possibility of avoidance**, then reduce by the hierarchy: inherently safe design → safeguarding → information for use. PLr (the *required* PL) comes straight out of the ISO 13849-1 risk graph. - **Stop categories (IEC 60204-1) are about power, not speed.** Category 0 = immediate removal of power (uncontrolled stop). Category 1 = controlled stop *then* remove power. 
Category 2 = controlled stop with power maintained. E-stop must be Cat 0 or Cat 1 only. - **Drive-integrated safety functions (IEC 61800-5-2) are the modern toolkit.** STO, SS1, SS2, SLS, SOS, SLP, SDI and friends live in the servo drive itself, so you stop or limit motion without dropping a contactor. STO underpins Cat 0; SS1 underpins Cat 1. - **Performance Level is an architecture, not a component.** PL comes from Category (B/1/2/3/4) + MTTFD + DCavg + CCF, evaluated per channel. You cannot buy "PL e" — you build a Category 3 or 4 dual-channel structure with diagnostics and good components and then *verify* you reached PL e. - **PL and SIL map, but are not identical.** PL e ≈ SIL 3, PL d ≈ SIL 2, PL c ≈ SIL 1 — but the mapping is via PFHD bands, and the two standards use different design methods. Pick one standard per project and stay in it. - **ISO 13855 sets the standoff distance.** `S = K·T + C`. The detection device must be far enough that the machine reaches a safe state before the body part reaches the hazard. Forget the C term (intrusion) and your light curtain is decorative. - **Light curtains and scanners are governed by IEC 61496.** Type 4 ESPE for the highest demands; resolution (14 mm finger, 30 mm hand, 40+ mm body) sets both the detection capability and the C term in the distance formula. - **Cobots don't remove the safety case — they change it.** Power & force limiting (ISO/TS 15066, now in ISO 10218) replaces separation with biomechanical force limits. Speed & separation monitoring replaces fences with safety-rated scanners. Both are *more* demanding to validate than a fence, not less. - **Mobile machines have their own Type C standard.** ISO 3691-4 (industrial trucks / driverless) and ANSI/RIA R15.08 (industrial mobile robots) govern AMRs — safety-rated speed, scanner fields that scale with speed, and stop performance under load. 
- **Validation is half the job.** ISO 13849-2 / IEC 62061 require you to *prove* the safety functions, including fault injection. A SISTEMA file and a stop-time measurement are evidence; an unverified calculation is a wish.

## Why functional safety exists

Start with the physics, because the physics is why the law exists. An industrial six-axis arm moving a 50 kg payload at 2,000 mm/s carries about 100 J of kinetic energy in the payload alone (½·m·v² = ½ × 50 kg × (2 m/s)²), plus far more in the arm's own moving mass. A human skull fractures at impact energies in the tens of joules. The robot does not slow down because a person walked in; it has no idea the person is there unless you gave it a way to know. That gap — between what the machine can do and what a human body can survive — is the hazard, and it does not negotiate.

> **Safety rule:** A machine is dangerous by default. Safety is a property you *add* through engineering. The absence of an accident yesterday is not evidence of safety today; it is evidence that nobody was in the wrong place yet.

The duty of care is both moral and legal. In the EU, the Machinery Regulation 2023/1230 (which replaces the Machinery Directive 2006/42/EC, with the Regulation applying from 20 January 2027) makes the machine builder legally responsible for placing a safe machine on the market — CE marking, a Declaration of Conformity, and a technical file that demonstrates conformity to the essential health and safety requirements. In the US, OSHA's general duty clause and the adoption of consensus standards (ANSI/RIA R15.06, NFPA 79) do the equivalent work. In both regimes the burden sits on whoever puts the machine into service.

Functional safety is the specific slice of this that concerns *active* protective measures — the ones that depend on a system correctly detecting a condition and reacting. A fixed fence is a safety measure but not a *functional* one: it works by being there, with no logic to fail.
A light curtain that trips a stop *is* functional safety: it has sensors, logic, and outputs, every one of which can fail, and the question becomes *how reliably does the protective function execute on demand?* That word — reliably, quantified — is the whole game. The historical arc matters. IEC 61508 (1998, revised 2010) was the foundational generic functional-safety standard, written largely from the process-industry tradition — it gave us SIL and the PFHD framework. The machinery world found 61508 heavy and abstract, so it produced two machine-friendly children: ISO 13849-1 (evolving from the old EN 954-1 Categories into a probabilistic PL framework) and IEC 62061 (a machinery-sector application of 61508 keeping SIL). Robots, being machines with extra ways to hurt you, got their own Type C standard, ISO 10218, sitting on top of all of it. ## The standards map (Type A/B/C) If you take one structural idea from this guide, take this one: standards are layered, and the layer closest to your machine wins. ISO classifies safety standards into three types: - **Type A (basic safety standards)** state general principles applicable to all machinery. There is essentially one: **ISO 12100** — *Safety of machinery — General principles for design — Risk assessment and risk reduction*. It is the constitution. - **Type B (generic safety standards)** deal with one safety aspect (B1) or one safeguard (B2) across many machine types. The functional-safety heavyweights — **ISO 13849-1/-2**, **IEC 62061**, **IEC 61508** — are Type B1. Guarding and device standards like **IEC 60204-1** (electrical equipment), **IEC 61496** (ESPE / light curtains & scanners), **ISO 13855** (positioning of safeguards), **ISO 13850** (E-stop), and **ISO 14119** (interlocks) are Type B. - **Type C (machine-specific safety standards)** address a particular machine or machine group. 
**ISO 10218-1** (robots) and **ISO 10218-2** (robot systems and integration), **ISO 3691-4** (driverless industrial trucks), and the ANSI/RIA **R15.06** / **R15.08** family are Type C for robotics. > **Safety rule:** Where a Type C standard deviates from a Type A or B standard, the Type C standard takes precedence for that machine. ISO 10218 beats ISO 13849 on any point where they conflict — but ISO 10218 *uses* ISO 13849 for the functional-safety maths, so in practice you apply both. The conceptual flow for robotics is: **ISO 12100** gives you the risk-assessment method and the risk-reduction hierarchy → **ISO 10218-1/-2** tells you which safety functions a robot cell needs and what Performance Level each requires → **ISO/TS 15066** (now largely absorbed into ISO 10218:2025) gives you the collaborative-operation detail and the biomechanical limits → **ISO 13849-1 / IEC 62061** give you the method to *engineer and prove* each function to its required PL/SIL → the device standards (**IEC 61496**, **ISO 13855**, **IEC 60204-1**, **IEC 61800-5-2**) tell you how the components and distances must behave. 
| Standard | Type | Scope | What you use it for |
|---|---|---|---|
| ISO 12100 | A | General principles, risk assessment | The master method: hazard ID, risk estimation, reduction hierarchy |
| ISO 13849-1 / -2 | B | Functional safety (PL) | Designing & validating safety functions by Performance Level |
| IEC 62061 | B | Functional safety (SIL) for machinery | Same job as 13849-1 but in SIL terms; complex/programmable systems |
| IEC 61508 | B | Generic functional safety (SIL) | The parent standard; used directly for novel safety devices/PLCs |
| IEC 60204-1 | B | Electrical equipment of machines | Stop categories 0/1/2, E-stop wiring, supply disconnection |
| IEC 61496-1/-2 | B | Electro-sensitive protective equipment | Light curtains, laser scanners — types, performance |
| ISO 13855 | B | Positioning of safeguards | Minimum distance `S = K·T + C` |
| ISO 13850 | B | Emergency stop | E-stop function design, reset, categories |
| ISO 14119 | B | Interlocking devices with guards | Guard interlock selection, defeat resistance |
| IEC 61800-5-2 | B | Adjustable-speed drives — safety | STO, SS1, SS2, SLS, SOS, SLP and other drive safety functions |
| ISO/TS 15066 | (TS) | Collaborative robots | Biomechanical force/pressure limits; folded into ISO 10218:2025 |
| ISO 10218-1:2025 | C | Industrial robots (the robot) | Requirements on the robot's built-in safety functions |
| ISO 10218-2:2025 | C | Robot systems & integration | The cell: guarding, layout, validation, collaborative ops |
| ISO 3691-4 | C | Driverless industrial trucks | AMR/AGV safety: speed, detection fields, stop performance |
| ANSI/RIA R15.06 | C | Industrial robots (US) | US adoption aligned with ISO 10218 |
| ANSI/RIA R15.08 | C | Industrial mobile robots (US) | US standard for AMRs |

## The risk assessment process

Everything quantitative downstream — the required PL, the choice of stop category, the standoff distance — is an *output* of the risk assessment.
Get this wrong and every number after it is wrong with confidence. ISO 12100 defines the loop: determine the limits of the machine → identify hazards → estimate risk → evaluate risk → reduce risk → repeat until acceptable. Run it for every life-cycle phase (installation, operation, cleaning, maintenance, decommissioning), not just normal production. Maintenance is where most people die, because that's when the guards are open and the energy isn't always isolated. **Hazard identification** for a robot cell is mechanical first — crushing, shearing, impact, entanglement, drawing-in at the robot, the end effector, the workpiece, and ancillary equipment (conveyors, positioners, presses). Then the rest: electrical, thermal (welding, hot parts), noise, radiation (laser, vision illuminators), and ergonomic. The end effector and the workpiece are part of the machine — a robot holding a knife is a different hazard than the same robot holding a foam pad. People forget this constantly. **Risk estimation** combines, for each hazard, the **severity of harm** (S) with the **probability of occurrence of that harm**, where probability is built from three factors: - **Frequency and duration of exposure** (F) — how often and how long is someone in the danger zone? - **Probability of occurrence of the hazardous event** (O) — how likely is the thing to go wrong? - **Possibility of avoidance** (A) — can the person get out of the way, given the speed and warning? 
ISO 13849-1 turns exactly these into a **risk graph** that outputs the required Performance Level, PLr:

```
Start
 ├─ S1 (slight)
 │   ├─ F1 (seldom)   ──► P1 (possible to avoid) ──► PL_r = a
 │   │                ──► P2                     ──► PL_r = b
 │   └─ F2 (frequent) ──► P1                     ──► PL_r = b
 │                    ──► P2                     ──► PL_r = c
 └─ S2 (serious/irreversible)
     ├─ F1 (seldom)   ──► P1                     ──► PL_r = c
     │                ──► P2                     ──► PL_r = d
     └─ F2 (frequent) ──► P1                     ──► PL_r = d
                      ──► P2                     ──► PL_r = e

S = severity    F = frequency/exposure    P = possibility of avoidance
```

Read it plainly: a serious, irreversible injury (S2) from a hazard you are exposed to frequently (F2) and cannot avoid (P2) demands **PLr = e** — the highest. Most robot protective stops and E-stops land at **PLr = d**; a few isolated, low-exposure functions sit at c.

**Risk reduction** then follows a strict, non-negotiable hierarchy — the three-step method:

1. **Inherently safe design** — eliminate the hazard or reduce it at the source. Lower the speed, lower the energy, round the edges, remove the pinch point, design out the trapped position. This is the cheapest and most reliable reduction because it removes the need for the function to *work*. A hazard that isn't there cannot fail to be guarded.
2. **Safeguarding and complementary protective measures** — guards, interlocks, light curtains, scanners, two-hand controls, E-stops. This is functional safety territory: you are now relying on systems that can fail, so you must quantify them.
3. **Information for use** — warnings, signs, training, PPE, safe working procedures. The weakest layer, because it relies on humans behaving. Never the primary measure for a serious hazard.

> **Safety rule:** You may not skip up the hierarchy. If a hazard can be designed out, designing it out is mandatory before reaching for a light curtain. Safeguarding is what you apply to the residual risk *after* inherently safe design, not instead of it.
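As a minimal sketch of how the risk graph in this section turns S, F, and P into a required Performance Level, the lookup below encodes the eight branches of the ISO 13849-1 graph as a table, including the S1/F2 branches. The names `RISK_GRAPH`, `required_pl`, and `meets` are hypothetical, chosen for illustration; this is a reading aid for the graph, not a substitute for a documented risk assessment.

```python
# PL_r lookup keyed by (severity, frequency, avoidance).
# S1 = slight, S2 = serious/irreversible; F1 = seldom, F2 = frequent;
# P1 = possible to avoid, P2 = scarcely possible.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}

PL_ORDER = "abcde"  # a is the lowest Performance Level, e the highest


def required_pl(severity: str, frequency: str, avoidance: str) -> str:
    """Return PL_r for one hazard; raise on an unknown parameter combination."""
    key = (severity, frequency, avoidance)
    if key not in RISK_GRAPH:
        raise ValueError(f"unknown risk parameters: {key}")
    return RISK_GRAPH[key]


def meets(achieved: str, required: str) -> bool:
    """True when the achieved PL is at least the required PL_r."""
    return PL_ORDER.index(achieved) >= PL_ORDER.index(required)


# Example: serious injury, frequent exposure, scarcely avoidable.
print(required_pl("S2", "F2", "P2"))   # prints: e
print(meets("d", required_pl("S2", "F2", "P2")))  # prints: False
```

Keeping the graph as data rather than nested conditionals makes it easy to review the table line by line against the standard.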

The output of the assessment is a list of required safety functions, each with a PLr (or SILCL), a stop category, and the reaction time and distance constraints they must satisfy. That list is the specification for everything that follows.

## Safety functions: E-stop, protective stop, STO/SS1/SS2/SLS/SOS

A *safety function* is a defined function whose failure increases risk — e.g. "when the light curtain is interrupted, the robot performs a Category 1 stop within 0.5 s." It has inputs (sensors), logic (safety controller), and outputs (actuators/drives), and the whole chain carries the PL/SIL.

### Stop categories (IEC 60204-1)

The single most misunderstood concept in the field. Stop categories describe how *power* is handled, not how fast the machine stops.

| Category | Behaviour | Power | Typical use |
|---|---|---|---|
| **Category 0** | Uncontrolled stop — immediate removal of power to the actuators | Removed immediately | E-stop where coasting is acceptable/safer; high-risk where you want power gone now |
| **Category 1** | Controlled stop — actuators powered to brake, *then* power removed once stopped | Removed after stop | Most servo machines: brake under control, then drop power. Cleanest for high inertia |
| **Category 2** | Controlled stop with power maintained (machine stays energized, holds position) | Maintained | Operational stops, not for emergency use; SOS-style standstill monitoring |

> **Safety rule:** An emergency stop must be Category 0 or Category 1 only (IEC 60204-1 / ISO 13850). Category 2 is *never* an E-stop, because it leaves the machine powered. A safety-rated monitored stop in a cobot SRMS mode is a Category 2 stop — it is a *protective* stop, not an *emergency* stop, and the two are not interchangeable.

The distinction between **emergency stop** and **protective (safeguarding) stop** matters legally and functionally:

- **Emergency stop** is a *complementary* measure — the manual, last-resort red mushroom.
It is not a primary safeguard and you cannot count on a human to press it in time. It exists for the case where everything else failed. Requires manual reset. - **Protective stop** (also "safeguarded stop") is the *automatic* stop triggered by a safeguard — light curtain broken, gate opened, scanner field violated. This is your workhorse safety function. It may auto-resume (SRMS) or require reset depending on the mode. ### Drive-integrated safety functions (IEC 61800-5-2) The old way to stop a servo was to drop a contactor between the drive and the motor — crude, slow to reset, and hard on the hardware. Modern servo drives implement *safe motion functions* inside the drive electronics, certified to IEC 61800-5-2, so you stop or constrain motion without breaking the power path. These are the building blocks of every modern robot safety architecture: - **STO — Safe Torque Off.** The drive stops delivering torque-producing energy to the motor. The motor coasts (or is held by a mechanical brake). STO is the foundation of a **Category 0** stop. It does *not* by itself decelerate the load — a vertical axis will drop unless a brake holds it. - **SS1 — Safe Stop 1.** Commanded deceleration along a ramp, then STO once standstill (or a time limit) is reached. This is the **Category 1** stop. Best choice for high-inertia robot axes — you brake under control, then remove torque. - **SS2 — Safe Stop 2.** Commanded deceleration to standstill, then transition to **SOS** with power maintained. This is the **Category 2** stop. - **SOS — Safe Operating Stop.** The drive holds the motor at standstill and *monitors* that it stays there, reacting if the position deviates beyond a safe window — without removing power. This is what lets a cobot hold position safely while a human loads a part. - **SLS — Safely Limited Speed.** The drive monitors that speed stays below a safe limit and reacts (typically SS1/SS2) if exceeded. 
The backbone of reduced-speed teach modes and speed-&-separation monitoring. - **SLP — Safely Limited Position** (safe zones / soft axis limits), **SDI — Safe Direction**, **SLA — Safely Limited Acceleration**, **SBC — Safe Brake Control**, **SBT — Safe Brake Test** round out the toolkit. For the control-loop side of how these execute deterministically, see [real-time control systems](/posts/real-time-control-systems-ultimate-guide/) and, for the drive internals, [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/). > **Safety rule:** STO removes torque; it does not stop motion. On any axis with stored energy — gravity, springs, momentum — STO without a safe brake (SBC) or a controlled deceleration (SS1) is a dropping load waiting to happen. Choose SS1 for high-inertia axes and verify the brake with SBT. A robot's typical safety-function set: an E-stop (Cat 1 via SS1, PL d/SIL 2), a protective stop from the cell's safeguards (Cat 1 or 2, PL d), safely-limited speed for teach/manual mode (SLS at 250 mm/s TCP, a hard ISO 10218 limit for manual reduced speed), safe zones (SLP) to keep the arm out of a neighbouring cell, and — on a cobot — safe force/power limiting. ## Guarding & safeguards Safeguards are the physical and sensing layer that implements the protective stop. The choice between them is dictated by whether the operator needs *access* and how often. **Fixed guards** — bolted or welded enclosures, removable only with a tool. No logic, no failure mode, the most reliable thing you can install. Use them wherever routine access isn't needed. A fixed perimeter fence is still the cheapest, most robust safeguard for a fast industrial arm, and the engineering snobbery against fences is misplaced — a fence that's always there beats a scanner that might be misaligned. **Interlocked movable guards** (ISO 14119) — gates and doors whose opening triggers a stop. 
The interlock device (the bit that detects the guard's position) must itself be selected for the required PL and for *defeat resistance* — coded magnetic or RFID interlocks resist the classic "tape a spare actuator to the frame" defeat that plagues simple mechanical switches. Add guard locking (power to unlock) where the machine takes time to reach a safe state, so the gate cannot open until the robot has actually stopped. **Electro-sensitive protective equipment (ESPE)** under IEC 61496 — the non-contact safeguards: - **Light curtains** (IEC 61496-1/-2): arrays of infrared beams forming a detection plane. Specified by **resolution** — 14 mm (finger detection), 30 mm (hand), 40+ mm (body/access). Resolution sets the detection capability *and* feeds the C term in the ISO 13855 distance formula. **Type 4** is the highest performance/integrity class (suitable up to PL e / SIL 3); Type 2 is for lower-demand applications. Add muting (for material to pass while people can't) and blanking carefully — both are classic ways to defeat a curtain. - **Safety laser scanners** (IEC 61496-3, which covers active opto-electronic protective devices responsive to diffuse reflection — AOPDDR): a rotating beam sweeps a 2D plane, defining warning and protective fields you can shape to the cell. The workhorse for floor-level access detection and for AMRs. Resolution is coarser (typically 30–70 mm), so the C term is larger. - **3D / vision-based protective devices**: time-of-flight and stereo systems creating safety-rated volumes. The enabling tech for speed-&-separation monitoring around cobots. Newer, more expensive, and more demanding to validate. **Two-hand control devices** (ISO 13851 / IEC 60204-1) — both hands occupied on widely-spaced buttons that must be pressed within ~0.5 s of each other and held, so the operator's hands cannot be in the hazard during the dangerous motion. Type III C is the high-integrity form. 
Protects only the operator pressing the buttons — not a colleague reaching in.

**Safety mats and edges** (ISO 13856) — pressure-sensitive floor mats and trip edges that detect presence by weight or contact. Robust and intuitive, but bulky and prone to nuisance trips; largely displaced by scanners for new cells.

> **Safety rule:** Every non-contact safeguard has a way to be defeated, and operators *will* find it if the machine is annoying to use. The most common cause of a guarded machine becoming unsafe is not component failure — it is a frustrated operator who muted, blanked, taped, or bypassed the safeguard to keep production moving. Design the safeguard so the easy path is the safe path.

For where the safeguards live in the broader control architecture — the safety PLC, the safe I/O, the network — see [industrial automation: PLC, SCADA & fieldbus](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/).

## Performance Level (ISO 13849-1)

This is the quantitative heart of machinery functional safety, and the part most people fudge. ISO 13849-1 assigns each safety function a **Performance Level (PL)** from **a** (lowest) to **e** (highest), defined by the average probability of a dangerous failure per hour:

| PL | PFHD (per hour) | Rough equivalent |
|---|---|---|
| a | ≥ 10⁻⁵ to < 10⁻⁴ | Low risk reduction (below SIL 1) |
| b | ≥ 3×10⁻⁶ to < 10⁻⁵ | ≈ SIL 1 (lower part) |
| c | ≥ 10⁻⁶ to < 3×10⁻⁶ | ≈ SIL 1 |
| d | ≥ 10⁻⁷ to < 10⁻⁶ | ≈ SIL 2 |
| e | ≥ 10⁻⁸ to < 10⁻⁷ | ≈ SIL 3 |

The achieved PL of a safety function is **not** a property you buy on a component. It emerges from the *architecture* of the function — the whole chain from sensor to logic to output — characterised by five parameters:

- **Category (B, 1, 2, 3, 4)** — the structural architecture and its behaviour under fault. This is the dominant lever.
- **MTTFD** — Mean Time To dangerous Failure of each channel, capped and binned: *Low* (3 to <10 years), *Medium* (10 to <30 years), *High* (30 to 100 years).
Built up from component B10D values and duty cycles. - **DC (Diagnostic Coverage)** — the fraction of dangerous failures the diagnostics detect, binned: *None* (<60%), *Low* (60 to <90%), *Medium* (90 to <99%), *High* (≥99%). DCavg is the averaged figure across the function. - **CCF (Common Cause Failure)** — for redundant architectures, the score that confirms your two channels won't fail together from one cause (shared power supply, shared connector, overtemperature). ISO 13849-1 requires a CCF score ≥ 65 points from its checklist. - **Systematic failures** — design and implementation faults, controlled by measures (not a number you compute). The five **Categories** describe how the architecture behaves: - **Category B** — basic. A single channel; a fault can cause loss of the safety function. PL a–b only. - **Category 1** — single channel using *well-tried* components and principles. Higher reliability than B but still single-fault-vulnerable. PL b–c. - **Category 2** — single channel *with periodic testing* by the logic. A fault is detected at the next test, not instantly, so there's a window of vulnerability. The test rate must be ≥ 100× the demand rate. PL up to d. - **Category 3** — redundant, dual-channel, so a *single* fault does not lose the safety function and (where reasonable) is detected. Single-fault tolerant. PL up to e. - **Category 4** — redundant with *high* diagnostic coverage, so a single fault is detected and an accumulation of faults still doesn't lose the function. The gold standard. PL e. The PL is then read off the ISO 13849-1 bar chart (Annex K / Figure 5) from Category, DCavg, and MTTFD. In practice everyone uses the free **SISTEMA** tool from the German IFA, which holds the component library and does the maths. ### A worked example Specify a robot protective stop: light curtain → safety relay/PLC → two contactors (or STO via SS1) cutting motion. Risk graph gave **PLr = d**. 
```
Architecture:  Category 3 (dual channel, single-fault tolerant)
  Channel 1:   Type 4 light curtain (B10d = 2.0e6 ops)
  Channel 2:   identical, diverse routing
  Logic:       dual-channel safety controller (PFHd ≈ 1e-9 /h, certified PL e)
  Output:      redundant STO inputs on the servo drive (PFHd ≈ 1e-9 /h)

MTTFd per channel:  capped at HIGH (30–100 years)
DCavg:              MEDIUM–HIGH (cross-monitoring + drive STO diagnostics)
CCF:                score = 70 points (≥ 65 required → pass)

Category 3 + DCavg medium + MTTFd high → PL e achieved
PFHd (system, series sum) ≈ 3e-8 /h → far below the PL d limit (< 1e-6 /h), inside the PL e band
```

PLr was d; the architecture achieved e, so the function passes with margin. Note the maths is a *series* sum of the subsystem PFHD values — sensor + logic + output add up, and the weakest link dominates. A PL e controller wired to a single-channel Category B sensor is a Category B function. **The chain is only as good as its worst subsystem.**

> **Safety rule:** You cannot specify your way to PL e by buying a PL e controller. PL is an end-to-end property of sensor + logic + actuator. Compute the whole chain, every time, and let the lowest subsystem set the ceiling.

## SIL (IEC 62061 / IEC 61508) and PL↔SIL mapping

IEC 62061 does the same job as ISO 13849-1 but in the language of **Safety Integrity Level (SIL)**, inherited from IEC 61508. For high-demand / continuous-mode operation (which is what robot safety functions are), SIL is defined by the same PFHD bands:

| SIL | PFHD (per hour, high-demand mode) | ≈ PL |
|---|---|---|
| SIL 1 | ≥ 10⁻⁶ to < 10⁻⁵ | PL c (and part of b) |
| SIL 2 | ≥ 10⁻⁷ to < 10⁻⁶ | PL d |
| SIL 3 | ≥ 10⁻⁸ to < 10⁻⁷ | PL e |
| SIL 4 | ≥ 10⁻⁹ to < 10⁻⁸ | (not used in machinery) |

SIL 4 belongs to the process and rail worlds; machinery functions top out at SIL 3 (= PL e). IEC 62061 reaches its SIL via a **SIL Claim Limit (SILCL)** per subsystem, built from architectural constraints (the *hardware fault tolerance*, HFT, and the *safe failure fraction*, SFF) plus the PFHD.
It is generally the better fit for complex, programmable, software-heavy safety systems; ISO 13849-1 is the better fit for conventional electromechanical and simpler architectures. Both standards are listed as harmonised / valid for the Machinery Regulation, and as of the 2021/2024 revisions each now explicitly permits using the other's results — you can mix subsystems characterised in PL with subsystems characterised in SIL, as long as you convert through PFHD.

Here is the honest mapping, the table everyone wants:

| Performance Level (ISO 13849-1) | PFHD band (/h) | SIL (IEC 62061/61508) |
|---|---|---|
| PL a | 10⁻⁵ to <10⁻⁴ | — (below SIL 1) |
| PL b | 3×10⁻⁶ to <10⁻⁵ | SIL 1 (lower part) |
| PL c | 10⁻⁶ to <3×10⁻⁶ | SIL 1 |
| PL d | 10⁻⁷ to <10⁻⁶ | SIL 2 |
| PL e | 10⁻⁸ to <10⁻⁷ | SIL 3 |

> **Safety rule:** PL and SIL map through PFHD, but they are *different design methods* with different architecture rules. Choose one standard per project and stay in it. Quoting "PL d / SIL 2" on a datasheet is fine for components; running half your analysis in one method and half in the other is how mistakes hide.

The practical guidance: most machine builders default to ISO 13849-1 because SISTEMA and the Category model are intuitive and the component data is everywhere. Reach for IEC 62061 when the safety logic is genuinely complex — large safety PLC programs, lots of interacting functions, mixed technologies — where 62061's more rigorous treatment of systematic and software failures earns its keep.

## Safety PLCs, safe I/O & safety fieldbuses

The logic layer of a modern robot cell is a **safety PLC** (or the safety processor inside the robot controller), with **safe I/O** modules, talking over a **safety fieldbus**. All of it is certified hardware — you do not build PL e logic out of a standard PLC.

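
Looking back at the PL↔SIL mapping table, the conversion is nothing more than two band lookups on PFHD plus the series sum from the worked example. A minimal sketch using the band edges quoted in this guide; `classify_pfhd` is a hypothetical helper, the sensor figure in the example is an illustrative assumption, and clamping anything below 10⁻⁸ /h to PL e / SIL 3 is also an assumption, made because machinery functions top out there. A PFHD number alone does not satisfy either standard's architectural rules.

```python
# Band edges (per hour, high-demand mode) as quoted in the PL and SIL tables.
# Each entry is (lower bound inclusive, upper bound exclusive, label).
PL_BANDS = [
    (1e-5, 1e-4, "a"),
    (3e-6, 1e-5, "b"),
    (1e-6, 3e-6, "c"),
    (1e-7, 1e-6, "d"),
    (1e-8, 1e-7, "e"),
]
SIL_BANDS = [
    (1e-6, 1e-5, 1),
    (1e-7, 1e-6, 2),
    (1e-8, 1e-7, 3),
]


def _lookup(pfhd, bands):
    for low, high, label in bands:
        if low <= pfhd < high:
            return label
    return None  # outside every band (too unreliable to claim anything)


def classify_pfhd(pfhd):
    """Return (PL, SIL) for a PFHD value; None where no band applies."""
    if pfhd <= 0:
        raise ValueError("PFHD must be positive")
    if pfhd < 1e-8:
        # Assumption: better than the PL e band still only claims PL e / SIL 3.
        return "e", 3
    return _lookup(pfhd, PL_BANDS), _lookup(pfhd, SIL_BANDS)


# Series sum of subsystem PFHD values: sensor + logic + output.
# The sensor value is hypothetical; the other two echo the worked example.
subsystems = {"sensor": 2.8e-8, "logic": 1e-9, "output": 1e-9}
total = sum(subsystems.values())
print(total, classify_pfhd(total))
```

Summing before classifying is the point: each subsystem can sit comfortably in the PL e band while the chain as a whole drifts toward the PL d boundary.
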
A safety PLC differs from a standard PLC in that the whole device — dual processors running in lockstep with cross-checking, self-test on every scan, certified safety function blocks — is rated to a PL/SIL (typically PL e / SIL 3). You program it in a restricted, certified subset (often per IEC 61131-3 with a safety-qualified compiler and locked-down function blocks). The safety program is separate from, and protected against, the standard control program. **Safe I/O** modules apply the same rigour to the edges: dual input channels with discrepancy monitoring (so a stuck or shorted contact is detected), test pulses on outputs to verify they can actually de-energize, and OSSD (output signal switching device) outputs that pulse-test continuously. **Safety fieldbuses** carry safety data over standard industrial networks using the **black channel** principle: the safety protocol wraps each safety message in its own integrity layer — sequence numbers, time stamps/watchdogs, a safety CRC, and a unique connection ID — so the *transport* network underneath can be ordinary, uncertified, even shared with non-safety traffic. The safety layer detects corruption, repetition, loss, delay, insertion, and misrouting of messages on its own. The three dominant flavours: - **PROFIsafe** — the safety layer over PROFINET (and PROFIBUS). Certified to SIL 3 / PL e. - **CIP Safety** — the safety layer over EtherNet/IP (and DeviceNet). SIL 3 / PL e. The Rockwell / ODVA ecosystem. - **FSoE (Fail Safe over EtherCAT / Safety over EtherCAT)** — the safety layer over EtherCAT. SIL 3 / PL e. Common in motion-centric and robot systems for its low latency. > **Safety rule:** The black channel means the network's reliability is irrelevant to the safety integrity — the safety protocol detects every relevant communication fault itself. This is why you can run safety and standard traffic on one cable. 
But the *safety endpoints* (the F-Host and F-Devices) still carry the full PL/SIL, and the network's worst-case latency still counts against your stop-time budget.

That last point bites people: the fieldbus adds latency to the safety function's reaction time, and that latency goes straight into the ISO 13855 distance calculation below. A 30 ms scanner response plus a 20 ms network round-trip plus a 200 ms stop time is a 250 ms total — and at 1.6 m/s walking speed that's 0.4 m of travel you must account for.

For more on how these networks behave and their determinism, see [industrial automation: PLC, SCADA & fieldbus](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/).

## Minimum distance & guard placement (ISO 13855)

A light curtain or scanner is only as good as its *placement*. The whole point is that the machine reaches a safe state before the body part reaches the hazard. ISO 13855 gives the formula for the minimum standoff distance:

```
S = (K × T) + C

where
  S = minimum distance (mm) from the detection zone to the hazard
  K = approach speed of the body part (mm/s)
        — 2000 mm/s for hand/arm approach (perpendicular, normal case)
        — 1600 mm/s often used for walking/whole-body approach
  T = total system stopping time (s)
        T = t1 (detection + safety system response) + t2 (machine stop time)
  C = intrusion distance (mm) — how far a body part can reach
      through/past the field before detection
```

The **C term** is where light curtains and scanners diverge sharply, because it depends on the detection capability (resolution) of the device:

- For a light curtain detecting fingers/hands (resolution d ≤ 40 mm), the perpendicular intrusion term is `C = 8 × (d − 14) mm`, with C not less than 0. A 14 mm finger curtain gives C = 0; a 30 mm curtain gives C = 128 mm.
- For body-detection devices with resolution > 40 mm (and for floor-mounted scanners), C is larger — a flat 850 mm for reaching over a horizontal scanner field, and additional height-dependent terms for scanners detecting an approaching person standing up.

A worked perpendicular hand-approach case, vertical light curtain, d = 14 mm:

```
K = 2000 mm/s   (hand/arm approach)
T = t1 + t2 = 0.030 s (ESPE response) + 0.250 s (robot SS1 stop) = 0.280 s
C = 8 × (14 − 14) = 0 mm

S = (2000 × 0.280) + 0 = 560 mm

→ The light curtain plane must sit at least 560 mm from the nearest hazard.
```

Now make the curtain coarser (30 mm hand resolution) and watch the distance jump:

```
C = 8 × (30 − 14) = 128 mm
S = (2000 × 0.280) + 128 = 688 mm
```

> **Safety rule:** If you measured the machine's stop time *once* at commissioning and never again, your distance is fiction. Stop time degrades as brakes wear, hydraulics age, and loads change. ISO 13855 standoff is only valid against the *current* stopping performance — measure it periodically with a stop-time analyzer and re-derive S.

Two more traps. First, the stopping time T must be the *worst case* — heaviest load, full speed, fastest approach geometry. Second, you must prevent reaching *over, under, or around* the field; the perpendicular formula assumes straight-on approach, and a low light curtain you can step over or reach above is worthless. ISO 13855 has additional terms for angled and parallel approach — use them.

## Cobots & collaborative safety vs traditional guarding

Collaborative operation does not delete the safety case. It *replaces separation in space (a fence) with separation in time, or with biomechanical force limits* — and both replacements are harder to validate than a fence, not easier.

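
As a baseline for the fenced or curtained cell that collaboration is being compared against, the ISO 13855 arithmetic from the previous section fits in a few lines and is worth keeping as a script, since it has to be re-run whenever the measured stop time changes. A minimal sketch, assuming only the perpendicular-approach form `S = K × T + C` as written in this guide; the function names are illustrative, and the standard's further provisions (angled or parallel approach, reach-over, minimum values) are not modelled here:

```python
def intrusion_c_mm(resolution_mm: float) -> float:
    """Intrusion term C = 8 * (d - 14) mm, floored at 0, for resolution d <= 40 mm."""
    if resolution_mm > 40:
        raise ValueError("this formula only covers detection resolution d <= 40 mm")
    return max(0.0, 8.0 * (resolution_mm - 14.0))


def min_distance_mm(k_mm_s: float, t1_s: float, t2_s: float, c_mm: float) -> float:
    """Minimum distance S = K * T + C, with T = t1 (response) + t2 (machine stop)."""
    return k_mm_s * (t1_s + t2_s) + c_mm


# The two worked cases: 14 mm and 30 mm light curtains, 30 ms response, 250 ms stop.
s_finger = min_distance_mm(2000, 0.030, 0.250, intrusion_c_mm(14))
s_hand = min_distance_mm(2000, 0.030, 0.250, intrusion_c_mm(30))
print(round(s_finger), round(s_hand))  # 560 688

# Fieldbus latency counts as part of t1: 30 ms scanner + 20 ms network, 200 ms stop,
# at 1600 mm/s walking speed, before any C term is added.
travel = min_distance_mm(1600, 0.030 + 0.020, 0.200, 0.0)
print(round(travel))  # 400
```

Feeding measured worst-case values for `t1_s` and `t2_s` into this, rather than datasheet figures, is what keeps the result honest.
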
The four collaboration modes (defined in ISO 10218-2, detailed in ISO/TS 15066 and now in ISO 10218:2025):

| Mode | Mechanism | Human–robot contact | Key safety function | Standard limit |
|---|---|---|---|---|
| **Safety-rated monitored stop (SRMS)** | Robot stationary (Cat 2 / SOS, power on) while human is present | Only when robot stopped | SOS + presence detection | Robot motion = 0 while human in workspace |
| **Hand guiding (HG)** | Operator moves the robot via a safety-rated guiding device + enabling switch | Yes — via the handle | SLS + enabling device + emergency stop | Safety-rated reduced speed (e.g. 250 mm/s) |
| **Speed & separation monitoring (SSM)** | Robot speed scales with measured distance to human; stops if too close | No — separation maintained | SLS + safety-rated distance sensing | Protective separation distance maintained continuously |
| **Power & force limiting (PFL)** | Contact forces/pressures held below biomechanical limits | Yes — intended or incidental | Safe force/torque monitoring | ISO/TS 15066 force & pressure tables, 29 body regions |

The PFL force limits are the part that makes collaboration *quantitative*. ISO/TS 15066 publishes maximum permissible quasi-static (clamping) and transient (free-impact) forces and pressures for 29 body regions, with the skull/forehead among the more restrictive at roughly 130 N quasi-static. You validate against them *physically*, with a calibrated force/pressure gauge, at the actual speed and with the actual end effector and workpiece. A spreadsheet does not close a PFL safety case; a force measurement does.

The SSM separation distance is essentially the ISO 13855 logic generalised to a moving robot: the protective separation distance must account for the robot's stopping distance, the human's approach speed, the sensor latency, *and* the robot's own contribution to closing speed. It scales dynamically with the robot's velocity.

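
The contributions listed for SSM can be added up in a short sketch. This assumes the simplified constant-speed form of the protective separation distance usually attributed to ISO/TS 15066 (human travel, robot travel during the reaction time, robot stopping distance, plus intrusion and measurement-uncertainty allowances). All parameter names and the example numbers are illustrative assumptions, not values from a datasheet or the standard.

```python
def protective_separation_mm(
    v_human_mm_s: float,   # human approach speed toward the robot
    v_robot_mm_s: float,   # robot speed toward the human
    t_react_s: float,      # sensor + safety system reaction time
    t_stop_s: float,       # robot stopping time after the stop is issued
    s_stop_mm: float,      # distance the robot travels while stopping
    c_mm: float = 0.0,     # intrusion distance of the sensing device
    z_sensor_mm: float = 0.0,  # position uncertainty of the human measurement
    z_robot_mm: float = 0.0,   # position uncertainty of the robot
) -> float:
    """Constant-speed estimate of the protective separation distance."""
    # The human keeps approaching until the robot has actually come to rest.
    s_human = v_human_mm_s * (t_react_s + t_stop_s)
    # The robot keeps closing at full speed until the stop command takes effect.
    s_robot = v_robot_mm_s * t_react_s
    return s_human + s_robot + s_stop_mm + c_mm + z_sensor_mm + z_robot_mm


# Hypothetical fast and slow operating points for the same cell.
fast = protective_separation_mm(1600, 500, 0.05, 0.20, 60)
slow = protective_separation_mm(1600, 250, 0.05, 0.10, 15)
print(fast, slow)
```

The slow case needs a much shorter distance, which is the dynamic scaling in practice: a controller that cannot hold the fast-case distance can instead drop to a speed whose distance it can hold.
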
> **Safety rule:** "Collaborative" describes an application validated by risk assessment, not a robot you bought. The end effector, the workpiece, and the actual run speed all leave the collaborative envelope independently — a force-limited arm holding a knife, a hot part, or a sharp blank is not a collaborative application. Re-validate whenever any of them changes. The honest deployment reality: a large fraction of "cobots" in production run *fenced, at full speed*, used purely as cheap, easy-to-program light industrial arms — a completely legitimate choice that is simply not collaborative operation. The full treatment, including the biomechanical tables and the joint hardware that makes contact sensing possible, is in [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/); the conventional six-axis arm and its guarding live in [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/). ## AMR / mobile machine safety (ISO 3691-4, R15.08) A robot that moves through shared floor space is a different animal. There is no fence to stand behind because the hazard zone travels with the machine. Mobile machines get their own Type C standards: **ISO 3691-4** (driverless industrial trucks and their systems) in the international regime, and **ANSI/RIA R15.08** (industrial mobile robots) in the US — the latter created precisely because the existing R15.06 (fixed robots) and the truck standards didn't cleanly cover AMRs carrying manipulators. The core safety function for an AMR is **safety-rated speed and obstacle detection** via safety laser scanners (IEC 61496-3) whose protective fields **scale with speed**: the faster the vehicle, the longer its stopping distance, so the protective field must extend further ahead. A well-designed AMR switches field sets dynamically — a long forward field at speed, narrowing on turns, a tight field at creep speed near a docking station. 
The scanner detects a person or obstacle and commands a safety-rated stop with a stopping distance the field was sized to cover, accounting for the *loaded* mass (a laden AMR stops slower than an empty one). Other mobile-specific functions: safety-rated speed limiting (SLS analogue), tip-over and load stability, safe steering/braking, and — where the AMR carries a manipulator — the full ISO 10218 arm safety case *on top of* the mobile base case, because the arm can reach a person the base scanner doesn't see. That composite (mobile base + manipulator) is exactly what R15.08 was written to address. > **Safety rule:** An AMR's safe stopping distance is a function of speed *and* payload *and* floor friction. The scanner protective field must be sized for the worst-case combination, and the field set must change with commanded speed. A fixed field sized for empty-and-slow is unsafe the moment the vehicle is loaded-and-fast. The detailed treatment of AMR/AGV navigation, drivetrains, and safety architecture is in [mobile robots: AMR & AGV](/posts/mobile-robots-amr-agv-ultimate-guide/). ## Validation, documentation & CE compliance Designing the safety functions is half the job. *Proving* they work — and recording the proof — is the other half, and it is the half that separates a real safety system from a hopeful one. **Validation** (ISO 13849-2 / IEC 62061) is the systematic confirmation, by analysis and *testing*, that every safety function performs as specified and reaches its required PL/SIL. It is not a code review and it is not a calculation. It includes: - **Verification of the PL/SIL calculation** — the SISTEMA file or equivalent, with the real component data, MTTFD, DC, CCF, confirming achieved PL ≥ PLr for every function. - **Functional testing** — trip each safeguard and confirm the correct stop category and reaction. Open the gate, break the curtain, violate the scanner field, press every E-stop. 
- **Fault injection** — this is the part people skip and shouldn't. For Category 3/4 functions you must demonstrate single-fault behaviour: short a channel, disconnect a wire, force a contact, and confirm the function still performs (Cat 3) and/or the fault is detected (Cat 3/4). If a single fault silently defeats your "redundant" function, it was never Category 3. - **Stop-time measurement** — measure the actual total stopping time with a stop-time analyzer, under worst-case load and speed, and confirm the ISO 13855 standoff distances are still valid against it. - **Environmental and EMC** — confirm the safety functions hold up under the temperature, vibration, and electrical noise of the real installation. **Documentation** is the technical file: the risk assessment, the list of safety functions with their PLr/SIL targets and achieved values, the validation records, the wiring and circuit diagrams of the safety system, the stop-time measurements, and the component certificates. This is your evidence, and in the event of an incident it is what an investigator (and a court) will read. **CE compliance** under the EU Machinery Regulation 2023/1230 (applicable from 20 January 2027, replacing Directive 2006/42/EC): the integrator of the robot *cell* is the manufacturer of the machine, responsible for the assembly's conformity even though the robot arm arrived with its own partial documentation (a Declaration of Incorporation for partly completed machinery). You assess the whole cell against the essential health and safety requirements, compile the technical file, issue the Declaration of Conformity, and affix the CE mark. Some machinery in the Regulation's higher-risk categories requires involvement of a Notified Body — check whether your configuration falls in scope. > **Safety rule:** The CE mark certifies the *cell as integrated and installed*, not the robot you unboxed. 
The robot vendor's documentation gets you to a partly completed machine; the integrator owns the conformity of the finished cell — including every modification made after commissioning. Change the gripper or move a scanner, and the conformity argument must be revisited. In the US the equivalents are NFPA 79 (electrical), ANSI/RIA R15.06 for the robot, and the risk-assessment discipline of ANSI B11. Different paperwork, same engineering. The standards diverge in administrative detail; the physics of a 50 kg payload at 2 m/s does not care which continent you are on. ## Frequently asked questions **Is a CE-marked robot safe to use out of the box?** No. CE on the robot covers the robot as a component (often as partly completed machinery with a Declaration of Incorporation). The *cell* — robot plus end effector, workpiece, guarding, and layout — is a new machine that the integrator must assess and CE-mark in its own right. The robot's CE mark is necessary, not sufficient. **What's the difference between an emergency stop and a protective stop?** An emergency stop is a manual, last-resort complementary measure (the red mushroom), Category 0 or 1, requiring manual reset — you cannot rely on a human to press it in time, so it is never a primary safeguard. A protective (safeguarded) stop is the automatic stop triggered by a safeguard (curtain, gate, scanner); it is the workhorse safety function and may auto-resume or require reset depending on the mode. **Do stop categories tell me how fast the machine stops?** No — they describe how *power* is handled. Category 0 removes power immediately (uncontrolled stop, motor coasts). Category 1 brakes under power then removes it (controlled stop, then power off). Category 2 brakes and *keeps* power (controlled stop, machine stays energized). Stopping *time* is a separate measured quantity that feeds the ISO 13855 distance. **Is STO the same as an emergency stop?** No. 
STO (Safe Torque Off, IEC 61800-5-2) is the drive function that removes torque-producing energy — it is the *mechanism* underneath a Category 0 stop. STO does not decelerate a load; on a vertical or high-inertia axis you need SS1 (controlled ramp then STO) or a safe brake, or the load drops/coasts dangerously. **How do I choose between ISO 13849 (PL) and IEC 62061 (SIL)?** Both are valid for machinery and now interoperate via PFHD. ISO 13849-1 (PL, with SISTEMA and the Category model) is the intuitive default for conventional and simpler architectures — most machine builders use it. IEC 62061 (SIL) is the better fit for complex, programmable, software-heavy safety systems where its rigorous treatment of systematic and software faults earns its keep. Pick one per project and stay in it. **What PL does a robot protective stop usually need?** It comes out of the risk assessment, but most robot protective stops and E-stops land at PLr = d (≈ SIL 2), and high-exposure, unavoidable, serious-injury hazards push to PLr = e (≈ SIL 3). Low-exposure functions can be PL c. Never assume — derive it from the ISO 13849-1 risk graph. **Why can't I just buy a PL e safety relay and be done?** Because PL is an end-to-end property of the whole function — sensor + logic + actuator in series. A PL e controller wired to a single-channel Category B sensor is a Category B function. The achieved PL is set by the *weakest subsystem* and the architecture (Category, MTTFD, DC, CCF), not by any single component's rating. **How far does a light curtain need to be from the hazard?** Use ISO 13855: `S = K·T + C`. With K = 2000 mm/s (hand approach), a total stop time T of, say, 0.28 s, and a 14 mm-resolution curtain (C = 0), S ≈ 560 mm. Coarser resolution increases C and pushes the curtain further back. Re-derive whenever stop time changes — and measure stop time periodically. **Does a safety fieldbus need a special, ultra-reliable network?** No — that's the point of the black channel. 
The safety protocol (PROFIsafe, CIP Safety, FSoE) wraps each message in its own integrity layer (sequence number, watchdog, safety CRC, connection ID) and detects corruption, loss, delay, repetition, and misrouting itself, so it runs over ordinary networks shared with standard traffic. But the network's worst-case latency still counts against your stop-time budget. **Are collaborative robots inherently safer than fenced robots?** No — they shift the safety case rather than remove it. PFL replaces separation with biomechanical force limits you must validate physically; SSM replaces fences with safety-rated scanners. Both are harder to validate than a fence. The end effector, workpiece, and run speed each leave the collaborative envelope independently. Many "cobots" run fenced at full speed in practice. **What's different about AMR safety?** The hazard zone travels with the machine, so there's no fence. ISO 3691-4 (and R15.08 in the US) require safety-rated obstacle detection via scanners whose protective fields scale with speed and account for loaded stopping distance, plus tip-over/stability and safe braking. An AMR carrying a manipulator stacks the ISO 10218 arm case on top of the mobile base case. **What does validation actually require — is the calculation enough?** No. ISO 13849-2 / IEC 62061 require functional testing and *fault injection*: trip every safeguard, confirm the correct stop, and for Category 3/4 prove single-fault behaviour by injecting faults (short a channel, pull a wire) and confirming the function still performs and/or detects the fault. Plus a measured stop time. An unverified calculation is a wish, not validation. ## Changelog - **2026-06-14** — Initial publication. 
--- # Robot Actuators: Electric, Hydraulic & Pneumatic — The Ultimate Guide URL: https://blog.robo2u.com/posts/robot-actuators-ultimate-guide/ Published: 2026-06-13 Updated: 2026-06-20 Tags: actuators, electric-actuators, hydraulic-actuators, pneumatic-actuators, series-elastic, linear-actuators, robotics-hardware, power-density, guide Reading time: 34 min > A working engineer's guide to robot actuators — electric, hydraulic, pneumatic, series-elastic, QDD, and soft — with real power/force-density numbers, products, and a selection cheat-sheet. An actuator is the thing that actually moves. Sensors perceive, controllers decide, structure holds it all together — but the actuator is where electrical or fluid power becomes mechanical work, and it is almost always the component that decides what your robot can and cannot physically do. Pick the wrong one and no amount of clever control will save you. Pick the right one and a mediocre controller still does useful work. This guide is the long version. We'll go family by family — electric, hydraulic, pneumatic — then through the things that don't fit neatly in a box: series-elastic actuators (SEA), quasi-direct-drive (QDD), pneumatic muscles, shape-memory alloy (SMA), and piezo. For each, real numbers with units, real products you can buy, and opinions with reasons attached. The goal is that you finish able to size and select an actuator for a specific job, not just recite a textbook taxonomy. **The take**: For 90% of robotics built in 2026, an electric BLDC motor plus a gearbox is the right answer — it's controllable, clean, efficient, and the supply chain is mature. Hydraulics win only when you need extreme force density in a small envelope and can tolerate the mess; pneumatics win only at the gripper, where cheap compliance and speed matter more than precision. 
The interesting frontier isn't a new energy source — it's how we *arrange* the electric motor: low gear ratios (QDD) and deliberate elasticity (SEA) are what make legged and contact-rich robots work. Companion reading: [servo motors](/posts/servo-motors-ultimate-guide/), [brushless DC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [gearboxes (harmonic & cycloidal)](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), and [end-effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What an actuator actually is](#what) 3. [The tradeoff space](#tradeoffs) 4. [Electric actuators](#electric) 5. [Hydraulic actuators](#hydraulic) 6. [Pneumatic actuators](#pneumatic) 7. [Linear actuators deep-dive](#linear) 8. [Series-elastic & variable-stiffness](#sea) 9. [Quasi-direct-drive (QDD)](#qdd) 10. [Soft & novel actuators](#soft) 11. [Backdrivability, transparency & force control](#backdrive) 12. [Sizing & selecting an actuator](#sizing) 13. [Comparison tables & cheat-sheet](#tables) 14. [Frequently asked questions](#faq) ## Key takeaways - An actuator converts stored energy (electrical, hydraulic, pneumatic) into controlled mechanical motion. It is the muscle; everything else is nerves and bone. - The three big families are electric, hydraulic, and pneumatic. Emerging classes — SEA, QDD, soft/McKibben, SMA, piezo — are mostly clever rearrangements or niche physics, not replacements. - **Power density** (W/kg) and **force density** (N/kg or N/cm² of area) are different axes. Hydraulics dominate force density; electric motors dominate controllability and efficiency. - Hydraulic actuators reach roughly 5,000–35,000 kPa (50–350 bar) working pressure, giving cylinder force densities no electric drive matches in the same envelope. The cost is pumps, hoses, heat, leaks, and maintenance. 
- Pneumatics run at ~600–1,000 kPa (6–10 bar), are inherently compliant and cheap, and own factory end-of-arm tooling for exactly those reasons. They are poor at mid-stroke position control. - **Electric BLDC + gearbox** is the default for arms, AGVs, cobots, and most automation: 85–95% efficient, clean, precise, and backed by a deep supply chain (Maxon, Kollmorgen, Harmonic Drive, Nabtesco). - **Backdrivability** is set mostly by gear ratio and friction, not by the motor. High-ratio harmonic/worm drives are effectively non-backdrivable; low-ratio QDD drives are transparent. - **Series-elastic actuators** deliberately put a spring between motor and load to turn position control into force control and to survive impacts — the basis of much legged-robot and rehab hardware. - **QDD actuators** (BLDC + 6:1 to 10:1 single-stage gearing + FOC) are why MIT Cheetah, Unitree quadrupeds, and modern humanoids can do dynamic, contact-rich motion with proprioceptive force sensing. - Atlas famously ran hydraulics for years for force density, then Boston Dynamics rebuilt it all-electric in 2024 — a clean signal of where the field is heading once electric force density is "good enough." - For linear motion: **ball-screw** for efficiency and load, **lead-screw** for low cost and self-locking holding, **belt** for speed over long strokes, **linear motor** for bandwidth and zero backlash. - Soft and novel actuators (McKibben muscles, SMA, piezo, EAP) are real but niche — used where compliance, silence, scale, or unusual form factors beat raw performance. - Size by the *worst* point in the duty cycle, not the average. Thermal limits, not torque limits, kill most actuators in the field. ## What an actuator actually is Strip away the marketing and an actuator does one job: take stored power and produce a controlled force or torque over a displacement. The "controlled" part is what separates an actuator from a motor or a cylinder bought off a shelf. 
A bare BLDC motor is a transducer; bolt on a gearbox, an encoder, and a drive running field-oriented control and you have an *actuator* — a closed-loop force/position source you can command. ### The muscle analogy, used carefully Biology is a useful frame if you don't take it too far. Muscle is a linear, contractile, compliant actuator with absurd control resolution (motor units recruited progressively) and the ability to act as both motor and brake. It's also slow to respond chemically, can only pull (never push), and has terrible peak power compared to its continuous power. Most engineered actuators invert that: rotary, can push and pull, fast, but stiff and with poor intrinsic energy storage. The whole story of SEA, QDD, and soft actuators is the field trying to claw back muscle's good properties — compliance, impact tolerance, force control — without giving up the electric motor's controllability. ### The three families plus the frontier **Electric** — electromagnetic torque from current in a magnetic field. Rotary by nature (BLDC, brushed DC, stepper, AC servo), made linear with screws, belts, or by literally unrolling the motor (linear motors). Dominates by sheer breadth. **Hydraulic** — pressurized incompressible fluid (oil) pushes a piston. Enormous force density, high stiffness, but needs a power unit and plumbing. **Pneumatic** — compressed air pushes a piston or inflates a structure. Cheap, fast, compliant, clean, but soft and hard to position precisely. **The frontier** — series-elastic (a spring in series with an electric drive), variable-stiffness (a tunable spring), QDD (low-gear-ratio electric), and the genuinely different physics of McKibben muscles, SMA, piezo, and electroactive polymers. > Rule of thumb: if you can't name the energy source, the conversion mechanism, and the control variable (current? flow? pressure?), you don't yet understand the actuator well enough to size it. 
## The tradeoff space There is no best actuator, only best-for-a-job. The job is defined by where it sits in a multi-axis tradeoff space. Get fluent in these axes and selection becomes mechanical. ### The axes that matter **Power density (W/kg)** — how much mechanical power per unit mass. Matters for anything that moves the actuator itself: legs, arms, drones, mobile robots. Hydraulic *systems* are heavy because of the power unit, but hydraulic *actuators* at the joint are light and powerful. **Force/torque density (N/kg, N·m/kg, or N/cm²)** — peak force in a given size or mass. Hydraulic cylinders are the champions: a 50 mm bore cylinder at 21,000 kPa (210 bar) makes about 41 kN of push. No comparable-mass electric drive comes close. **Bandwidth (Hz)** — how fast the actuator can change force/position. Piezo: kHz. Electric direct-drive: 100s of Hz. Geared electric: 10s of Hz at the output. Hydraulic: tens of Hz, valve-limited. Pneumatic: a few Hz for controlled motion because air is compressible. **Controllability** — how precisely and linearly you can command output. Electric wins outright: torque is nearly proportional to current. Hydraulic is good with servo-valves. Pneumatic is poor mid-stroke. **Efficiency** — electric drivetrains hit 85–95% wall-to-shaft. Hydraulic systems are 40–60% wall-to-work after pump, valve throttling, and leakage losses. Pneumatic is brutal: 10–20% wall-to-work once you count compressor inefficiency and expansion losses. Pneumatic air is the most expensive energy in the factory per joule delivered. **Backdrivability / transparency** — can the load move the actuator? Critical for contact, safety, and force sensing. Set mostly by gear ratio and friction. Direct-drive and QDD are transparent; harmonic and worm drives are not. **Cost & supply chain** — a NEMA 23 stepper is $25. A Harmonic Drive actuator module is $1,500–4,000. A servo-valve is $1,000–3,000. A custom hydraulic power unit is five figures before you've moved anything. 
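As a rough illustration of the efficiency axis above, the sketch below converts a required amount of mechanical work into wall energy using the wall-to-work ranges quoted in this section. The function names and the 100 J example are illustrative assumptions, not measurements of any particular system.

```python
# Wall-to-work efficiency ranges quoted in this guide, as fractions.
EFFICIENCY = {
    "electric": (0.85, 0.95),
    "hydraulic": (0.40, 0.60),
    "pneumatic": (0.10, 0.20),
}


def wall_energy_j(work_j, efficiency):
    """Energy drawn at the wall to deliver work_j of mechanical work."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return work_j / efficiency


def wall_energy_range_j(family, work_j):
    """Return (best_case, worst_case) wall energy for an actuator family."""
    low_eff, high_eff = EFFICIENCY[family]
    return wall_energy_j(work_j, high_eff), wall_energy_j(work_j, low_eff)


if __name__ == "__main__":
    for family in EFFICIENCY:
        best, worst = wall_energy_range_j(family, 100.0)
        print(f"{family}: {best:.0f} to {worst:.0f} J at the wall per 100 J of work")
```

With these ranges, 100 J of delivered work costs roughly 105 to 118 J at the wall for an electric drivetrain and 500 to 1,000 J for a pneumatic one, which is the arithmetic behind calling compressed air the most expensive energy in the factory.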
### You can't max all of them

These axes trade against each other. Adding a gearbox multiplies torque density but destroys backdrivability and adds backlash. A servo-valve gives a hydraulic actuator bandwidth but costs more than the cylinder. A series spring buys you force control and impact tolerance at the direct cost of position bandwidth. Every actuator choice is a position in this space, and the art is knowing which axis your application actually cares about.

## Electric actuators

If you're building a robot in 2026 and you don't have a specific reason to do otherwise, you're using electric actuators. They're clean, controllable, efficient, quiet enough, and supported by the deepest component ecosystem of any family.

### Rotary: the BLDC + gearbox stack

The workhorse is a brushless DC (BLDC) or AC servo motor driven by field-oriented control, almost always followed by a gearbox. See the [BLDC deep-dive](/posts/brushless-dc-motors-bldc-ultimate-guide/) and the [servo-motor guide](/posts/servo-motors-ultimate-guide/) for the motor side; here we care about the actuator as a unit.

Why the gearbox? A typical 100–500 W BLDC motor wants to spin at 3,000–8,000 rpm and makes modest torque — tenths of a N·m to a couple of N·m continuous. A robot joint wants tens to hundreds of N·m at tens of rpm. The [gearbox](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/) bridges that gap. Reduction `N` multiplies torque and divides speed (minus efficiency):

```
T_out = T_motor × N × η_gear
ω_out = ω_motor / N
```

Common choices:

- **Planetary** — 3:1 to ~100:1 per stage, 90–97% efficient, some backlash (arcmin-class), cheap and robust. Good general-purpose.
- **Harmonic (strain-wave)** — 30:1 to 160:1 single stage, near-zero backlash, compact, but ~70–90% efficient and not cheap. The default for arm joints where precision matters (used heavily by industrial-arm and cobot makers; Harmonic Drive LLC owns this space).
- **Cycloidal** — 30:1 to 200:1, high shock-load capacity, low backlash, good for high-torque base joints. Nabtesco RV series dominates heavy industrial arms. Maxon's EC-series motors with GP gearheads, Kollmorgen frameless kits, and integrated modules from Harmonic Drive (FHA/SHA series) are the components you actually buy. ### Linear: turning rotation into a push Electric linear actuators take a rotary motor and convert with a screw or belt — covered in depth in the [linear section below](#linear). For now: ball-screw for efficiency, lead-screw for cost and self-locking, belt for long fast strokes, linear motor for bandwidth. ### Why electric wins by default - Torque is proportional to current — clean, fast, linear control with a cheap current sensor. - 85–95% efficiency means modest cooling and modest batteries. - No fluids, no compressor, no leaks, no separate power unit. - Encoders are cheap and precise; closed-loop position control is a solved problem. - The supply chain is enormous, so prices keep falling and availability is good. The honest weaknesses: peak force density trails hydraulics, and at very high continuous torque the motor gets thermally limited — copper losses scale with current², so doubling torque quadruples heating. That thermal wall, not the torque rating on the datasheet, is what kills electric actuators in real duty cycles. ## Hydraulic actuators Hydraulics are about force density and stiffness, full stop. When you need huge force in a small joint envelope and you can tolerate the supporting infrastructure, nothing else competes. ### How the system is built A hydraulic system is a *system*, not a part: an electric or combustion-driven **pump** pressurizes oil, an **accumulator** stores energy and smooths spikes, **valves** (especially servo-valves and proportional valves) meter flow to **cylinders** (linear) or **hydraulic motors** (rotary). A reservoir, filters, and a cooler round it out. 
Working pressures are typically 5,000–35,000 kPa (50–350 bar), with mobile and aerospace systems pushing 21,000–35,000 kPa (210–350 bar). Cylinder force is just pressure times piston area:

```
F = P × A
A = π/4 × D²        (D = bore diameter)

Example: D = 50 mm, P = 21,000 kPa (21 MPa)
A = π/4 × (0.050 m)² = 1.96 × 10⁻³ m²
F = 21 × 10⁶ Pa × 1.96 × 10⁻³ m² ≈ 41,200 N ≈ 41 kN
```

41 kN from a 50 mm cylinder. Bosch Rexroth, Parker, Moog, and Eaton supply this world; Moog servo-valves are the reference for high-bandwidth force control.

### Why Atlas used hydraulics — then dropped them

For years the Boston Dynamics Atlas humanoid was hydraulically actuated, and the reason was force density: hydraulic actuators let Atlas pack the peak joint torques needed for jumps, backflips, and recovery into a human-sized envelope. Hydraulic stiffness also gives crisp force control through good servo-valves.

But hydraulics on a legged robot are a nightmare to live with. They leak (Atlas videos famously showed fluid streaks), they're loud, the power unit and plumbing are heavy and inefficient, and maintenance is constant. In 2024 Boston Dynamics retired the hydraulic Atlas and revealed an all-electric Atlas. That's the headline event of this decade in actuation: once electric drives (QDD-style, see [below](#qdd)) got close enough on force density, the operational advantages of electric — efficiency, cleanliness, controllability, no plumbing — won decisively. Agility Robotics' Digit was electric from the start for the same reasons.

### When hydraulics still win

- Heavy construction and forestry robots, excavators-turned-autonomous, large manipulators.
- Anything needing >50 kN at a single joint in a tight envelope.
- High-stiffness force application (presses, test rigs).
- Situations where a combustion engine already provides the prime mover.

> If your robot fits through a normal door and runs on batteries, you almost certainly don't want hydraulics in 2026.
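The cylinder-force arithmetic from this section can be sketched as a pair of helper functions. This is a minimal illustration of the ideal `F = P × A` relation only; the function names are hypothetical, and it assumes full-bore extend force with no seal friction, rod area, or back-pressure.

```python
import math


def piston_area_m2(bore_m):
    """Full-bore piston area, A = pi/4 * D^2."""
    return math.pi / 4.0 * bore_m ** 2


def cylinder_force_n(bore_m, pressure_pa):
    """Ideal extend force F = P * A (ignores friction and back-pressure)."""
    return pressure_pa * piston_area_m2(bore_m)


def bore_for_force_m(force_n, pressure_pa):
    """Bore needed for a target force: F = P * pi/4 * D^2 solved for D."""
    return math.sqrt(4.0 * force_n / (math.pi * pressure_pa))


if __name__ == "__main__":
    # The worked example from the text: 50 mm bore at 21 MPa (210 bar).
    force = cylinder_force_n(0.050, 21e6)
    print(f"50 mm bore at 21 MPa: {force / 1000:.1f} kN")
    # Inverse question: what bore gives 50 kN at the same pressure?
    print(f"Bore for 50 kN at 21 MPa: {bore_for_force_m(50e3, 21e6) * 1000:.1f} mm")
```

Real cylinders deliver somewhat less than the ideal figure, so treat the result as an upper bound when checking a force budget.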
## Pneumatic actuators Pneumatics trade precision for cheapness, speed, compliance, and cleanliness. That trade is exactly right at the gripper and exactly wrong almost everywhere else. ### How it works and what's available Compressed air at ~600–1,000 kPa (6–10 bar) from a shop compressor feeds cylinders, rotary actuators, grippers, and vacuum generators through solenoid or proportional valves. Festo and SMC are the dominant suppliers; a Festo DSNU round cylinder or an SMC MHZ2 parallel gripper is in tens of thousands of factory cells worldwide. Force again is pressure times area, but the pressures are 10–50× lower than hydraulic, so a 32 mm bore cylinder at 600 kPa makes only about 480 N. You get speed and softness, not brute force. ### Why pneumatics own end-of-arm tooling Walk any factory and the [grippers](/posts/end-effectors-grippers-ultimate-guide/) are mostly pneumatic. Reasons: - **Cheap compliance** — air is a spring. A pneumatic gripper naturally accommodates part variation and won't crush a fragile part if you regulate pressure. Getting equivalent compliance from an electric gripper means force sensing and control loops. - **Speed** — open/close cycles in tens of milliseconds. Pick-and-place loves this. - **Two-state simplicity** — most grippers and clamps only need open/closed. Solenoid valve, done. No drive, no encoder, no tuning. - **Cleanliness & safety** — no electrical sparking at the tool (good for ATEX/explosive environments), and exhausted air is clean. - **Vacuum** — a Venturi vacuum generator off the same air supply handles suction-cup picking of boxes, sheets, and glass. ### Where pneumatics fail Mid-stroke position control. Air compresses, so a pneumatic cylinder is a poorly-damped spring-mass system that wants to slam to the endstops. You can servo-control pneumatics with proportional valves and good feedback, but it's finicky and rarely worth it versus an electric actuator. 
Energy efficiency is also terrible — 10–20% wall-to-work — making compressed air the most expensive utility per joule in most plants. > Use pneumatics for binary, fast, compliant, clean tasks at the tool. Don't ask them to hold a precise mid-stroke position. ## Linear actuators deep-dive Lots of robotics motion is linear — Cartesian gantries, presses, Z-axes, telescoping joints. The conversion mechanism dominates the actuator's character far more than the motor does. ### Ball-screw A ground screw with recirculating ball bearings between screw and nut. **80–95% efficient**, high load capacity, long life, low friction. Because of low friction it's also **backdrivable** — gravity or load can spin it — which means a vertical axis needs a brake. Used wherever efficiency and load matter: machine tools, heavy gantries, high-end linear actuators (e.g. Thomson, NSK, Bosch Rexroth screw assemblies). ### Lead-screw (ACME / trapezoidal) Sliding-contact thread, often with a polymer nut. **20–50% efficient** — the high friction is the point: it makes the screw **self-locking** (non-backdrivable) so it holds position with zero power. Cheap, simple, fine for low-duty positioning and anything that must hold a load when de-energized. The efficiency penalty means more motor for the same output. ### Belt drive A toothed belt over pulleys. Lower force, but **very fast over long strokes** and cheap. Backlash from belt stretch limits precision. The standard choice for the long axis of a gantry or a 3D-printer-style motion system where speed beats stiffness. ### Linear motor (direct drive) No screw or belt — the motor's force acts directly on the moving stage (an unrolled BLDC). **Zero backlash, very high bandwidth (100s of Hz), high acceleration, no wear parts in the drivetrain.** The downsides: lower force density (you're paying for every newton with magnets and copper), heat dissipation into the structure, and cost. 
Used in semiconductor lithography, pick-and-place machines, and high-throughput inspection — anywhere settling time and precision dominate.

### Lead/pitch, and no-load vs loaded

Screw output force and speed depend on **lead** (axial travel per revolution):

```
v_linear = (rpm / 60) × lead
F_linear ≈ (2π × η × T_motor) / lead

Smaller lead → more force, less speed (and more self-locking tendency)
Larger lead  → more speed, less force, more likely backdrivable
```

A subtle trap: efficiency is *load-dependent*. A lead-screw might show a reasonable static efficiency on the datasheet but be far worse under light load and dynamic conditions. Always check efficiency at your actual operating force, and remember that backdriving efficiency is lower than driving efficiency — that asymmetry is what makes self-locking possible. (See the [comparison table](#tables) for a side-by-side.)

## Series-elastic & variable-stiffness

Here's the counterintuitive idea that reshaped legged and rehab robotics: deliberately make your actuator *softer* by putting a spring in series between the motor/gearbox and the load.

### Why add a spring on purpose

A stiff geared actuator is a great position source and a terrible force source — tiny position errors create huge forces, and impacts spike loads through the gear teeth. Insert a known spring in series and three things happen:

1. **Force becomes measurable from deflection.** Measure the spring's compression with an encoder and you know output force exactly: `F = k × Δx`. The spring is your force sensor.
2. **Force control becomes position control of the spring.** The motor servos spring deflection, which is far more robust than trying to control force through a stiff, high-friction gearbox.
3. **Impact energy is absorbed by the spring**, not slammed through the gear teeth — the actuator survives footstrikes and collisions that would destroy a rigid drive.

The cost: the spring adds a low-frequency pole, so position bandwidth drops.
You've traded crisp positioning for clean force control and robustness. For a leg hitting the ground, that's a fantastic trade. ### Where SEAs are used Gill Pratt's SEA work led to robots like the original Cog/M2 and, more famously, the actuators behind much of modern legged robotics. Boston Dynamics and Agility have used elastic elements in legs; rehabilitation exoskeletons and the Valkyrie/THOR-class humanoids used SEA extensively because gentle, controllable force against a human body is the whole job. ### Variable-stiffness actuators (VSA) A VSA lets you *tune* the series stiffness on the fly — soft for a delicate or dynamic task, stiff for precise positioning. Mechanically it's usually two motors antagonistically loading nonlinear springs (the DLR/VSA-II and "MACCEPA" designs are the canonical references). They're complex and heavy for what they deliver, so they've stayed mostly in research, but the concept — match impedance to the task — is exactly right and shows up in software form (impedance control) on QDD robots instead. ## Quasi-direct-drive (QDD) If SEA is the mechanical answer to force control, QDD is the electrical-plus-software answer, and it's the one that's actually winning in legged and humanoid robots. ### The idea: skip the big gearbox A direct-drive motor (no gearbox) is perfectly backdrivable and transparent, but to make joint-level torque it must be huge and heavy. A high-ratio geared motor is compact but stiff, non-backdrivable, and can't sense external force without a torque sensor. QDD splits the difference: a **large-diameter, high-torque BLDC motor** plus a **single low-reduction stage, typically 6:1 to 10:1**, driven by field-oriented control. Why this works so well: - Low gear ratio means **the actuator stays backdrivable** — the load can move the motor, and friction is low. 
- Because torque ≈ current and the gearing is light, you can **estimate output torque from motor current alone** — proprioceptive force control, no extra torque sensor. This is the key trick. - The big motor provides enough torque density that a single stage is sufficient for legs. - FOC gives you high-bandwidth current (hence torque) control. ### The lineage The MIT Cheetah (Sangbae Kim's lab) productionized QDD: custom high-torque "gap-radius" motors with ~5–7:1 planetary stages and current-based torque estimation enabled fast, robust, contact-rich running and jumping. That architecture went commercial through Unitree (the quadrupeds, and the cheap motor modules everyone now prototypes with) and is the actuation backbone of most modern [legged robots](/posts/legged-quadruped-robot-hardware-ultimate-guide/) and [humanoids](/posts/humanoid-robot-hardware-ultimate-guide/). The all-electric Atlas, Unitree H1/G1, and many others lean on QDD-style joints. ### QDD vs SEA They solve the same problem — force control and impact tolerance — by different means. QDD does it with low gearing + current sensing (no physical compliance, so high bandwidth but it must control its own stiffness in software). SEA does it with a physical spring (intrinsic impact tolerance, lower bandwidth). The field has largely converged on QDD for dynamic locomotion because software impedance control on a transparent drive is more flexible than a fixed mechanical spring, and because removing the spring restores bandwidth. SEA persists where physical compliance is a hard safety requirement (against human bodies). > If you're building a legged or contact-rich robot today, start with QDD modules. They're now cheap enough to prototype with and give you force control "for free" from current sensing. ## Soft & novel actuators Beyond the big three lies a zoo of actuators that exploit different physics. Most are niche, but each owns a corner where conventional actuators are awkward. 
### McKibben pneumatic muscles A rubber bladder inside a braided mesh sleeve. Inflate it and the braid geometry forces it to **shorten and fatten**, pulling like a muscle. Festo's "Fluidic Muscle" (DMSP/MAS) is the commercial example. - Contractile (pull-only), very high peak force-to-weight (up to ~1,500 N from a 20 mm Festo DMSP muscle), inherently compliant. - Nonlinear, hysteretic, needs air — control is harder than an electric drive. - Used in exoskeletons, biomimetic limbs, and lightweight assistive devices where muscle-like compliance and high force-to-weight beat precision. ### Shape-memory alloy (SMA) Nitinol wire that contracts ~4–5% when heated (electrically) above its transition temperature, returning when cooled. - Silent, tiny, high force-to-weight, no moving parts to wear. - **Slow** (cooling-limited, often >1 s cycle) and **inefficient** (you're heating metal), with limited strain and short fatigue life if overstrained. - Used in micro-grippers, deployable space mechanisms, medical devices, and anywhere silence and tiny scale dominate. ### Piezoelectric A piezo crystal strains a fraction of a percent under voltage — minuscule displacement but enormous bandwidth (kHz) and stiffness. - **Sub-nanometer resolution, kHz response, high force, microscopic stroke.** - Used directly for nanopositioning (microscope stages, lithography fine-stages, fast steering mirrors), and in **ultrasonic/inchworm piezo motors** (Physik Instrumente, Nanomotion) that accumulate tiny steps into macroscopic, high-resolution motion with zero backlash and self-locking holding. ### Electroactive polymers (EAP / dielectric elastomers) "Artificial muscle" polymers that strain under high electric fields. Large strain, soft, lightweight — but need kilovolts, suffer reliability/breakdown issues, and remain mostly a research curiosity in 2026 despite decades of promise. 
> Reach for a novel actuator only when a conventional one physically can't do the job — sub-micron precision (piezo), centimeter-scale silent motion (SMA), or muscle-like soft pulling (McKibben). Otherwise an electric drive is less trouble.

## Backdrivability, transparency & force control

This deserves its own section because it's the property that decides whether your robot can safely touch the world — and it's the one engineers most often get wrong.

### Definitions

**Backdrivable** — you can move the output by hand (or the load can move it) and the motor turns. **Transparent** — the actuator faithfully transmits forces in both directions with little distortion from friction or inertia. A direct-drive motor is both; a worm-gear drive is neither.

### What sets it

Mostly **gear ratio and friction**, not the motor. Reflected inertia and friction scale with the *square* of the gear ratio:

```
J_reflected = J_motor × N²
friction_reflected ≈ friction_motor × N²   (plus the gearbox's own friction)
```

A 100:1 harmonic drive reflects the motor's tiny inertia as a large effective inertia at the output and adds its own meaningful friction — the result feels like trying to backdrive through molasses. A 6:1 QDD drive reflects 36× inertia, which is small enough that the joint stays transparent.

### Why it matters

- **Force control** — a transparent drive lets you control force well (directly, or via current as in QDD). A non-backdrivable drive fights you and needs a separate torque sensor for clean force control.
- **Safety / [cobots](/posts/collaborative-robots-cobots-ultimate-guide/)** — a backdrivable arm yields when it hits a person; a stiff geared arm transmits the full collision force. Cobots either use moderate gearing plus joint torque sensors (Universal Robots, KUKA iiwa) or accept the gearing and rely on current-based collision detection.
- **Contact-rich tasks** — assembly, polishing, and any task involving controlled contact need the actuator to be a good force source, which means transparency or excellent torque sensing.

The two roads to good force control: **(a)** make the drive transparent (QDD, direct-drive, SEA) and infer/measure force cheaply, or **(b)** keep the high gearing for torque density and add a dedicated joint torque sensor (Harmonic Drive + strain-gauge torque sensor, the classic industrial-arm-with-force-control approach). Road (a) is winning in mobile/legged/humanoid; road (b) still rules precise industrial arms.

## Sizing & selecting an actuator

Now the practical part. Here's how to actually pick and size, in order.

### Step 1 — Build the force/torque budget

Sum the worst-case loads at the actuator output: gravity, inertia (`τ = J × α`), friction, process forces, and a safety factor. For a rotary joint:

```
τ_peak = J_total × α_max + τ_gravity + τ_friction + τ_process
```

Size the actuator's **peak** torque above `τ_peak` with margin (1.5–2× is common), and the **continuous** torque above the RMS torque over the duty cycle.

### Step 2 — Compute the RMS / thermal load

This is where most designs fail in the field. Motors are thermally limited; continuous torque depends on how fast heat leaves the windings. Compute RMS torque over the motion cycle:

```
τ_rms = sqrt( (1/T) × ∫ τ(t)² dt )
```

`τ_rms` must stay under the continuous rating at your actual ambient and cooling. A motor that handles the peak can still cook itself if the *average* is too high. Doubling torque quadruples I²R heating — respect that exponent.

### Step 3 — Set speed and pick the gear ratio

You know the output speed and torque you need; the motor has a speed/torque sweet spot. Pick `N` to map one onto the other, then check that backdrivability, backlash, and efficiency are acceptable. High `N` for torque density (industrial arm), low `N` for transparency (legged/cobot).
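Steps 1 to 3 above can be sketched in a few lines. This is a minimal illustration, assuming the duty cycle is approximated as piecewise-constant torque segments and the gearbox has a single fixed efficiency; the function names and example numbers are hypothetical, and a real sizing pass would use the motor vendor's thermal data.

```python
import math


def peak_torque(j_total, alpha_max, tau_gravity, tau_friction, tau_process):
    """Step 1: worst-case output torque (N*m) from the torque budget."""
    return j_total * alpha_max + tau_gravity + tau_friction + tau_process


def rms_torque(segments):
    """Step 2: RMS torque over a cycle of (torque_nm, duration_s) segments."""
    total_time = sum(duration for _, duration in segments)
    if total_time <= 0:
        raise ValueError("cycle must have positive duration")
    weighted = sum(torque ** 2 * duration for torque, duration in segments)
    return math.sqrt(weighted / total_time)


def motor_side(tau_out, speed_out_rpm, ratio, eta_gear):
    """Step 3: motor torque and speed implied by T_out = T_motor * N * eta."""
    return tau_out / (ratio * eta_gear), speed_out_rpm * ratio


if __name__ == "__main__":
    tau_pk = peak_torque(0.8, 25.0, 15.0, 2.0, 3.0)            # 40 N*m
    cycle = [(40.0, 0.5), (10.0, 2.0), (-30.0, 0.5), (0.0, 1.0)]
    tau_rms = rms_torque(cycle)
    t_motor, n_motor = motor_side(tau_pk, 30.0, 100.0, 0.8)
    print(f"peak {tau_pk:.1f} N*m, rms {tau_rms:.1f} N*m")
    print(f"motor must supply {t_motor:.2f} N*m at {n_motor:.0f} rpm")
```

Comparing `tau_rms` against the continuous rating and `tau_pk` (with margin) against the peak rating reproduces the two checks described in Steps 1 and 2; rerunning `motor_side` over a few candidate ratios is a quick way to explore Step 3.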
### Step 4 — Check bandwidth Does the actuator respond fast enough for the control task? Geared electric: fine for arms and AGVs. Need >50 Hz force control at the output? You're looking at QDD, SEA, direct-drive, or hydraulic with servo-valves — not a high-ratio harmonic drive. ### Step 5 — Apply the decision tree > **The decision tree, compressed:** > 1. Need precise position/torque, clean, battery-powered, fits through a door? → **Electric (BLDC + gearbox)**. Default. > 2. Need force control, impact tolerance, transparency for legs/contact? → **QDD** (or **SEA** if physical compliance is mandatory). > 3. Need >50 kN in a tight joint and can tolerate plumbing? → **Hydraulic**. > 4. Binary, fast, compliant, clean motion at the tool? → **Pneumatic**. > 5. Sub-micron precision? → **Piezo**. Silent centimeter-scale? → **SMA**. Muscle-like soft pull? → **McKibben**. ### Step 6 — Don't forget the boring stuff Connectors, encoder resolution, brake (any vertical/backdrivable axis), thermal path, IP rating, EMC, and whether you can actually buy it in volume. The actuator that's perfect on paper but has a 40-week lead time is the wrong actuator. ## Comparison tables & cheat-sheet Numbers below are representative order-of-magnitude figures for typical robotics-scale components, useful for first-pass selection — always confirm against the specific product datasheet. 
### Actuator family comparison

| Property | Electric (BLDC+gear) | Hydraulic | Pneumatic | SEA | QDD | Piezo | SMA |
|---|---|---|---|---|---|---|---|
| Power density (W/kg) | 100–300 | 300–600 (actuator) | 50–150 | 100–250 | 150–400 | low (high BW, tiny stroke) | low |
| Force/torque density | Medium | **Very high** | Low | Medium | Medium–high | High (tiny stroke) | High (tiny stroke) |
| Working "pressure"/source | DC bus 24–800 V | 5,000–35,000 kPa | 600–1,000 kPa | DC bus | DC bus | 100s of V | I²R heating |
| Efficiency (wall→work) | 85–95% | 40–60% | 10–20% | 80–90% | 85–93% | high (static) | <10% |
| Bandwidth | 10s–100s Hz | 10s Hz | few Hz | 10s Hz | 100s Hz | kHz | <1 Hz |
| Controllability | Excellent | Good (servo-valve) | Poor mid-stroke | Excellent (force) | Excellent (force) | Excellent | Poor |
| Backdrivable | Depends on ratio | Yes (with valve) | Somewhat (springy) | Yes | **Yes** | No (self-lock) | No |
| Cleanliness | Clean | Leaks/oil | Clean | Clean | Clean | Clean | Clean |
| Cost | Low–medium | High (system) | Low | Medium | Medium | High | Low |
| Typical use | Arms, AGVs, cobots | Heavy/construction, ex-Atlas | Grippers, EOAT, vacuum | Legs, rehab, exo | Legged, humanoid | Nanopositioning | Micro/medical/space |

### Linear actuator comparison

| Type | Efficiency | Backdrivable | Speed | Backlash | Relative cost | Pick it when |
|---|---|---|---|---|---|---|
| Ball-screw | 80–95% | Yes (needs brake) | Medium | Low | Medium | Efficiency + heavy load |
| Lead-screw (ACME) | 20–50% | No (self-locking) | Low–medium | Low | Low | Cheap, must hold w/o power |
| Belt drive | 90%+ | Yes | **High** | Medium (stretch) | Low | Long, fast strokes |
| Linear motor | n/a (direct) | Yes | Very high | **None** | High | Bandwidth, precision, zero backlash |

### Selection cheat-sheet

| If your priority is… | Reach for… |
|---|---|
| General-purpose robot joint | BLDC + planetary or harmonic |
| Precise industrial arm joint | BLDC + harmonic/cycloidal + torque sensor |
| Legged / dynamic locomotion | QDD modules (low ratio + FOC) |
| Human-contact force control | SEA, or QDD/torque-sensed cobot drive |
| Maximum force in tiny envelope | Hydraulic cylinder + servo-valve |
| Fast binary gripping/clamping | Pneumatic cylinder/gripper |
| Picking boxes/sheets/glass | Pneumatic vacuum (Venturi) |
| Long fast Cartesian axis | Belt drive |
| Heavy efficient linear axis | Ball-screw (+ brake if vertical) |
| Hold a vertical load unpowered | Lead-screw (self-locking) |
| Sub-micron positioning | Piezo stage / piezo motor |
| Silent, tiny, low-cycle motion | SMA wire |
| Muscle-like compliant pull | McKibben pneumatic muscle |

## Frequently asked questions

**What's the difference between an actuator and a motor?** A motor is a raw transducer that converts energy to motion. An actuator is a complete, controllable motion unit — motor plus transmission, feedback, and drive electronics arranged to produce a commanded force or position. Every actuator contains a prime mover (motor, cylinder, etc.); not every motor is an actuator.

**Why are most factory grippers pneumatic if pneumatics are so inefficient?** Because at the gripper you're paying for compliance, speed, simplicity, and cleanliness, not energy efficiency. A pneumatic gripper is an air spring that won't crush parts, cycles in tens of milliseconds, needs only a solenoid valve, and sparks nothing. Electric grippers match the precision but cost more and add control complexity. For binary clamping at the tool, pneumatics still win on total cost.

**Why did Boston Dynamics switch Atlas from hydraulic to electric?** Hydraulics gave the old Atlas the force density for explosive moves, but they leaked, were loud and inefficient, and demanded heavy plumbing plus constant maintenance. By 2024, electric (QDD-style) actuators had enough force density to do the job, so the all-electric Atlas got better efficiency, cleanliness, and controllability with no fluid system.
It's the clearest signal that electric is overtaking hydraulics wherever it can.

**What is a quasi-direct-drive (QDD) actuator?** A large high-torque BLDC motor with a single low-reduction gear stage (about 6:1 to 10:1) driven by field-oriented control. The low ratio keeps it backdrivable and transparent, and because torque tracks motor current you can sense output force from current alone — proprioceptive force control with no extra torque sensor. It's the dominant architecture for legged and humanoid robots.

**Why deliberately add a spring (SEA) — doesn't that hurt precision?** It hurts position bandwidth, yes, but it buys clean force control (force = spring stiffness × deflection, so the spring is your force sensor), impact tolerance (the spring absorbs shock instead of the gear teeth), and stable interaction with the environment. For a leg hitting the ground or a robot pushing on a human, that trade is exactly right.

**What makes an actuator backdrivable, and why care?** Mostly low gear ratio and low friction — reflected inertia and viscous friction scale with ratio squared. Backdrivability matters for force control, collision safety, and contact-rich tasks: a backdrivable arm yields when it hits something, while a high-ratio geared arm transmits the full collision force and needs a torque sensor to feel anything.

**Ball-screw or lead-screw — how do I choose?** Ball-screw for efficiency (80–95%) and load capacity, but it's backdrivable so a vertical axis needs a brake. Lead-screw for low cost and self-locking holding — its high friction (20–50% efficiency) means it holds position with zero power, at the cost of needing a bigger motor for the same output. Cheap holding axis → lead-screw; efficient working axis → ball-screw.

**When should I use a linear motor instead of a screw?** When you need very high bandwidth, high acceleration, zero backlash, and excellent settling — semiconductor stages, high-speed pick-and-place, precision inspection.
You pay with lower force density, heat dumped into the structure, and higher cost. If raw force matters more than dynamics, a screw is cheaper and more force-dense. **How do I size an actuator so it doesn't overheat?** Size peak torque above your worst-case load with 1.5–2× margin, but the binding constraint is usually thermal: compute RMS torque over the full duty cycle and keep it below the continuous rating at your real ambient and cooling. Heating scales with current squared, so a duty cycle with brief high-torque spikes can still cook a motor that's "rated" for the peak. **Are soft/McKibben/SMA/piezo actuators ready for real robots?** In their niches, yes. Piezo is mature and standard for nanopositioning. SMA is used in micro-grippers, medical, and space deployables. McKibben muscles appear in exoskeletons and biomimetic limbs. They're not general-purpose replacements for electric drives — reach for them only when conventional actuators physically can't meet the precision, scale, silence, or compliance requirement. **Do hydraulics have any future in mobile robotics?** Limited. They still win for very high force in a tight envelope (heavy construction, forestry, large manipulators) and where a combustion engine already supplies power. But for battery-powered, human-scale robots, electric QDD has largely closed the force-density gap, and the operational disadvantages of hydraulics — weight, inefficiency, leaks, maintenance — make them hard to justify. **What's the single most common sizing mistake?** Sizing to the peak torque on the datasheet and ignoring the thermal/RMS load. Engineers see "10 N·m peak," design for 8 N·m, and then the actuator overheats because the *continuous* rating is 3 N·m and their duty cycle averages 4 N·m. Always size the continuous rating against RMS torque, then check peak separately. ## Changelog - **2026-06-13** — Initial publication. 
--- # SLAM & Robot Localization: The Ultimate Guide URL: https://blog.robo2u.com/posts/slam-localization-ultimate-guide/ Published: 2026-06-12 Updated: 2026-06-20 Tags: slam, localization, mapping, ekf, particle-filter, graph-slam, visual-inertial-odometry, loop-closure, guide Reading time: 38 min > A working engineer's deep guide to SLAM and robot localization in 2026: the chicken-and-egg problem, EKF vs particle filters vs factor-graph SLAM, lidar and visual-inertial stacks, loop closure, map representations, failure modes, and how to choose. A robot driving across a warehouse has to answer one question continuously, dozens of times a second: *where am I?* If it gets that answer wrong by 30 cm it clips a rack; if it gets it wrong by 2 m it is lost. The frustrating part is that the obvious way to answer it — "compare what I see to the map" — assumes you already have a map. And the obvious way to build a map — "stitch together what I see from each known pose" — assumes you already know where you are. You need the pose to build the map and the map to find the pose. That circular dependency is SLAM. This guide is about **Simultaneous Localization and Mapping** and its close cousin, localization against a *known* map. We will start from the state-estimation framing — the belief, the motion model, the observation model — then walk the three algorithmic families (EKF-SLAM, particle filters and FastSLAM, and modern factor-graph SLAM), the front-end/back-end split, scan matching, the real lidar and visual-inertial stacks engineers actually deploy (Cartographer, slam_toolbox, LIO-SAM, FAST-LIO2, ORB-SLAM3, VINS-Fusion, RTAB-Map, OpenVINS; AMCL for known-map localization), loop closure, map representations, the compute budget, the failure modes that will bite you, and how to choose. 
**The take**: in 2026 the default for almost any new system is **factor-graph (pose-graph) SLAM** with a tightly-coupled front-end — lidar-inertial outdoors and on fast platforms, visual-inertial where weight and cost dominate — and you keep filters (EKF, particle filter) for two jobs only: fusing fast proprioceptive sensors into a smooth odometry stream, and Monte-Carlo localization against a map you already trust. The single biggest lever on accuracy is not the algorithm; it is sensor quality, calibration, and whether your environment gives the front-end something to latch onto. Most "SLAM is broken" tickets are really a featureless corridor, a bad extrinsic, or an IMU nobody calibrated. Companion reading: [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/), [robot sensors](/posts/robot-sensors-ultimate-guide/), [mobile robots: AMRs & AGVs](/posts/mobile-robots-amr-agv-ultimate-guide/), [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/), [ROS 2](/posts/ros2-ultimate-guide/), and [machine vision](/posts/machine-vision-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [The chicken-and-egg problem](#chicken-egg) 3. [Problem framing: state, models, and the belief](#framing) 4. [Odometry sources and drift](#odometry) 5. [Filtering vs optimization: the great split](#filtering-vs-optimization) 6. [Front-end vs back-end](#front-back) 7. [Scan matching and lidar SLAM stacks](#lidar-slam) 8. [Visual SLAM and visual-inertial odometry](#visual-slam) 9. [Loop closure and place recognition](#loop-closure) 10. [Map representations](#maps) 11. [The sensor and compute budget](#budget) 12. [Degeneracy and failure cases](#failure) 13. [2D vs 3D, indoor vs outdoor](#dimensions) 14. [Selecting a stack and the Nav2 tie-in](#selecting) 15. 
[Frequently asked questions](#faq) ## Key takeaways - **SLAM is a chicken-and-egg problem solved by estimating both at once.** You jointly estimate the robot's trajectory *and* the map from noisy sensors. Treat them separately and the errors compound; estimate them together and the constraints between them cancel a lot of error. - **It is fundamentally state estimation.** Everything reduces to a *belief* — a probability distribution over where you are and what the world looks like — propagated by a **motion model** and corrected by an **observation model**. Pick the wrong noise models and no algorithm saves you. - **All odometry drifts; the only question is how fast.** Wheel odometry drifts on slip, IMU integration drifts as `t²` in position, visual and lidar odometry drift slower but still without bound. SLAM exists to bound that drift with loop closure and map constraints. - **The field moved from filters to optimization.** EKF-SLAM (`O(n²)` in landmarks) and particle-filter SLAM gave way to **factor-graph / pose-graph SLAM** because optimization over a sparse graph relinearizes, scales, and produces better maps. GTSAM, g2o, and Ceres are the backends. - **Filters still own two niches.** An EKF/UKF (e.g. `robot_localization`) fuses wheel + IMU + GPS into smooth high-rate odometry; a **particle filter (AMCL/MCL)** localizes against a *known* map. Neither is the right tool for building a large map from scratch anymore. - **Front-end vs back-end is the architecture.** The **front-end** turns raw sensor data into constraints (scan matches, feature tracks, loop detections); the **back-end** optimizes the graph of those constraints. Most failures are front-end failures. - **Scan matching is the lidar workhorse.** ICP (point-to-point/point-to-plane) and NDT register scans; the strong lidar stacks — **Cartographer, slam_toolbox, LIO-SAM, FAST-LIO2** — wrap matching in a graph and, increasingly, fuse the IMU tightly. 
- **Visual SLAM splits into feature-based and direct.** **ORB-SLAM3** (features + bag-of-words loop closure) is the reference; **VINS-Fusion** and **OpenVINS** are the visual-inertial standards. Tight IMU coupling beats loose coupling for robustness and scale observability. - **Loop closure is what makes a *map* instead of a long drift.** Recognizing a previously-visited place (bag-of-words / DBoW2, learned descriptors) adds a constraint that snaps the accumulated error back. Without it you have odometry with extra steps. - **Maps cost memory, and the cost is geometric.** A 2D occupancy grid at 5 cm over 100×100 m is ~4 MB; a 3D voxel grid at the same resolution over a building is gigabytes. Choose the representation (grid, point cloud, mesh, topological) for the consumer, not the sensor. - **Degeneracy is the real enemy.** Featureless corridors, symmetric rooms, glass, and dynamic scenes break the front-end. Detect degeneracy, lean on the IMU through it, and never trust a single-modality stack in an environment that can starve it. - **2D indoor and 3D outdoor are different sports.** Flat-floor AMRs want 2D lidar SLAM (slam_toolbox) + AMCL. Outdoor, uneven, or 6-DoF platforms want 3D lidar-inertial (FAST-LIO2/LIO-SAM) or VIO. Match the algorithm's assumptions to the world. - **Mapping and localization are separate runtime modes.** You usually map *once* (online or offline), freeze the map, then localize against it in production. The [Nav2](/posts/ros2-ultimate-guide/) stack expects exactly this split. ## The chicken-and-egg problem Start with the two operations a navigating robot needs and notice they each depend on the other. **Localization** is "given a map, where am I?" You compare a sensor reading — a lidar scan, a camera image — to a known map and find the pose that best explains it. This is the easier problem, and it is what runs in production once you have a map. **Mapping** is "given my poses, what does the world look like?" 
You take sensor readings from a sequence of *known* poses and fuse them into a consistent model. Also tractable — if you know the poses. The trouble is that on a fresh deployment you have neither. You do not know where you are because you have no map, and you cannot build a map because you do not know where you are. Worse, the errors are correlated: an error in your estimated pose places the landmark you just observed in the wrong spot on the map, and then that wrong landmark corrupts the *next* pose estimate. Errors feed each other. SLAM's insight is that you should not pick one to solve first. You estimate the trajectory and the map **jointly**, as one big coupled estimation problem, and you exploit the fact that the same landmark seen from multiple poses ties those poses together. Re-observing a landmark you mapped earlier is a constraint that pins down both the landmark *and* your current pose. Close a big loop — return to where you started — and a single constraint can correct hundreds of metres of accumulated drift across the whole trajectory at once. > **Rule of thumb:** if you have a trustworthy prior map and the environment is stable, you do not need SLAM — you need localization. Run SLAM to *build* the map, then switch to localization for production. Running full SLAM forever when a frozen map would do is a common and expensive mistake. That is the whole game: SLAM is the bootstrapping phase that gets you a map and a trajectory at once; localization is what you do afterward. Most of this guide is about doing the bootstrapping well, because it is the hard part. ## Problem framing: state, models, and the belief Strip away the implementation and SLAM is a Bayesian state-estimation problem. There are four objects you must define before any algorithm means anything. ### The state The **state** `x` is everything you are estimating. 
At minimum it is the robot's pose: in 2D that is `(x, y, θ)` — three numbers; in 3D it is position plus orientation, six degrees of freedom (often carried as a 7-vector with a unit quaternion, or on the `SE(3)` manifold). In full SLAM the state also includes the **map** — landmark positions, or a whole pose history, depending on the formulation. In visual-inertial systems the state grows to include velocity, accelerometer bias, and gyroscope bias, because you cannot estimate pose from an IMU without estimating its biases too. ### The motion model (prediction) The **motion model** `p(xₜ | xₜ₋₁, uₜ)` says how the state evolves given a control or proprioceptive input `uₜ` — wheel encoder ticks, an IMU sample, a commanded velocity. It is your prediction step. It is also where odometry drift is born: the model is never exact, and the uncertainty it injects grows every timestep with no observation to correct it. ### The observation model (correction) The **observation model** `p(zₜ | xₜ, map)` says what measurement `zₜ` you expect to see from a given state and map — what a lidar beam should return, where a visual feature should project. When a real measurement arrives, you compare it to the prediction and use the mismatch (the **innovation**, or **residual**) to correct the state. This is the step that fights drift. ### The belief The **belief** `bel(xₜ) = p(xₜ | z₁:ₜ, u₁:ₜ)` is the full posterior — the probability distribution over the state given everything you have ever sensed and commanded. The entire field is different ways to *represent* and *update* this belief: - A **Gaussian** (mean + covariance) → Kalman-family filters. - A **set of weighted samples (particles)** → particle filters. - A **maximum-a-posteriori point estimate from a graph of constraints** → factor-graph SLAM. ```text Recursive Bayes filter (the skeleton under everything): predict: bel⁻(xₜ) = ∫ p(xₜ | xₜ₋₁, uₜ) · bel(xₜ₋₁) dxₜ₋₁ correct: bel(xₜ) = η · p(zₜ | xₜ) · bel⁻(xₜ) η = normalizer. 
Predict grows uncertainty; correct shrinks it. ``` > **Rule of thumb:** the noise models matter as much as the algorithm. If you feed an EKF a wheel-odometry covariance that is 10× too optimistic, it will trust odometry over good lidar corrections and drift confidently into a wall. Tuning the `Q` (process) and `R` (measurement) noise is not a detail — it is the job. Everything below is a commitment to one belief representation and one way to run predict/correct cheaply enough for a real robot. ## Odometry sources and drift Odometry is dead reckoning: integrating motion to estimate pose. Every source drifts; understanding *how* each drifts tells you which to fuse and which to trust. **Wheel odometry.** Integrate encoder ticks through a kinematic model. Cheap, high-rate (often 100–1000 Hz), and smooth — but it believes the wheels. Slip, skid, uneven tire diameter, and the dreaded *kidnapped* push corrupt it instantly, and the heading error integrates into unbounded position error. On a flat floor it is excellent for the *short term* and useless for the long term. See [mobile robots](/posts/mobile-robots-amr-agv-ultimate-guide/) for the drive geometries behind it. **Inertial (IMU).** A gyro measures angular rate; an accelerometer measures specific force. Integrate the gyro once for orientation, the accelerometer twice for position. The double integration is brutal: a constant accel bias of just 0.01 m/s² grows into a position error of `½·0.01·t²` ≈ 0.5 m after 10 s and 2 m after 20 s. Orientation drifts more slowly (single integration of gyro bias), and the gravity vector gives you an absolute roll/pitch reference, but heading (yaw) drifts freely without a magnetometer or external fix. IMUs are unbeatable for the very short term and high-frequency motion; they are why VIO and LIO work. **Visual odometry (VO).** Track features (or pixel intensities) across frames and solve for the camera motion that explains the apparent motion. 
Drifts far slower than wheels or raw IMU, but accumulates **scale drift** (a single camera cannot observe absolute scale) and breaks in low texture, motion blur, and bad lighting. Fuse it with an IMU and the scale becomes observable — that is visual-inertial odometry. **Lidar odometry (LO).** Register consecutive scans (ICP/NDT) to estimate motion. Geometrically accurate and metric (lidar measures real distance), robust to lighting, but degenerate where geometry is ambiguous — a long featureless corridor, a flat field — and heavy on compute. Fuse with an IMU → lidar-inertial odometry, the basis of LIO-SAM and FAST-LIO2. ```text Why double integration is the IMU's curse: accel bias b = 0.01 m/s² (a good MEMS IMU, uncorrected) velocity error = b · t position error = ½ · b · t² t = 1 s → 0.005 m (fine) t = 10 s → 0.5 m (clipping shelves) t = 60 s → 18 m (lost) → The IMU MUST be corrected by an exteroceptive sensor. ``` > **Rule of thumb:** no odometry source is good at everything. The IMU is great at high-frequency, short-term motion and terrible at low-frequency drift; lidar/vision are the reverse. Fusing them — fast IMU prediction, slower exteroceptive correction — is why modern inertial-aided stacks dominate. SLAM then bounds even the fused drift with loop closure. ## Filtering vs optimization: the great split There are two grand strategies for maintaining the belief. The history of SLAM is largely the migration from the first to the second. ### EKF-SLAM The original. Represent the belief as one big Gaussian over `[robot pose, all landmark positions]`, and run an Extended Kalman Filter: linearize the nonlinear motion and observation models around the current estimate, predict, then correct on each landmark observation. 
```text EKF predict/update sketch (state x, covariance P): predict: x⁻ = f(x, u) # nonlinear motion model P⁻ = F·P·Fᵀ + Q # F = ∂f/∂x (Jacobian), Q = process noise update (observe landmark j): y = z − h(x⁻) # innovation (residual) S = H·P⁻·Hᵀ + R # H = ∂h/∂x, R = measurement noise K = P⁻·Hᵀ·S⁻¹ # Kalman gain x = x⁻ + K·y P = (I − K·H)·P⁻ ``` EKF-SLAM works and was the field's backbone into the 2000s, but it has a fatal scaling property: the covariance `P` is dense (every landmark becomes correlated with every other), so the update is `O(n²)` in the number of landmarks `n`. A few hundred landmarks is fine; tens of thousands is not. It also linearizes *once* per step around a possibly-wrong estimate and can never undo that linearization error, which makes it brittle on large loops. The UKF (unscented) variant avoids explicit Jacobians and handles nonlinearity better, but the `O(n²)` scaling and the single-pass linearization remain. ### Particle filters and FastSLAM A particle filter represents the belief as a cloud of weighted samples, each a hypothesis of the full state. Predict by pushing every particle through the motion model (with noise); correct by reweighting each particle by how well it explains the measurement; periodically resample to kill low-weight particles. No Gaussian assumption — it can represent multi-modal beliefs (e.g. "I'm either in room A or the identical room B"), which is exactly what global localization needs. **FastSLAM** is the clever application to mapping: it factorizes the problem so each particle carries its own map of independent EKF-tracked landmarks (Rao-Blackwellization). It scales far better than EKF-SLAM and powered **GMapping**, the classic 2D grid SLAM. The catch is *particle depletion*: on a long loop the diversity collapses, the true hypothesis gets resampled away, and the map tears. 
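The predict, reweight, resample cycle described above can be sketched for a toy 1D robot. This is a minimal illustration under stated assumptions, not GMapping or AMCL code: the helper `pf_step`, the noise values, and the sensor that reads position directly are all hypothetical, and it resamples on every step rather than periodically to keep the example short.

```python
import math
import random

def pf_step(particles, u, z, motion_sigma, meas_sigma, rng):
    """One predict / reweight / resample cycle for a 1D position belief."""
    # Predict: push every particle through the motion model, with noise.
    predicted = [p + u + rng.gauss(0.0, motion_sigma) for p in particles]
    # Correct: weight each particle by how well it explains measurement z.
    # The tiny floor keeps the weight sum positive if all likelihoods underflow.
    weights = [
        math.exp(-0.5 * ((z - p) / meas_sigma) ** 2) + 1e-300
        for p in predicted
    ]
    # Resample: draw a new set in proportion to weight (multinomial resampling).
    return rng.choices(predicted, weights=weights, k=len(predicted))

rng = random.Random(0)  # fixed seed so the run is repeatable
motion_sigma, meas_sigma = 0.05, 0.2  # assumed noise levels, in metres
true_x = 2.0
# Start with global uncertainty: particles spread over a 10 m corridor.
particles = [rng.uniform(0.0, 10.0) for _ in range(1000)]

for _ in range(10):
    true_x += 0.5                            # robot advances 0.5 m per step
    z = true_x + rng.gauss(0.0, meas_sigma)  # noisy position-like reading
    particles = pf_step(particles, 0.5, z, motion_sigma, meas_sigma, rng)

estimate = sum(particles) / len(particles)
print(f"true position {true_x:.2f} m, estimate {estimate:.2f} m")
```

The uniform initial cloud stands in for "no idea where I am". Production filters usually resample only when the effective particle count drops, which slows the depletion described above.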
For **localization against a known map**, the particle filter is still the right tool — this is **Monte-Carlo Localization (MCL)**, and **AMCL** (Adaptive MCL, the ROS standard) is its production form. It handles the multi-modal "where am I globally?" question and the kidnapped-robot recovery that a Gaussian filter cannot. AMCL is mapping's retired cousin: great for localization, not for building the map. ### Factor-graph / pose-graph SLAM The modern default. Do not maintain a running filtered estimate at all. Instead, accumulate every measurement as a **constraint (factor)** in a graph whose **nodes** are the things you want to estimate (poses, landmarks) and whose **edges** are the constraints between them (odometry between consecutive poses, a loop closure between distant poses, a landmark observation). Then solve for the configuration of all nodes that minimizes the total weighted residual — a big nonlinear least-squares problem. ```text Pose-graph optimization (the cost being minimized): X* = argmin_X Σ_ij rᵢⱼ(xᵢ, xⱼ)ᵀ · Ωᵢⱼ · rᵢⱼ(xᵢ, xⱼ) rᵢⱼ = error of edge (i,j): residual between the MEASURED relative transform zᵢⱼ and the one PREDICTED by current poses xᵢ, xⱼ Ωᵢⱼ = information matrix (inverse covariance) — how much to trust edge ij Solved by Gauss-Newton / Levenberg-Marquardt over the manifold SE(2)/SE(3). The Jacobian is SPARSE → exploit it (Cholesky) → scales to 10⁵+ nodes. ``` Why it won: it **relinearizes** every iteration (so it recovers from bad initial guesses where the EKF cannot), it is **sparse** (an odometry-and-loop graph is nowhere near fully connected, so factorization is fast), and it estimates the *whole trajectory* so a single loop closure corrects everything at once. The backends are mature and battle-tested: **GTSAM** (factor graphs, incremental solving via iSAM2), **g2o** (the classic general graph optimizer), and **Ceres** (Google's general nonlinear least-squares, used by Cartographer and VINS). 
iSAM2's incremental update is what makes graph SLAM real-time: it only re-solves the part of the graph a new factor actually touches.

| Property | EKF-SLAM | Particle filter / FastSLAM | Factor-graph / pose-graph SLAM |
|---|---|---|---|
| Belief representation | Single Gaussian (mean + cov) | Weighted samples (particles) | MAP point estimate from a graph |
| Multi-modal? | No | Yes (its main strength) | No (single estimate) |
| Scaling in landmarks/poses | `O(n²)` (dense covariance) | `O(particles × map)`; depletion risk | Sparse, `O(n)`-ish; `10⁵+` nodes |
| Linearization | Once per step, never undone | N/A (sampling) | Relinearized every iteration |
| Loop closure handling | Poor on large loops | Causes depletion | Excellent — corrects whole trajectory |
| Recovers from bad init | Weakly | Yes (resampling) | Yes (re-optimization) |
| Best modern use | Sensor fusion (small state) | **Known-map localization (AMCL)** | **Default for building maps** |
| Real systems | `robot_localization` EKF | GMapping, AMCL/MCL | Cartographer, slam_toolbox, LIO-SAM, ORB-SLAM3, VINS |

> **Rule of thumb:** in 2026, build maps with a factor graph, fuse fast proprioceptive sensors with an EKF, and localize against a known map with a particle filter. Using an EKF to build a large landmark map, or a particle filter to map a whole building, is fighting the tooling.

## Front-end vs back-end

Every serious SLAM system has two halves, and confusing them is how teams misdiagnose problems.

**The front-end** is perception and data association. It turns raw sensor data into constraints: it extracts features or keypoints, matches them across frames, runs scan matching to estimate relative motion, and — critically — decides *which* measurements correspond to *which* landmarks (data association) and *whether* the current view matches a past one (loop-closure detection).
The front-end is sensor-specific (a lidar front-end and a camera front-end share almost no code) and it is where the hard, brittle decisions live. **The back-end** is the optimizer. It takes the constraints the front-end produced and finds the trajectory and map that best satisfy them — the factor-graph optimization above, or the filter update. The back-end is mostly sensor-agnostic linear algebra; GTSAM does not care whether an edge came from a lidar or a camera. The reason this split matters operationally: > **Rule of thumb:** the back-end is rarely your problem. Almost every SLAM failure in the field is a front-end failure — a bad scan match in a degenerate corridor, a wrong data association, or a *false loop closure*. A single false loop closure is catastrophic: it tells the optimizer two genuinely-distant places are the same, and the back-end faithfully folds your map in half. This is why robust back-ends added **outlier-rejection** machinery: switchable constraints, dynamic covariance scaling, graduated non-convexity (GNC), and Cauchy/Huber robust kernels that let the optimizer down-weight a constraint that disagrees violently with everything else. They are insurance against the front-end's worst mistakes. But the right primary defense is a front-end that does not generate garbage: good features, geometric verification of loop candidates (RANSAC on the matched points), and consistency checks before a loop closure is allowed into the graph. ## Scan matching and lidar SLAM stacks Lidar SLAM starts from one operation: given two point clouds, find the rigid transform that aligns them. That is **scan matching**, and it is the lidar front-end's core. ### ICP and NDT **Iterative Closest Point (ICP)** alternates two steps until convergence: (1) for each point in scan B, find the closest point in scan A; (2) solve for the transform that minimizes the summed distances; repeat. 
**Point-to-point** ICP minimizes point distances; **point-to-plane** ICP minimizes the distance from each point to the local surface tangent of its match, which converges faster and is the practical default for structured environments. ICP is accurate when the initial guess is good (feed it the IMU or odometry prior) and fragile when it is not — it falls into local minima and needs a decent prior to seed it. **Normal Distributions Transform (NDT)** takes a different tack: voxelize the reference cloud and model each voxel as a Gaussian, then align the new scan by maximizing the likelihood of its points under that field of Gaussians. NDT is smoother (it optimizes a continuous, differentiable cost rather than discrete correspondences), often more robust to a poor initial guess, and a common choice for outdoor automotive lidar registration. ### The 2D stacks **slam_toolbox** is the 2D lidar SLAM default in [ROS 2](/posts/ros2-ultimate-guide/) today. It is pose-graph SLAM: scan matching for odometry, a graph back-end (Ceres) for optimization, and a scan-matching loop-closure detector. Crucially it supports **lifelong mapping** — load a saved graph, keep mapping, and serialize the pose graph so you can re-localize and continue later. For a flat-floor indoor AMR it is the safe, well-supported choice, and it cleanly hands off to AMCL for production localization. **Cartographer** (originally Google) is the other heavyweight, available in 2D and 3D. Its architecture is distinctive: the front-end builds small **submaps** (each a little local occupancy grid) by scan-matching incoming scans into the current submap; the back-end runs **branch-and-bound** scan matching to detect loop closures against all finished submaps, then optimizes a sparse pose graph (Ceres) over submap and scan poses. The submap design makes loop closure efficient and the maps crisp. It is heavier to tune than slam_toolbox but produces excellent results, and it handles 3D backpack/handheld mapping well. 
### The 3D inertial stacks

For 3D, fast, or 6-DoF platforms, the modern systems couple the lidar with the IMU tightly.

**LIO-SAM** (lidar-inertial odometry via smoothing and mapping) is a factor-graph system built on GTSAM. It pre-integrates IMU measurements between lidar keyframes for a strong motion prior, extracts edge and planar features (LOAM-style), scan-matches against a local map, and adds IMU pre-integration factors, lidar odometry factors, optional GPS factors, and loop-closure factors to the graph. It is accurate and a strong outdoor/ground-vehicle choice, and the GPS factor makes geo-referenced mapping straightforward.

**FAST-LIO2** is the efficiency benchmark. It is a tightly-coupled iterated EKF (not a graph) that — and this is the key idea — registers *raw* points directly against the map with no feature extraction, using an incremental k-d tree (**ikd-Tree**) to keep the map queryable in real time. The math (a clever Kalman gain formulation) makes the EKF update cost scale with the state dimension rather than the measurement dimension (a scan contributes thousands of point residuals, while the state stays small and fixed), so it runs at high rate on modest compute, even on a small embedded CPU. It is odometry-grade (no built-in large-loop closure), so people pair it with a separate loop-closure/pose-graph layer when they need a globally consistent map. If you need real-time 3D state estimation on a drone or quadruped with limited compute, FAST-LIO2 is the one to beat. See [legged robots](/posts/legged-quadruped-robot-hardware-ultimate-guide/) for those platforms.
| System | Dim | Approach | IMU coupling | Loop closure | Backend | Best for | |---|---|---|---|---|---|---| | **slam_toolbox** | 2D | Pose-graph, scan match | None (uses odom) | Yes (scan match) | Ceres | Indoor flat-floor AMRs; lifelong mapping | | **Cartographer** | 2D/3D | Submaps + branch-and-bound | Optional | Yes (vs submaps) | Ceres | Crisp maps, handheld/backpack 3D | | **LIO-SAM** | 3D | Feature LIO, factor graph | Tight (pre-integration) | Yes | GTSAM | Outdoor ground vehicles, geo-referenced | | **FAST-LIO2** | 3D | Direct LIO, iterated EKF | Tight | No (add a layer) | iEKF + ikd-Tree | Real-time on light compute (drones, legged) | > **Rule of thumb:** for a flat indoor robot, slam_toolbox. For 3D on real compute with loops you care about, LIO-SAM. For 3D on a weight/compute budget, FAST-LIO2 plus a separate loop-closure layer. Cartographer when you want the cleanest maps and will pay the tuning cost. ## Visual SLAM and visual-inertial odometry Cameras are cheap, light, low-power, and information-dense — and that is exactly why visual SLAM is harder than lidar SLAM. A camera gives you bearing but not range (a monocular camera cannot see scale at all), it dies in the dark and in low texture, and motion blur destroys it. The payoff is rich data for loop closure and a sensor that costs and weighs almost nothing. See [machine vision](/posts/machine-vision-ultimate-guide/) for the imaging fundamentals. ### Feature-based vs direct **Feature-based** methods detect repeatable keypoints (ORB, SIFT-like), describe them, match them across frames, and optimize camera poses and 3D point positions to minimize **reprojection error** (the pixel distance between where a 3D point projects and where it was observed). They throw away most of the image and keep a sparse set of robust points. Fast, mature, and good at loop closure (the descriptors double as a place-recognition vocabulary). 
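Reprojection error is easier to reason about with numbers in hand. The sketch below assumes an ideal pinhole camera with no lens distortion and a 3D point already expressed in the camera frame (Z forward); the intrinsics `fx`, `fy`, `cx`, `cy` and both function names are illustrative, not any particular library's API.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (metres) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(point_cam, observed_px, fx, fy, cx, cy):
    """Pixel distance between the predicted projection and the observed keypoint."""
    u, v = project(point_cam, fx, fy, cx, cy)
    return math.hypot(u - observed_px[0], v - observed_px[1])

# A point 2 m ahead, seen by a camera with 500 px focal length.
err = reprojection_error((0.1, -0.05, 2.0), (346.5, 229.5),
                         fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A feature-based back-end sums the squares of this residual (usually under a robust kernel) over every point observed in every keyframe, then adjusts camera poses and point positions to drive the total down.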
**Direct** methods (LSD-SLAM, DSO) skip features entirely and optimize **photometric error** — the raw intensity difference — over (semi-)dense pixels. They use more of the image, handle low-texture scenes where features are sparse, and produce denser maps, but they are sensitive to brightness changes, rolling shutter, and need good photometric calibration. In production, feature-based has been the more robust workhorse; direct is excellent where it fits. **ORB-SLAM3** is the reference feature-based system, and it is genuinely good: monocular, stereo, and RGB-D; with or without an IMU (it is a full visual-inertial system too); a multi-map system (**Atlas**) that can lose tracking, start a new map, and later merge maps when it recognizes a connection; and DBoW2 bag-of-words loop closure and relocalization. If you want to understand visual SLAM, read ORB-SLAM3. ### Visual-inertial odometry (VIO) A monocular camera cannot observe scale or absolute roll/pitch; an IMU can (gravity gives roll/pitch, acceleration gives metric scale). Fuse them and you get **VIO** — metric, gravity-aligned, robust to the brief moments the camera fails (blur, a passing truck). VIO is the workhorse for drones, AR/VR headsets, and weight-constrained robots. The central design choice is coupling: - **Loose coupling** runs the visual estimator and the IMU estimator separately and fuses their *outputs* (e.g. a VO pose into an EKF that also integrates IMU). Simpler, modular, but it throws away cross-information and is less robust. - **Tight coupling** puts raw IMU pre-integration and visual feature measurements into *one* estimator (one factor graph or one filter) and solves jointly. More accurate, better at recovering scale and biases, more robust to degeneracy — and the clear winner for serious systems. 
**VINS-Fusion** (HKUST) is the standard tightly-coupled, optimization-based VIO — monocular-inertial, stereo, stereo-inertial; sliding-window nonlinear optimization (Ceres) with IMU pre-integration, plus a separate pose-graph loop-closure module (DBoW2). It is the system most VIO work is compared against. **OpenVINS** is the standard tightly-coupled, *filter*-based VIO — a Multi-State Constraint Kalman Filter (MSCKF). It is lighter than full optimization, extremely well-documented, and a favorite research and embedded baseline. The MSCKF trick is to keep a sliding window of past poses in the state and marginalize features cleverly, getting most of optimization's accuracy at filter cost. **RTAB-Map** is the pragmatic, batteries-included option: an RGB-D / stereo graph-SLAM system with strong appearance-based loop closure and built-in memory management (it pages old parts of the map out of working memory to stay real-time on big maps). It is less a research baseline and more the thing you reach for when you have an RGB-D camera and want a dense map and ROS integration without assembling a stack yourself. 
| System | Type | Sensors | Coupling | Loop closure | Notes |
|---|---|---|---|---|---|
| **ORB-SLAM3** | Feature, optimization | Mono/stereo/RGB-D (+IMU) | Tight (VI mode) | DBoW2 + multi-map merge | The reference visual SLAM |
| **VINS-Fusion** | Feature, optimization | Mono/stereo (+IMU) | Tight | DBoW2 (separate module) | The VIO optimization standard |
| **OpenVINS** | Feature, filter (MSCKF) | Mono/stereo + IMU | Tight | Limited (it's odometry) | Light, well-documented VIO baseline |
| **RTAB-Map** | Feature, graph | RGB-D / stereo (+lidar) | Loose-ish | Appearance-based, strong | Batteries-included, dense maps, memory mgmt |
| **DSO / LSD-SLAM** | Direct | Mono | — | LSD: yes; DSO: no | Dense-ish, low-texture-tolerant, calib-sensitive |

| Aspect | Lidar SLAM | Visual / VI SLAM |
|---|---|---|
| Range info | Direct, metric | Bearing only (mono); metric with stereo/RGB-D/IMU |
| Lighting | Indifferent (active) | Fails in dark / strong lighting changes |
| Texture dependence | Needs geometry, not texture | Needs texture, not geometry |
| Degenerate case | Featureless corridor, open field | Blank wall, low light, motion blur |
| Loop closure | Geometric (scan/submap match) | Appearance (bag-of-words) — very strong |
| Cost / weight / power | Higher (esp. 3D lidar) | Low (camera + IMU is cheap and light) |
| Map richness | Geometry, sparse semantics | Dense texture, semantics-friendly |
| Typical platform | AMRs, AGVs, AVs, large robots | Drones, AR/VR, humanoids, cost-sensitive |

> **Rule of thumb:** if you can afford the lidar and weight, lidar-inertial is the more robust map-builder. If weight, cost, or power rule out lidar — drones, headsets, consumer robots — go visual-inertial and couple the IMU tightly. The best fielded systems on big robots fuse both, so each covers the other's degenerate case. See [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/) for the fusion argument.
## Loop closure and place recognition This is the single feature that separates SLAM from "odometry that draws a map." Without it, your trajectory drifts steadily and the map smears; you can drive a perfect square and have the start and end points 3 m apart. **Loop closure** recognizes that you have returned to a previously-visited place and adds a constraint tying the current pose to the old one. The back-end then redistributes the accumulated error across the whole loop, snapping the map into consistency. The hard part is *recognizing the place* — **place recognition** — fast and without false positives. **Bag-of-words (visual).** The classic approach (DBoW2, used by ORB-SLAM3 and VINS) quantizes feature descriptors into a precomputed visual "vocabulary," so each image becomes a sparse histogram of visual words. Comparing two images is then a fast vector comparison, and you can index thousands of past keyframes and query them in milliseconds. A candidate match is then **geometrically verified** (match the actual features, run RANSAC, require enough consistent inliers) before it is allowed to become a loop-closure constraint. The verification step is non-negotiable — bag-of-words alone produces perceptual-aliasing false matches. **Geometric loop detection (lidar).** Lidar stacks detect loops by scan/submap matching against past poses (Cartographer's branch-and-bound, LIO-SAM's radius search + ICP) or with global descriptors like **Scan Context** that summarize a 3D scan into a rotation-invariant signature for fast candidate retrieval. Same pattern: cheap candidate retrieval, then expensive geometric verification. > **Rule of thumb:** be conservative with loop closures. A missed loop closure costs you some drift you can fix on the next pass; a *false* loop closure corrupts the entire map irreversibly. Require strong geometric verification and use robust back-end kernels (GNC, switchable constraints) as a second line of defense. 
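The two-stage pattern (cheap retrieval, then expensive verification) fits in a few lines. This is a sketch under loud assumptions: cosine similarity stands in for DBoW2's own TF-IDF scoring, histograms are plain dicts of word id to weight, and `count_inliers` is a hypothetical callback representing whatever feature matching plus RANSAC your front-end runs.

```python
import math

def cosine_similarity(hist_a, hist_b):
    """Similarity of two sparse visual-word histograms (dict: word id -> weight)."""
    dot = sum(w * hist_b.get(k, 0.0) for k, w in hist_a.items())
    na = math.sqrt(sum(w * w for w in hist_a.values()))
    nb = math.sqrt(sum(w * w for w in hist_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_candidates(query, keyframes, min_score=0.6, top_k=3):
    """Cheap stage: rank past keyframes by appearance similarity to the query."""
    scored = [(cosine_similarity(query, hist), kf_id)
              for kf_id, hist in keyframes.items()]
    scored = [(score, kf_id) for score, kf_id in scored if score >= min_score]
    return [kf_id for _, kf_id in sorted(scored, reverse=True)[:top_k]]

def accept_loops(candidates, count_inliers, min_inliers=30):
    """Expensive stage: keep only candidates that pass geometric verification.

    count_inliers(kf_id) is a hypothetical hook returning the number of
    RANSAC-consistent feature matches between the query and that keyframe.
    """
    return [kf_id for kf_id in candidates if count_inliers(kf_id) >= min_inliers]
```

Note what the gate is allowed to do: the keyframe with the best appearance score can still be thrown out if too few matches survive RANSAC. That is the behaviour you want in an aliased warehouse aisle, and raising `min_inliers` is the conservative direction.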
The deep enemy of place recognition is **perceptual aliasing** — different places that look identical (every aisle in a warehouse, every floor of a parking garage, a row of identical office doors). This is exactly where appearance-based recognition produces confident false matches, and it is why symmetric and repetitive environments are so hard. Learned global descriptors (NetVLAD and successors) are more robust to viewpoint and lighting change than classic bag-of-words and are increasingly common in 2026 stacks, but they do not eliminate aliasing — geometric verification still has the final say. ## Map representations The map is not an afterthought; its representation determines what your planner can do and what your robot can afford to store. Choose it for the *consumer* — the planner, the localizer, the human — not for the sensor. **Occupancy grid (2D / 3D).** Discretize space into cells, each holding the probability it is occupied. The standard for 2D navigation and the input AMCL and most 2D planners expect. Simple, supports ray-casting for localization, and directly answers "is this cell free?" The cost is memory, and in 3D it explodes. **OctoMap** mitigates the 3D cost with an octree that stores free/occupied space at adaptive resolution (large empty regions collapse to one node). **Point cloud.** The raw-ish output of lidar/depth SLAM — a set of 3D points, optionally with intensity or color. Dense, accurate, great for 3D registration and visualization, but unstructured (no explicit free space, no connectivity) and heavy. Most 3D lidar SLAM maps are point clouds; you down-sample (voxel grid) hard before storage. **Mesh / surfel.** Reconstruct surfaces as triangles or oriented disks (surfels). Compact for surfaces, great for rendering, manipulation, and human consumption — and the natural output of dense RGB-D fusion (TSDF-based methods). More processing to build and maintain. 
**Topological / semantic.** A graph of places and connections ("kitchen → hallway → lab") rather than metric geometry. Tiny, robust to metric error, ideal for high-level task planning and very large environments — but you cannot servo to a millimetre with it. The strong systems are **hybrid**: metric maps locally, topological structure globally.

```text
Occupancy-grid memory math (why 3D hurts):

2D grid, 5 cm resolution, 100 m × 100 m:
  cells = (100/0.05)² = 2000 × 2000 = 4,000,000
  @ 1 byte/cell (8-bit log-odds)          ≈ 4 MB      # trivial

3D dense grid, 5 cm resolution, 100 m × 100 m × 10 m:
  cells = 2000 × 2000 × 200 = 800,000,000
  @ 1 byte/cell                           ≈ 800 MB    # painful

3D dense grid, 5 cm resolution, 200 m × 200 m × 20 m site:
  cells = 4000 × 4000 × 400 = 6,400,000,000
  @ 1 byte/cell                           ≈ 6.4 GB    # unworkable dense

→ 3D wants an octree (OctoMap): empty space collapses, so the real
  cost scales with SURFACE area, not volume — often 10–100× smaller.
```

| Representation | Memory | Free space? | Planner fit | Best for |
|---|---|---|---|---|
| 2D occupancy grid | Low (MBs) | Explicit | Excellent (2D) | Flat-floor indoor navigation, AMCL |
| 3D occupancy / OctoMap | Medium (octree) | Explicit | Good (3D) | 3D collision checking, aerial/legged |
| Point cloud | High | No | Poor directly | Registration, 3D viz, source for other maps |
| Mesh / surfel | Medium (surface) | Surface only | Manipulation/render | Dense reconstruction, AR, grasping |
| Topological / semantic | Tiny | Abstract | High-level only | Task planning, very large environments |

> **Rule of thumb:** localize against a compact map (2D grid, sparse landmarks), plan against an occupancy map, and keep the dense point cloud only if a downstream consumer (manipulation, inspection, reconstruction) actually needs it. Carrying a full dense cloud around just to navigate a flat floor is wasted memory and CPU.

## The sensor and compute budget

SLAM is a real-time system competing for the same CPU as perception, planning, and control. The budget is real, and it is where elegant algorithms meet shipping deadlines.
**Sensors set the ceiling.** No algorithm recovers information the sensors did not capture. A good IMU (low bias instability, e.g. an industrial-grade MEMS at a few deg/hr) is worth more to a VIO/LIO system than a fancier optimizer on a cheap IMU. A 3D lidar at 1.3–2.6 M points/s, a global-shutter camera (rolling shutter wrecks VIO unless modeled), and **time-synchronized, calibrated** sensors are the foundation. The two most common silent killers: an uncalibrated camera-IMU **extrinsic** (the rigid transform between them) and **unsynchronized timestamps**. A few-millisecond timing offset between camera and IMU degrades VIO badly; SLAM systems include online time-offset estimation precisely because this is so common. > **Rule of thumb:** spend the calibration effort before you blame the algorithm. Intrinsics, extrinsics, and time synchronization account for a large fraction of "this SLAM system is bad" reports. Kalibr-style calibration for VIO and a careful extrinsic for LIO are not optional. **Compute splits front-end and back-end.** The front-end (feature extraction, scan matching) runs every frame and must keep up with the sensor rate; the back-end (graph optimization, loop closure) can run slower and asynchronously. This is why systems separate them onto different threads — the odometry stays real-time while the optimizer catches up in the background. FAST-LIO2 exists largely because that front-end loop must fit on small compute; ORB-SLAM3 and VINS run the heavy optimization in a back thread so tracking never stalls. **Rough numbers (2026, order-of-magnitude, platform-dependent):** - 2D lidar SLAM (slam_toolbox): comfortable on a modern quad-core ARM/x86; modest RAM. - VIO (OpenVINS/VINS): real-time on an embedded x86 or a Jetson-class board; tight but feasible. - 3D LIO (FAST-LIO2): designed to run on a single modern CPU core at lidar rate; LIO-SAM wants more for the graph. - Dense reconstruction (TSDF/mesh): wants a GPU. 
See [real-time control](/posts/real-time-control-systems-ultimate-guide/) for how SLAM coexists with the deterministic loops it must not starve, and [robot sensors](/posts/robot-sensors-ultimate-guide/) for the upstream sensing. ## Degeneracy and failure cases Knowing how SLAM breaks is more useful than knowing how it works, because the breakage is where your robot ends up against a wall. **Featureless / geometrically degenerate environments.** A long, straight, featureless corridor is the textbook lidar killer: scans constrain your lateral position and heading but say *nothing* about how far you have travelled along it — the problem is **under-constrained** in one direction. The scan matcher slides freely and reports false confidence. Open fields, tunnels, and large flat walls do the same. The defense is to detect degeneracy (monitor the conditioning of the optimization — small eigenvalues of the information matrix flag an unobservable direction) and lean on the IMU/wheel odometry through it. **Textureless / low-light scenes (visual).** Blank walls, white-out fog, darkness, and uniform surfaces starve a feature tracker. Direct methods help a little; an IMU helps a lot (it coasts through brief outages); but a camera-only system in a dark featureless space is simply blind. **Dynamic scenes.** SLAM's core assumption is a *static* world. People, forklifts, other robots, and opened doors violate it. Features tracked on a moving object pull your pose estimate with them, and moving objects get baked into the map as phantom obstacles. Defenses: detect and reject dynamic objects (semantic segmentation, RANSAC outlier rejection treating movers as outliers), use short map memory so transients fade, and weight the static structure. A busy warehouse aisle at shift change is a genuinely hard case. **Perceptual aliasing.** Covered above — repetitive environments fool place recognition into false loop closures. 
The most dangerous failure because it corrupts the *whole* map, not just the current pose. **The kidnapped-robot problem.** The robot is picked up and moved (or the localizer simply loses track). A filter that has converged to a tight Gaussian around the wrong pose cannot recover — it is too confident. This is precisely why AMCL is a *particle* filter with injected random particles and adaptive sampling: it keeps enough hypothesis diversity to re-converge when the world contradicts it. Pure dead-reckoning has no recovery at all. **Glass and mirrors.** Lidar passes through glass (no return, or a return from beyond it) and sees a mirror as a tunnel into a false room; cameras see reflections as real geometry. Both corrupt the map. Mark known glass, or fuse a sensor that sees it (some radar, ultrasonic). > **Rule of thumb:** never deploy a single-modality SLAM stack in an environment that can starve that modality. The cheapest robustness upgrade is almost always a well-calibrated IMU tightly coupled to your primary sensor — it carries you through the brief degeneracies that would otherwise lose the pose. ## 2D vs 3D, indoor vs outdoor The right stack depends on the dimensionality of the world your robot actually lives in. **2D, indoor, flat floor.** An AMR on a warehouse or hospital floor moves in `(x, y, θ)`. A 2D lidar at sensor height plus a 2D occupancy grid is the mature, cheap, robust answer: slam_toolbox to build the map, AMCL to localize against it in production. Do not pay for 3D you do not use. The one caveat: a 2D lidar at a fixed height is blind to overhangs and low obstacles — pair it with a depth camera for obstacle avoidance even if SLAM stays 2D. This is the bread-and-butter case for most of the robots in the [AMR/AGV guide](/posts/mobile-robots-amr-agv-ultimate-guide/). **3D, outdoor or uneven.** The moment the robot pitches and rolls — outdoor terrain, ramps, stairs, drones, legged platforms — you need full 6-DoF state, a 3D lidar or VIO, and an IMU. 
The ground is not a plane, gravity is not always "down" in the body frame, and a 2D assumption produces nonsense. FAST-LIO2 / LIO-SAM for lidar platforms, VINS/OpenVINS for visual ones. **Outdoor adds GPS/GNSS.** Outdoors you usually have a global fix (GNSS, RTK for centimetre accuracy), which changes the problem: you no longer need loop closure to bound global drift because GPS provides absolute position directly. The modern pattern is to fuse GNSS as a factor in the graph (LIO-SAM's GPS factor) — local lidar/visual SLAM for smooth, high-rate, locally-consistent motion; GNSS for the global anchor that kills long-term drift. Indoors you have no such anchor, which is exactly why indoor SLAM leans so hard on loop closure. > **Rule of thumb:** match the algorithm's dimensional assumptions to the physical world. A 2D stack on a flat floor is a feature (simple, robust, cheap); a 2D stack on a robot that pitches is a bug. And if you have GNSS, use it — an absolute anchor is worth more than the cleverest loop-closure detector. ## Selecting a stack and the Nav2 tie-in Put it together as a decision procedure rather than a popularity contest. **1. Do you even need SLAM, or just localization?** If the environment is stable and you can map it once, map it once (online or from a recorded bag), freeze the map, and run *localization* in production. This is the common production architecture and it is far more robust than mapping forever. **2. 2D or 3D?** Flat floor, planar motion → 2D. Pitch/roll, terrain, flight, stairs → 3D and an IMU. **3. What's your primary exteroceptive sensor and budget?** - 2D lidar, indoor, cost-conscious → **slam_toolbox** (build) + **AMCL** (localize). - 3D lidar, real compute, want loops/geo-reference → **LIO-SAM**. - 3D lidar, tight compute/weight (drone, legged) → **FAST-LIO2** (+ a loop-closure layer). - Camera + IMU, weight/cost dominate → **VINS-Fusion** or **OpenVINS**; **ORB-SLAM3** if you want maps + relocalization. 
- RGB-D, want a dense map with minimal assembly → **RTAB-Map**. - Cleanest 2D/3D maps, willing to tune → **Cartographer**. **4. Always add the IMU.** Across every modern stack, tightly coupling a calibrated IMU is the highest-ROI robustness improvement. Budget for the calibration. ### The Nav2 / ROS 2 tie-in In a [ROS 2](/posts/ros2-ultimate-guide/) navigation system the pieces have clean, standardized seams, and SLAM slots into a well-defined place: - **Mapping mode:** run slam_toolbox (or Cartographer) → it publishes the `map → odom` transform and a `nav_msgs/OccupancyGrid`, and your wheel/inertial odometry publishes `odom → base_link` via an EKF (`robot_localization`). Save the map. - **Localization mode:** load the saved map, run **AMCL** → it corrects the `map → odom` transform by matching live scans to the frozen map. Your odometry source still provides the smooth `odom → base_link`. - **The TF tree is the contract.** `map → odom → base_link → sensors`. SLAM/AMCL own `map → odom` (the drift correction); odometry owns `odom → base_link` (smooth, high-rate, drifting); the URDF owns the rest. Nav2's costmaps, planners, and controllers consume the result. See the [ROS 2 guide](/posts/ros2-ultimate-guide/) for the TF and the [motion planning guide](/posts/motion-planning-kinematics-ultimate-guide/) for what the planner does with the map. > **The honest bottom line:** SLAM in 2026 is a solved-enough problem that you should almost never write your own. Pick the stack that matches your dimensionality, sensor, and compute; couple the IMU tightly; spend the calibration and synchronization effort up front; map once and localize in production; and treat loop closure as something to do carefully, not aggressively. Do that and you will spend your engineering on your robot's actual job, not on rediscovering why the corridor ate your pose. 
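The TF contract from the Nav2 tie-in is worth seeing as arithmetic, because it explains why the localizer publishes `map → odom` rather than `map → base_link`. The sketch below works in SE(2) with poses as `(x, y, yaw)` tuples; `compose`, `inverse`, and `wrap` are hypothetical helpers for illustration, not the tf2 API.

```python
import math

def wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def compose(a, b):
    """Chain two planar transforms: parent_T_mid (a) followed by mid_T_child (b)."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, wrap(ath + bth))

def inverse(a):
    """Invert a planar transform: (R, t) -> (R^T, -R^T t)."""
    x, y, th = a
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), -(-s * x + c * y), wrap(-th))

# Odometry says the robot drove 2 m straight ahead (smooth, but drifting).
odom_T_base = (2.0, 0.0, 0.0)
# Scan matching against the map says it is actually here.
map_T_base = (2.1, 0.2, 0.05)

# The localizer publishes only the correction, so odometry stays continuous.
map_T_odom = compose(map_T_base, inverse(odom_T_base))

# Any consumer recovers the map-frame pose by walking the tree.
recovered = compose(map_T_odom, odom_T_base)
```

The design choice this illustrates: `odom → base_link` never jumps, so controllers that close loops on it stay smooth, while every discontinuous correction from SLAM or AMCL is absorbed into `map → odom`.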
## Frequently asked questions **What is the difference between SLAM and localization?** Localization assumes you already have a map and answers "where am I in it?" SLAM builds the map and estimates your trajectory *at the same time*, with no prior map. In practice you run SLAM once to build the map, freeze it, then run localization (e.g. AMCL) in production. **Is SLAM a solved problem?** For common cases — indoor flat-floor 2D, well-lit visual-inertial, 3D lidar in feature-rich environments — yes, with mature open-source stacks. It is *not* solved for long-term operation in highly dynamic, changing, perceptually-aliased, or sensor-degenerate environments. Lifelong SLAM and robustness to change are still active problems. **EKF-SLAM, particle filter, or graph SLAM — which should I use?** For building maps, graph (factor-graph/pose-graph) SLAM is the modern default. Use a particle filter for localizing against a *known* map (AMCL/MCL), where its multi-modal belief handles global localization and the kidnapped-robot problem. Use an EKF/UKF for fusing fast proprioceptive sensors (wheel + IMU + GPS) into smooth odometry, not for mapping. **Why does loop closure matter so much?** Without it, every SLAM system is just odometry that draws a map, and odometry drifts without bound — you can return to your start and be metres off. Loop closure recognizes a revisited place and adds a constraint that lets the optimizer redistribute accumulated error across the whole loop, snapping the map into global consistency. **Do I need a lidar, or is a camera enough?** A camera plus a well-calibrated IMU (visual-inertial) is enough for many robots and is far cheaper and lighter — it is the standard for drones, headsets, and cost-sensitive platforms. Lidar is more robust (active, metric, lighting-indifferent) and better in low texture, at the cost of price, weight, and power. Big robots that can afford both fuse them. 
**What is the difference between front-end and back-end?** The front-end turns raw sensor data into constraints (feature tracking, scan matching, data association, loop detection) and is sensor-specific. The back-end optimizes the graph of those constraints (GTSAM, g2o, Ceres) and is mostly sensor-agnostic. Most real-world SLAM failures are front-end failures, especially false loop closures. **Why does my robot's pose drift even with SLAM running?** Between loop closures, the back-end can only do as well as the odometry constraints, so some drift is expected on the open trajectory. Persistent or large drift usually means a degenerate environment (featureless corridor), a bad sensor calibration/extrinsic, an uncalibrated or noisy IMU, or no loop closures being detected. Check calibration and synchronization first. **What is the kidnapped-robot problem?** The robot is moved without odometry registering it, or the localizer otherwise loses track. A converged Gaussian filter is too confident to recover. AMCL is a particle filter specifically because injecting random particles and adaptive sampling let it re-converge when sensor data contradicts its current belief — that hypothesis diversity is the recovery mechanism. **Why is a featureless corridor so hard for lidar SLAM?** The problem becomes under-constrained: scans pin down your lateral position and heading but provide no information about distance travelled *along* the corridor, so the scan matcher slides freely with false confidence. Detect the degeneracy (small eigenvalues in the information matrix) and rely on IMU/wheel odometry through it. **How much memory does a SLAM map need?** A 2D occupancy grid is cheap — a 100×100 m area at 5 cm is about 4 MB. A dense 3D voxel grid explodes — the same footprint with 10 m of height is hundreds of MB, and a large site is unworkable dense, which is why 3D uses octrees (OctoMap) that collapse empty space and scale with surface area, not volume. 
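If you want to run the dense-grid numbers for your own site, the arithmetic is a one-liner per axis. This is a minimal sketch assuming one byte per cell and extents that are whole multiples of the resolution; `dense_grid_bytes` is a hypothetical helper, and it says nothing about octree savings, which depend on how much surface your environment has.

```python
def dense_grid_bytes(extent_m, resolution_m, bytes_per_cell=1):
    """Memory for a dense occupancy grid covering `extent_m` (2 or 3 lengths in metres)."""
    cells = 1
    for length in extent_m:
        # round() guards against float noise such as 100 / 0.05 landing just under 2000.
        cells *= round(length / resolution_m)
    return cells * bytes_per_cell

flat_floor = dense_grid_bytes((100, 100), 0.05)        # 2D, 5 cm cells
with_height = dense_grid_bytes((100, 100, 10), 0.05)   # same footprint, 10 m tall
```

Halving the resolution multiplies a 2D grid by 4 and a 3D grid by 8, which is usually the cheaper knob to turn before reaching for a different representation.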
**Loose vs tight coupling in visual-inertial systems?** Loose coupling fuses the *outputs* of separate visual and inertial estimators (simpler, less accurate). Tight coupling puts raw IMU and visual measurements into one estimator and solves jointly (more accurate, better scale/bias observability, more robust to brief outages). Serious VIO systems — VINS-Fusion, OpenVINS, ORB-SLAM3's VI mode — are all tightly coupled. **How does SLAM fit into ROS 2 and Nav2?** SLAM (slam_toolbox/Cartographer) publishes the map and owns the `map → odom` transform during mapping; AMCL owns it during localization against a saved map. Your odometry (an EKF over wheel + IMU) owns `odom → base_link`. Nav2's costmaps, planners, and controllers consume the resulting map and TF tree. The standardized TF contract is what lets these pieces swap cleanly. ## Changelog - **2026-06-12** — Initial publication. --- # Robot Gearboxes: Harmonic & Cycloidal Drives — The Ultimate Guide URL: https://blog.robo2u.com/posts/gearboxes-harmonic-cycloidal-ultimate-guide/ Published: 2026-06-11 Updated: 2026-06-20 Tags: gearboxes, harmonic-drive, strain-wave, cycloidal-drive, planetary-gearbox, backlash, gear-reduction, robotics-hardware, guide Reading time: 36 min > A working engineer's guide to robot gear reduction: harmonic (strain-wave), cycloidal RV, and planetary drives compared on ratio, backlash, stiffness, efficiency, shock tolerance, and where each one actually belongs in a robot. A robot is a stack of motors trying to act like muscles, and almost none of them can do it directly. An electric motor wants to spin fast and push lightly; a robot joint wants to move slowly and shove hard. The gearbox is the translator sitting between those two worlds, and it quietly decides more about your robot's behavior than the motor itself — how stiff the arm feels, how much it backlashes, whether it can be backdriven for force control, how loud it is, and how long it survives before the teeth wear out. 
Most engineers learn motors first and treat the gearbox as a catalog line item. That's backwards. Pick the wrong reduction technology and you'll fight backlash forever, or burn efficiency you can't afford on a battery, or watch a flexspline crack at 40% of its rated life because nobody checked the momentary peak torque. The three families you'll meet — planetary, harmonic (strain-wave), and cycloidal (RV) — are not interchangeable. Each is a different bet on the metrics that matter. **The take**: Harmonic drives own the wrist and the lightweight cobot joint because they give you 50:1 to 160:1 in one zero-backlash stage at low mass; cycloidal RV drives own the heavy proximal axes of industrial arms because they eat shock loads and stay stiff under big moments; planetary gearboxes own everything where cost and backdrivability matter more than arc-minutes. Choose by the joint, not by habit. Companion reading: [servo motors](/posts/servo-motors-ultimate-guide/), [robot actuators](/posts/robot-actuators-ultimate-guide/), [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/), and [collaborative robots / cobots](/posts/collaborative-robots-cobots-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why robots need gear reduction](#why-reduction) 3. [The metrics that actually matter](#metrics) 4. [Spur and planetary gearboxes](#planetary) 5. [Harmonic / strain-wave drives](#harmonic) 6. [Cycloidal drives](#cycloidal) 7. [Head-to-head: harmonic vs cycloidal vs planetary](#head-to-head) 8. [Backlash and how to fight it](#backlash) 9. [Backdrivability and the gear-ratio tradeoff](#backdrivability) 10. [Efficiency, heat and lubrication](#efficiency) 11. [Sizing and selecting a gearbox](#sizing) 12. [Where each gearbox shows up](#where-used) 13. [Failure modes, wear and maintenance](#failures) 14. 
[Frequently asked questions](#faq)

## Key takeaways

- **Motors are high-speed, low-torque machines.** A typical BLDC servo wants to run at 3,000–6,000 rpm and makes a few tenths of a N·m continuous. Robot joints want roughly 1–100 rpm and tens to thousands of N·m. The gearbox bridges that 50–200× gap.
- **Reduction multiplies torque by the ratio and divides speed by the ratio** (minus losses), and — the part people forget — it multiplies the motor's rotor inertia, as reflected to the joint, by the *square* of the ratio (equivalently, it divides the load inertia seen by the motor by the square of the ratio). That inertia term is why high ratios make a joint feel stiff and controllable.
- **Backlash is the headline spec for precision.** Quality planetary: 1–6 arc-min. Cycloidal RV: ~1 arc-min lost motion. Harmonic: effectively zero backlash (the spec sheet says "<1 arc-min" or "zero"), though it has hysteresis from flexspline compliance.
- **Harmonic (strain-wave) drives** give 30:1 to 160:1 in a single, thin, coaxial stage at low mass with zero backlash. That combination is why nearly every cobot and industrial-arm wrist uses them.
- **Cycloidal RV drives** trade a little backlash for huge shock tolerance (often ~5× rated torque momentarily), high torsional stiffness, and excellent moment capacity. They dominate the base, shoulder, and elbow of payload industrial arms.
- **Planetary gearboxes** are the cost-and-density default: 3:1 to ~10:1 per stage, 70–93% efficiency depending on stages, backlash from <1 arc-min (preloaded precision) to >30 arc-min (economy), and they backdrive far better than the other two.
- **Efficiency is not a footnote on a battery.** Harmonic drives run ~70–90% but drop hard at low load and cold temperatures; a 100:1 strain-wave at 20% load on a cold morning can dip below 50%. Cycloidal sits ~80–93%. Two-stage planetary ~80–88%.
- **Backdrivability falls as ratio rises.** Roughly, a drive becomes hard to backdrive above ~30:1–50:1.
Quasi-direct-drive (QDD) actuators deliberately stay at 6:1–10:1 to keep transparency for force control; high-ratio harmonic joints give it up for stiffness and holding torque.
- **Torsional stiffness matters as much as backlash** for trajectory accuracy and vibration. Cycloidal and large harmonic units are stiff (tens to hundreds of kN·m/rad); small harmonic units are noticeably more compliant and that compliance shows up as path error under load.
- **Size by torque AND inertia AND life.** Rated (continuous) torque, repeated peak (acceleration) torque, momentary peak (shock/e-stop) torque, average load over the duty cycle, and L10 bearing/fatigue life are five different numbers. The unit has to satisfy all of them, so the smallest size that passes one check is rarely the right one.
- **The real product landscape is concentrated.** Harmonic Drive LLC / Harmonic Drive SE dominate strain-wave; Nabtesco owns the RV cycloidal industrial-arm market; Spinea and Sumitomo offer cycloidal alternatives; Apex Dynamics, Neugart, Wittenstein/alpha, and Maxon cover planetary.
- **Match the gearbox to the joint.** Wrist and forearm: harmonic. Base/shoulder/elbow of a payload arm: cycloidal RV. Legged-robot and force-controlled joints: low-ratio planetary / QDD. AGV wheels: planetary hub drives. Pick deliberately.

## Why robots need gear reduction

Start from the physics of the prime mover. A permanent-magnet servo motor produces torque proportional to current and speed proportional to voltage; its power peaks somewhere in the thousands of rpm. The continuous torque of a NEMA-23-ish servo or a 40 mm frameless rotor is on the order of 0.1–1 N·m. A robot's elbow, by contrast, might need to hold 150 N·m static and move at 60–180 °/s (10–30 rpm). You cannot get there directly without an absurdly large, heavy motor. So you trade speed for torque.
An ideal gearbox of ratio *N* does three things at once: ``` Output torque = N × motor torque × efficiency Output speed = motor speed / N Reflected inertia at the motor = load inertia / N² ``` The first line is the obvious one and the reason gearboxes exist. The third line is the one that separates good robot designs from bad ones. ### Reflected inertia is the hidden prize When a motor drives a load through a reduction *N*, the load's inertia *as seen by the motor* shrinks by *N²*. Flip it around: the motor's own rotor inertia, *as seen by the joint*, grows by *N²*. ``` Example: motor rotor inertia Jm = 5e-5 kg·m² link inertia at joint Jl = 0.5 kg·m² ratio N = 100 Load inertia reflected to motor = Jl / N² = 0.5 / 10000 = 5e-5 kg·m² → reflected load now equals the rotor inertia: inertia ratio ≈ 1:1, easy to control. Motor inertia reflected to joint = Jm × N² = 5e-5 × 10000 = 0.5 kg·m² → the rotor now contributes as much "apparent mass" at the joint as the link itself. ``` This is why a high-ratio joint feels rigid and is easy to servo to a position: the controller mostly sees the motor's own well-behaved rotor, not the messy, varying link inertia. It's also why a high-ratio joint is a terrible force sensor — the *N²* rotor reflection sits between you and the outside world. We'll come back to that tension in the [backdrivability](#backdrivability) section, because it's the single most important conceptual fork in robot drivetrain design. > **Rule of thumb:** Aim for a reflected inertia ratio (load:motor) between roughly 1:1 and 10:1 for crisp, well-damped servo response. Far above 10:1 and tuning gets twitchy; far below 1:1 and you're hauling a motor that's oversized for the job. ### Torque multiplication and the speed match The other half is mundane but unforgiving. Motors are efficient and light *for a given power*, and power is torque × speed. Making torque the cheap way means making speed and gearing it down. 
A 100 W motor at 5,000 rpm produces ~0.19 N·m; gear it 100:1 at 85% efficiency and you get ~16 N·m at 50 rpm. Try to make that 16 N·m directly and you need a motor several times heavier. Gear reduction is, fundamentally, how you buy joint torque by the kilogram instead of by the dozen kilograms. See the [servo motors guide](/posts/servo-motors-ultimate-guide/) for how the motor side of this equation is sized.

## The metrics that actually matter

Before comparing technologies, lock down vocabulary. These terms get used loosely and that's where selection mistakes start.

| Metric | What it means | Why it matters | Typical units |
|---|---|---|---|
| **Ratio (N)** | Input revolutions per output revolution | Sets torque gain, speed, reflected inertia | e.g. 100:1 |
| **Backlash** | Angular free play at output with input held | Lost positioning at motion reversal; limits repeatability | arc-min (1 arc-min = 1/60°) |
| **Lost motion** | Total output deflection under a small specified torque, including backlash + elastic windup | The "real" reversal error you measure | arc-min |
| **Hysteresis** | The width of the torque–deflection loop | Energy lost and error on load reversal | arc-min @ torque |
| **Torsional stiffness** | Output torque per unit elastic twist | Path accuracy, natural frequency, vibration | N·m/arc-min or kN·m/rad |
| **Efficiency (η)** | Output power / input power | Heat, battery life, required motor size | % at rated load/speed |
| **Rated (continuous) torque** | Torque sustainable for L10 life at rated speed | Sizing for the steady duty cycle | N·m |
| **Repeated peak torque** | Allowed during accel/decel, limited cycles | Sizing for motion peaks | N·m |
| **Momentary peak / shock torque** | Survivable for a few cycles (e-stop, collision) | Sizing for the worst case | N·m, often 2–5× rated |
| **Backdrivability** | Ease of driving the output to move the input | Force control, safety, energy regen | qualitative / N·m to backdrive |

A few notes
that separate spec-sheet readers from spec-sheet users: **Backlash is not lost motion.** Backlash is the dead zone with essentially zero torque. Lost motion is what you actually feel when you reverse direction under a working torque, and it includes elastic windup. A harmonic drive can advertise "zero backlash" and still show 0.5–2 arc-min of lost motion because the flexspline twists elastically. For a closed-loop trajectory, lost motion and stiffness matter more than the backlash number on the cover. **Stiffness sets your bandwidth.** The gearbox is a torsional spring between motor and link. Its stiffness, combined with the link inertia, sets a resonance — often in the 10–80 Hz range for robot joints — that caps your usable control bandwidth. A compliant gearbox doesn't just sag under load; it limits how aggressively you can servo before you ring. **Three torque numbers, not one.** Rated, repeated peak, and momentary peak are different physical limits — wear/fatigue, gear-tooth/lubrication, and structural respectively. The most common sizing error in robotics is picking on rated torque and getting destroyed by the momentary peak during a crash or e-stop. This is exactly where cycloidal earns its keep. ## Spur and planetary gearboxes The planetary gearbox is the workhorse and the default. If you don't have a specific reason to use harmonic or cycloidal, you're probably using planetary, and that's usually the right call. ### How a planetary stage works A planetary (epicyclic) stage has a central **sun gear** (the input), several **planet gears** carried on a **carrier**, and an outer **ring gear** (internal teeth). Hold the ring fixed, drive the sun, take output from the carrier, and the ratio is: ``` N = 1 + (ring teeth / sun teeth) Example: ring = 72 teeth, sun = 18 teeth N = 1 + 72/18 = 1 + 4 = 5:1 ``` Practical single-stage ratios run **3:1 to about 10:1**. 
Below 3:1 the sun gets too big relative to the ring; above ~10:1 the sun gets so small it's fragile and the planets crowd. To go higher you stack stages: a two-stage gets you ~9:1 to 100:1, three-stage up to a few hundred:1. Each stage costs you efficiency (~2–3% per stage) and adds backlash, mass, and length. The reason planetary dominates by volume: load is **shared across multiple planets** (typically 3, sometimes 4–5), so torque density is high and the input/output are coaxial. They're made by the millions, so they're cheap and available in every size. ### Backlash classes Planetary backlash is a purchasing decision, not a fixed property. Vendors sell grades: - **Economy / standard:** 10–30+ arc-min. Fine for conveyors, AGV traction, anything position-loop-corrected. - **Reduced backlash:** 3–8 arc-min. General robotics and automation. - **Precision / low-backlash:** 1–3 arc-min, sometimes <1 arc-min with preload. - **Zero-backlash:** achieved via split/preloaded gears or flexible elements, at real cost and some efficiency penalty. Real products to anchor this: **Neugart** (PLE/PLN economy through their precision lines), **Apex Dynamics** (AB/AE/AF series, popular for value), **Wittenstein alpha** (TP/SP/NP — premium, down to ~1 arc-min and below), and **Maxon GP** gearheads matched to their motors for compact mechatronic packages. For a small servo joint that needs to be cheap and reasonably tight, an Apex or Neugart precision planetary at 3 arc-min is often the pragmatic answer over a harmonic drive costing several times more. > **When to choose planetary:** cost-sensitive joints, traction/wheel drives, applications where 1–6 arc-min is good enough, and — importantly — anywhere you want decent backdrivability and don't need a huge single-stage ratio. The thing planetary *can't* easily do is give you 100:1 in one short, light, zero-backlash package. For that you go strain-wave. 
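
The stage formula and the stacking rule from this section can be sketched in a few lines of Python. This is a minimal illustration, assuming a fixed ring with the sun as input and the carrier as output; the helper names and the 97% per-stage efficiency (the optimistic end of the per-stage loss quoted above) are assumptions for the example, not vendor data.

```python
def planetary_stage_ratio(ring_teeth: int, sun_teeth: int) -> float:
    """Ratio of one stage: ring fixed, sun driven, output on the carrier."""
    return 1 + ring_teeth / sun_teeth


def stacked(ratios, stage_efficiency=0.97):
    """Overall ratio and efficiency of stages in series.

    stage_efficiency is an assumed, equal per-stage value for illustration.
    """
    total_ratio = 1.0
    total_eff = 1.0
    for r in ratios:
        total_ratio *= r
        total_eff *= stage_efficiency
    return total_ratio, total_eff


single = planetary_stage_ratio(72, 18)   # the 72/18 example from the text
ratio, eff = stacked([single, single])   # two identical stages in series

print(f"single stage: {single:.0f}:1")
print(f"two stages:   {ratio:.0f}:1 at about {eff:.0%} efficiency")
```

Ratios multiply and efficiencies multiply, which is why each extra stage buys ratio at a compounding cost in losses, backlash, and length.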
## Harmonic / strain-wave drives The harmonic drive (strain-wave gear) is the piece of mechanical cleverness that made compact, precise robot arms possible. Invented by C. Walton Musser in the 1950s and commercialized by what became **Harmonic Drive LLC / Harmonic Drive SE**, it does something the others can't: a single coaxial stage of 30:1 to 160:1 with essentially zero backlash, in a thin pancake form factor. ### The three parts 1. **Wave generator** — an elliptical steel cam wrapped in a thin, flexible ball bearing. This is the input, on the motor shaft. 2. **Flexspline** — a thin-walled, cup- or hat-shaped flexible steel cylinder with external teeth. It's deformed into an ellipse by the wave generator. This is usually the output. 3. **Circular spline** — a rigid internal ring gear with *two more teeth* than the flexspline. Usually fixed to the housing. Here's the trick. The elliptical wave generator pushes the flexspline's teeth into mesh with the circular spline at the two ends of the ellipse's major axis. Because the flexspline has **two fewer teeth** than the circular spline, every full rotation of the wave generator advances the flexspline by exactly two teeth *backward* relative to the circular spline. Spin the input once; the output creeps by two teeth. ``` N = flexspline teeth / (circular spline teeth − flexspline teeth) = flexspline teeth / 2 (since the difference is 2) Example: flexspline = 200 teeth, circular spline = 202 teeth N = 200 / 2 = 100:1 ``` That's how you get 100:1 from one stage in a part you can hold in your palm. And because many teeth (often 15–30% of the total) are engaged simultaneously at any instant, the load sharing is enormous — that's the source of both the high torque density and the zero backlash. There's no clearance to take up; the teeth are continuously, elastically preloaded into engagement. ### Why "zero backlash" but not "zero lost motion" The flexspline is, by design, a spring. 
Apply torque and it winds up elastically before the output moves — that's the lost motion and hysteresis you see on the datasheet (typically specified as an arc-min figure at a given % of rated torque, e.g. 0.5–1.5 arc-min). For positioning that's superb. For high-bandwidth force control through the gearbox it's a limitation, because the compliance is in series with everything you're trying to control.

### Flexspline fatigue is the life-limiter

The flexspline flexes from circular to elliptical and back **twice per input revolution**. At a few thousand input rpm that's hundreds of thousands of fatigue cycles per hour, and millions per day of running. Strain-wave life is governed by:

- **Average load torque** over the duty cycle (used to compute rated-life hours), and
- **Momentary peak torque** — exceed the momentary peak rating (often ~2–3.5× rated) and you can plastically deform or ratchet (tooth jump) the flexspline, or crack it outright. A flexspline that's been ratcheted even once should be treated as suspect.

This is the harmonic drive's Achilles' heel relative to cycloidal: it's a thin steel cup under cyclic strain, so shock-load margin is comparatively modest.

### Why every cobot and industrial wrist uses them

The combination — high ratio, low mass, zero backlash, hollow-bore options for cable routing, thin axial length, coaxial — is exactly what a robot wrist and forearm want. Universal Robots, Franka, Kuka's lighter joints, and essentially every [collaborative robot](/posts/collaborative-robots-cobots-ultimate-guide/) on the market use strain-wave gears in their distal joints. Harmonic Drive's own integrated **FHA/SHA** actuators (motor + strain-wave + encoder + brake in one housing) are a default building block for arm and [humanoid](/posts/humanoid-robot-hardware-ultimate-guide/) designers. Sumitomo's **Fine Cyclo** and a handful of others compete, but Harmonic Drive's name is on the category for a reason.
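
As a rough companion to the tooth-count formula and the fatigue discussion in this section, the sketch below computes the strain-wave ratio from tooth counts and counts flexspline flex cycles for a given input speed. The function names and the 3,000 rpm input speed are assumptions picked for illustration; real life ratings come from the manufacturer's curves, not from a raw cycle count.

```python
def strain_wave_ratio(flexspline_teeth: int, circular_spline_teeth: int) -> float:
    """Reduction ratio with the circular spline fixed and the flexspline as output."""
    return flexspline_teeth / (circular_spline_teeth - flexspline_teeth)


def flex_cycles(input_rpm: float, hours: float) -> float:
    """Flexspline flex cycles: two per input revolution."""
    return 2 * input_rpm * 60 * hours


ratio = strain_wave_ratio(200, 202)        # the 200/202 example from the text
per_hour = flex_cycles(3000, 1)            # assumed 3,000 rpm input speed
per_1000_h = flex_cycles(3000, 1000)

print(f"ratio: {ratio:.0f}:1")
print(f"flex cycles in 1 h:     {per_hour:,.0f}")
print(f"flex cycles in 1,000 h: {per_1000_h:,.0f}")
```

The cycle count scales linearly with input speed, which is one reason a higher ratio (faster input for the same joint speed) uses up flexspline life sooner.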
## Cycloidal drives

If the harmonic drive is the precision specialist, the cycloidal drive is the heavyweight. Where strain-wave gears flex a thin steel cup, cycloidal drives roll a thick steel disc against a ring of pins — and that structural robustness is the whole point.

### How a cycloidal stage works

1. An **input shaft with an eccentric cam** wobbles a **cycloidal disc** (a disc with a lobed, cycloidal profile) in a small orbit.
2. The disc's lobes roll against a ring of **fixed pins/rollers** in the housing. The disc has **one fewer lobe** than there are pins.
3. As the cam orbits once, the disc rotates backward by one lobe. **Output pins** (or rollers through holes in the disc) pick off that slow rotation and deliver it to the output shaft.

```
Single cycloidal stage (pins fixed, output taken from the disc):
  N = disc lobes / (pins − disc lobes)
    = disc lobes            (for a one-lobe difference)

Example: 40 pins, disc with 39 lobes
  N = 39 / (40 − 39) = 39:1   (commonly quoted as the lobe count)
```

Most drives run two cycloidal discs 180° out of phase to balance the orbiting mass and reduce vibration. The **RV-type** ("Rotary Vector") drive — pioneered and dominated by **Nabtesco** — adds a planetary input stage in front of the cycloidal stage, giving very high overall ratios (commonly **30:1 to 200:1+**) with excellent stiffness and shock tolerance.

### Why RV-type dominates heavy industrial axes

Three properties make cycloidal the right answer for the proximal axes of payload arms:

- **Shock-load capacity.** Because torque is carried by many pins/rollers in compression against a thick disc, momentary overload ratings are typically **~5× rated torque**. When a 50 kg payload hits an e-stop, that margin is the difference between a scuffed disc and a destroyed gearbox.
- **Torsional stiffness and moment rigidity.** RV units integrate large main bearings (often cross-roller) that take big tilting moments directly, so they hold the arm's geometry under load.
Stiffness runs high — useful when the gearbox is also the structural joint. - **Low, stable lost motion.** ~1 arc-min, and it stays low over life because there's no thin flexing element to fatigue the same way. The tradeoff is mass and cost: an RV unit for a robot elbow is a dense chunk of steel, heavier than a harmonic of similar ratio, and it carries some ripple/vibration from the eccentric motion. That's fine on the base, shoulder, and elbow where you've got the structure anyway and where shock and stiffness rule — exactly the axes detailed in the [industrial robot arms guide](/posts/industrial-robot-arms-ultimate-guide/). It's the wrong choice out at the wrist where every gram costs you payload. Real products: **Nabtesco RV** (the de-facto standard — RV-E, RV-N, and component sets used by FANUC, ABB, Yaskawa, Kuka in their bigger arms), **Spinea TwinSpin** (cycloidal with integrated bearing, popular where compactness and rigidity both matter), and **Sumitomo Cyclo** (the original cyclo gearing, broad industrial range). ## Head-to-head: harmonic vs cycloidal vs planetary Numbers are representative of robotics-grade units in the small-to-medium size range; specific products vary, so treat these as the shape of the tradeoff, not gospel. 

| Property | Planetary (precision) | Harmonic / strain-wave | Cycloidal RV |
|---|---|---|---|
| **Single-stage ratio** | 3:1 – 10:1 | 30:1 – 160:1 | 30:1 – 200:1+ (RV w/ input stage) |
| **Backlash** | 1 – 6 arc-min (≤1 preloaded) | ~zero (no clearance) | ~1 arc-min |
| **Lost motion** | 1 – 6 arc-min | 0.5 – 1.5 arc-min | ~1 arc-min |
| **Torsional stiffness** | Moderate–high | Moderate (small) to high (large) | High |
| **Efficiency (rated)** | 80 – 93% (1–2 stage) | 70 – 90% | 80 – 93% |
| **Efficiency at low load/cold** | Holds up well | Drops sharply (can be <50%) | Moderate drop |
| **Momentary peak / shock** | ~2–3× rated | ~2–3.5× rated (ratchet risk) | **~5× rated** |
| **Mass for given ratio/torque** | Low–moderate | **Low** | High |
| **Axial length** | Long (stacked stages) | **Short (pancake)** | Moderate |
| **Backdrivability** | Good (low ratio) | Poor (high ratio + friction) | Poor |
| **Vibration / smoothness** | Good | Very smooth | Some ripple from eccentric |
| **Relative cost** | $ | $$$ | $$$ |
| **Best home** | Wheels, cheap joints, force-control (QDD) | Wrists, forearms, cobots, humanoids | Base/shoulder/elbow of payload arms |

The one-line summary engineers should internalize:

> **Planetary for cost and backdrivability; harmonic for ratio, precision and low mass; cycloidal for shock and stiffness.** Most real arms use all three — cycloidal at the base, harmonic at the wrist, sometimes planetary in a gripper or a low-ratio shoulder.

## Backlash and how to fight it

Backlash is the angular free play that lets a meshing gear pair reverse direction slightly before the driven gear responds. In an open-loop system it's positioning error you can't recover. In a closed-loop system with a load-side encoder you can correct *position*, but you still get a velocity glitch and impulsive contact at every reversal — bad for surface finish in machining, bad for vibration, bad for gear life.
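
To make the open-loop positioning error concrete, here is a small sketch that converts an angular backlash figure into linear play at the end of a link. It uses the small-angle arc length (angle × radius); the function name and the 0.5 m reach are assumptions chosen for illustration.

```python
import math


def reversal_error_mm(backlash_arcmin: float, reach_m: float) -> float:
    """Linear free play at a given radius for an angular backlash in arc-min."""
    angle_rad = math.radians(backlash_arcmin / 60.0)  # 1 arc-min = 1/60 degree
    return angle_rad * reach_m * 1000.0               # arc length in mm


REACH_M = 0.5  # assumed distance from joint axis to tool point

for arcmin in (1, 3, 6, 30):
    err = reversal_error_mm(arcmin, REACH_M)
    print(f"{arcmin:>2} arc-min at {REACH_M} m reach -> {err:.2f} mm of play")
```

The error grows linearly with reach, so the same gearbox grade costs far more at a base joint than at a wrist.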
### Where backlash comes from

You need a small clearance for lubrication and thermal expansion, so spur and planetary gears are built with it on purpose. Wear widens it over life. Stack three planetary stages and each stage adds its own play, with the output stage contributing most because upstream play is divided by the downstream ratio. Harmonic and cycloidal drives sidestep this by preloading the mesh (strain-wave's continuous tooth engagement, cycloidal's many-pin contact), which is precisely why they're "zero/low backlash."

### Techniques to reduce it in geared drives

- **Anti-backlash gears.** Split a gear into two halves with a spring between them so each half loads opposite tooth flanks. Cheap, common, but the spring limits torque and adds drag.
- **Preloaded planetary.** Vendors grind and select gears, then preload, to hit <1 arc-min. You pay for it in price and a little efficiency.
- **Dual-motor electronic preload (master/slave).** Drive one output through two motors/gear trains and command them with a small opposing bias torque so the mesh is always loaded on one side. Used on machine-tool rotary tables and some high-end robot axes. Effective, but doubles the drive hardware and needs careful control.
- **Pick a zero-backlash topology.** Often the cheapest path to "no backlash" is simply choosing harmonic or cycloidal rather than fighting a planetary.

> **The cost of zero backlash:** every arc-minute of backlash you remove costs money, efficiency, or both. Don't buy 1 arc-min where 6 arc-min and a load-side encoder will do. Spend the precision budget on the axes that actually set the tool point.

## Backdrivability and the gear-ratio tradeoff

This is the most important conceptual decision in robot drivetrains, and it's a genuine fork: **you cannot have a high ratio and good backdrivability at the same time.**

### The physics of why high ratio kills transparency

Two effects gang up as ratio rises:

1. **Reflected inertia scales with N².** From the [output side](#why-reduction), the rotor's apparent inertia at the joint is `Jm × N²`.
At N=100 a tiny rotor feels like a heavy flywheel attached to the joint. Pushing the output has to accelerate that apparent mass. 2. **Friction is amplified and gearing is non-reciprocal.** Friction torque referred to the output grows with ratio, and high-ratio drives (especially strain-wave with its many simultaneously-meshing teeth) have enough friction that the output simply won't backdrive the input under reasonable force. A worm gear is the extreme case (self-locking); strain-wave at 100:1 is close in spirit. So a 100:1 harmonic joint is *opaque*: you can't feel external forces through it without a torque sensor, and you can't gently push the arm by hand. That's great for holding a position rigidly with low motor current; it's bad for force control and for inherent safety. ### Low ratio for force control: the QDD philosophy The legged-robotics and force-control crowd went the other way. A **quasi-direct-drive (QDD)** actuator pairs a large, low-Kv "pancake" motor with a *single* low-ratio planetary stage, typically **6:1 to 10:1**. Why: - Reflected rotor inertia stays low (`Jm × N²` with small N), so the output is **transparent** — you can sense and control force by measuring motor current alone, no torque sensor needed. - It **backdrives freely**, so the leg can absorb impacts (a robot landing from a jump) and you can do impedance control with high fidelity. - It's **robust to shock** because there's little gearing to break and the big motor takes the hit. This is the architecture behind MIT Cheetah-lineage actuators and most modern quadrupeds and dynamic bipeds — see the [legged / quadruped hardware guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/) and the broader [robot actuators guide](/posts/robot-actuators-ultimate-guide/) for the full actuator-level treatment. The price is torque density: a QDD makes its torque mostly from a big, heavy motor rather than from gearing, so it's bulkier and draws more current to hold static loads. 
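
The `Jm × N²` term is easy to compare across the two architectures. The sketch below is illustrative only: the rotor inertias, the link inertia, and the two ratios are assumed values (a larger rotor for the pancake QDD motor, a small rotor for the high-ratio joint), not measurements from any product.

```python
def reflected_rotor_inertia(rotor_inertia: float, ratio: float) -> float:
    """Rotor inertia as felt at the joint output, in kg·m²."""
    return rotor_inertia * ratio ** 2


LINK_INERTIA = 0.5  # kg·m², assumed link inertia at the joint

# (rotor inertia in kg·m², gear ratio): hypothetical example designs
designs = {
    "QDD planetary 8:1": (4e-4, 8),
    "harmonic 100:1": (5e-5, 100),
}

results = {}
for name, (jm, n) in designs.items():
    j_out = reflected_rotor_inertia(jm, n)
    results[name] = j_out
    share = j_out / LINK_INERTIA
    print(f"{name}: rotor feels like {j_out:.4f} kg·m² at the joint "
          f"({share:.0%} of the link inertia)")
```

Even with a rotor eight times larger, the low-ratio design presents a small fraction of the apparent inertia at the output, which is the transparency argument in numbers.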
> **The fork:** High ratio (harmonic, RV) → stiff, precise, low holding current, opaque, and hard on the gear train in an impact because the output cannot yield (strain-wave especially). Low ratio (QDD planetary) → transparent, backdrivable, shock-tolerant, but heavier per N·m and worse at holding static loads efficiently. A surgical arm and a parkour quadruped sit at opposite ends, and they're right to.

Some designers split the difference with a **mid-ratio (15:1–25:1) drive plus a series-elastic or load-side torque sensor**, getting most of the precision while measuring force directly. That's a legitimate third path, common in humanoid hips and knees.

## Efficiency, heat and lubrication

Efficiency is where datasheet optimism meets the battery, and it's badly underspecified in casual selection.

### Efficiency is a function of load, speed, and temperature — not a single number

The "85%" on the cover is at rated torque, rated speed, warm. Real robot duty cycles spend a lot of time at low load, and that's where it falls apart, especially for harmonic drives:

- A harmonic drive at **20% of rated torque** can sit at **50–65%** efficiency even when warm; cold, it's worse.
- At **0 °C startup**, lubricant viscosity spikes and a strain-wave's no-load running torque can multiply, dragging efficiency down further until it warms up. If you're sizing a cold-start outdoor robot, derate accordingly.
- Higher ratios are less efficient: a 30:1 harmonic might be ~85% at rated, a 160:1 closer to ~70%.

Planetary holds efficiency better across the load range (fewer, simpler losses), and cycloidal sits in the middle-to-good band.

### Heat: the losses have to go somewhere

`Heat = input power × (1 − η)`. A joint pushing 200 W through an 80% gearbox dumps 40 W into the gearbox housing. In a sealed, lubricated drive with limited surface area, that raises temperature, thins the lube, and can drive you toward a thermal duty-cycle limit *before* you hit a torque limit. For continuously-loaded joints, check the thermal rating, not just the torque rating.
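
The heat relation above, and its flip side for a battery, can be checked with a short sketch. The 200 W and 80% figures come from the text; the 100 W output power and the 85% versus 60% comparison are assumed operating points for illustration, and the function names are hypothetical.

```python
def gearbox_heat_w(input_power_w: float, efficiency: float) -> float:
    """Power dissipated in the gearbox: input power × (1 − η)."""
    return input_power_w * (1.0 - efficiency)


def input_power_for_output(output_power_w: float, efficiency: float) -> float:
    """Input power the motor must supply to deliver a given output power."""
    return output_power_w / efficiency


heat = gearbox_heat_w(200.0, 0.80)               # the example from the text
rated = input_power_for_output(100.0, 0.85)      # assumed rated-point efficiency
light_cold = input_power_for_output(100.0, 0.60) # assumed light-load, cold efficiency

print(f"heat at 200 W in, 80% efficient: {heat:.0f} W")
print(f"input for 100 W out at 85%: {rated:.0f} W")
print(f"input for 100 W out at 60%: {light_cold:.0f} W "
      f"(+{light_cold - rated:.0f} W from the battery)")
```

Running the same output through the lower efficiency costs tens of extra watts at the input, all of which ends up as heat in the housing.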
### Lubrication - **Grease** for most robotics: sealed, low maintenance, good for the typical intermittent duty. Watch the temperature rating and the relube interval (often tens of thousands of hours, but it exists). - **Oil** for high-speed, high-duty, or high-heat applications (some industrial RV setups), with the plumbing and sealing that implies. - **Grease migration and seal life** are real failure paths. A harmonic drive that loses grease from the wave-generator bearing wears fast. > **Battery-robot rule:** model gearbox efficiency at your *actual* operating point (load %, speed, temperature), not at the rated point. The difference between 85% and 60% across a duty cycle is a meaningful chunk of your runtime. ## Sizing and selecting a gearbox A defensible selection is a short engineering procedure, not a catalog glance. Here's the order that catches the mistakes. ### 1. Define the joint requirements - **Continuous (RMS) output torque** over the duty cycle. - **Repeated peak torque** during acceleration/deceleration, and how many cycles. - **Momentary peak torque** — the worst case: collision, e-stop, payload drop. This is often the sizing driver and the one people skip. - **Output speed** range and the **average input speed** (needed for harmonic/cycloidal life). - **Required backlash / lost motion** and **stiffness** for your accuracy and bandwidth targets. - **Moment and axial/radial loads** at the output (does the gearbox bearing carry the joint, or is there a separate bearing?). ### 2. Choose the ratio Ratio is a system optimization, not a free choice. It couples the motor and the gearbox: ``` Pick N to: - reach joint torque: N ≥ T_joint / (T_motor,cont × η) - keep motor in its sweet spot: motor speed = N × joint speed → should land near rated rpm - get a sane reflected inertia ratio: Jl / (N² × Jm) ≈ 1–10 - leave headroom for peak torque without ratcheting ``` These pull against each other. 
Higher N gives torque and a nice inertia ratio but kills backdrivability and efficiency and runs the input faster (more flexspline fatigue cycles). The right N is a negotiated settlement between the motor's [torque-speed curve](/posts/servo-motors-ultimate-guide/) and the joint's needs.

### 3. Check life (L10 and fatigue)

Bearings and gears have a statistical life. For planetary and cycloidal, the bearing **L10** life (the running time 90% of units reach, i.e. 10% failure probability) scales as `(C/P)^p` in revolutions, so life in hours falls in inverse proportion to speed. Here `C` is the dynamic load rating, `P` the equivalent load, and `p` is about 3 for ball bearings and 10/3 for roller bearings. For harmonic drives, the manufacturer gives a **rated life in hours** computed from your *average* load torque and *average* input speed — you compute an equivalent cubic-mean torque over the duty cycle and read life off the curve. Undersize here and the drive simply wears out early; it won't fail on day one, which makes this error easy to ship.

### 4. Verify the peaks and the thermal limit

Confirm momentary peak torque ≤ the gearbox's momentary rating (with margin — 1.5–2× is sane for collision-prone robots), repeated peak ≤ the repeated rating, and that the average power loss doesn't exceed the thermal rating at your ambient.

### 5. Mounting and integration

Hollow bore for cable routing? Output flange and bolt pattern? Does the gearbox provide the main joint bearing (RV and many integrated harmonic units do) or do you add one? Integrated actuators (Harmonic Drive FHA/SHA, Nabtesco gear+motor sets) save you the alignment and tolerancing grief at a price.

> **Sizing sanity check:** if your selection is driven only by continuous torque, you probably under-sized for shock. If it's driven only by shock, you may have over-sized for the duty cycle and you're hauling dead mass. Find the binding constraint, then check the others didn't quietly bind too.

## Where each gearbox shows up

Mapping technology to application is the payoff of all the above.
| Application | Joint / location | Typical gearbox | Why | |---|---|---|---| | **Cobot** ([cobots guide](/posts/collaborative-robots-cobots-ultimate-guide/)) | All joints, esp. wrist/forearm | Harmonic (often integrated FHA/SHA) | Zero backlash, low mass, hollow bore, thin — and torque sensing added externally for safety | | **Industrial payload arm** ([arms guide](/posts/industrial-robot-arms-ultimate-guide/)) | Base, shoulder, elbow (J1–J3) | Cycloidal RV (Nabtesco) | Shock tolerance (~5×), high stiffness/moment capacity, holds geometry under big loads | | **Industrial payload arm** | Wrist (J4–J6) | Harmonic | Compact, light, precise where payload margin is tight | | **Humanoid** ([humanoid guide](/posts/humanoid-robot-hardware-ultimate-guide/)) | Hip / knee (dynamic) | Low/mid-ratio planetary (QDD) or RV, + torque sensing | Backdrivability and shock for dynamic motion; some use compact harmonic for arms | | **Humanoid** | Wrist / fingers | Harmonic or small planetary | Precision and packaging | | **Quadruped** ([legged guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/)) | Hip/knee | QDD planetary (6:1–10:1) | Transparency for impedance control, impact absorption, robustness | | **AGV / AMR** ([mobile robots guide](/posts/mobile-robots-amr-agv-ultimate-guide/)) | Drive wheels | Planetary hub / wheel drives | Cost, robustness, ratio for traction; backlash irrelevant | | **Surgical / metrology arm** | All joints | Harmonic | Zero backlash and smoothness dominate | The pattern is consistent: **precision and low mass distally, shock and stiffness proximally, transparency where you control force.** A well-designed arm is not one gearbox technology — it's the right one at each joint. ## Failure modes, wear and maintenance Gearboxes rarely fail suddenly out of nowhere; they tell you first if you're listening. ### Common failure modes by type **Planetary** - *Backlash growth from tooth wear* — the slow, normal end of life. Shows up as degraded repeatability. 
- *Bearing wear / pitting* — increased noise and vibration, eventually play. - *Tooth fracture* from a shock load beyond the momentary rating — sudden, catastrophic. **Harmonic / strain-wave** - *Flexspline fatigue crack* — the dominant end-of-life mode, from accumulated flex cycles or an overload event. Appears at the tooth root or the diaphragm/cup transition. - *Tooth jumping / ratcheting* under momentary overload — instantly damages the mesh and the flexspline; the drive may run but with degraded accuracy and a shortened life. (Distinct from a *dedoidal* condition — an improper, eccentric tooth mesh from misalignment or assembly error — which also drives vibration and early flexspline failure.) - *Wave-generator bearing failure* — loss of grease or contamination; raises running torque and accelerates everything else. **Cycloidal RV** - *Surface wear/pitting on pins, rollers, and the disc* — gradual, raises lost motion and noise. - *Eccentric bearing wear* — vibration and lost motion increase. - *Main bearing wear* — joint develops play/tilt; matters because the gearbox is structural. - Generally the most forgiving of the three under abuse, by design. ### Maintenance and condition monitoring - **Relube on schedule.** Grease degrades and migrates; the relube/refill interval is a real number in the manual, not optional. - **Trend the symptoms.** Rising no-load running torque, rising motor current to hold position, increased acoustic noise, growing positioning error after reversal (lost motion), and rising operating temperature are all early warnings. On instrumented robots, log motor current and joint following-error and watch the trend. - **Respect the overload history.** A drive that has taken a hard collision should be inspected or flagged even if it still runs — especially a harmonic flexspline, which can be cracked but functional. - **Seal integrity.** Contamination ingress kills gearboxes; a failing seal is an upstream cause of multiple downstream failures. 
> **Maintenance rule:** the cheapest gearbox failure is the one you catch as a trend. Instrument current and following-error, set thresholds, and replace on data — not on a fixed calendar that's either wastefully early or dangerously late. ## Frequently asked questions **What's the real difference between backlash and lost motion?** Backlash is the angular free play with essentially zero torque applied — a dead band. Lost motion is the total output deflection under a small *specified* torque, and it includes both backlash and elastic windup. A harmonic drive can have "zero backlash" yet 0.5–1.5 arc-min of lost motion because the flexspline twists elastically. For closed-loop trajectory accuracy, lost motion and stiffness matter more than the headline backlash number. **Why do collaborative robots almost always use harmonic drives?** Because the cobot wrist and forearm need high ratio, zero backlash, low mass, a hollow bore for cabling, and a thin axial package — and strain-wave is the only technology that delivers all five in one stage. Safety force-limiting is then layered on with a torque sensor or by estimating torque, since the high-ratio drive itself isn't backdrivable. See the [cobots guide](/posts/collaborative-robots-cobots-ultimate-guide/). **Why do big industrial arms use cycloidal (RV) drives at the base and shoulder?** Shock tolerance and stiffness. RV drives carry torque through many pins in compression and integrate large moment-bearing main bearings, so they survive momentary overloads around 5× rated and hold the arm's geometry under heavy payloads. That's exactly what the proximal axes of a payload arm need; the wrist gets harmonic instead. More in the [industrial arms guide](/posts/industrial-robot-arms-ultimate-guide/). **Can I backdrive a harmonic drive?** Practically, no, not at high ratios. Reflected rotor inertia scales with N² and the many-tooth mesh has enough friction that the output won't drive the input under reasonable force. 
That's why high-ratio harmonic joints need a torque sensor for force control. If you need backdrivability, use a low-ratio planetary / QDD architecture instead. **What is a quasi-direct-drive (QDD) actuator and when should I use it?** A QDD pairs a large, low-Kv pancake motor with a single low-ratio (≈6:1–10:1) planetary stage. The low ratio keeps reflected inertia and friction small, so the output is transparent and backdrivable — ideal for force/impedance control and impact absorption in legged robots. The cost is torque density: you make torque with a big heavy motor instead of gearing. See the [legged hardware](/posts/legged-quadruped-robot-hardware-ultimate-guide/) and [actuators](/posts/robot-actuators-ultimate-guide/) guides. **How do I pick a gear ratio?** Balance four things: enough torque (`N ≥ T_joint / (T_motor × η)`), keeping the motor near its rated rpm (`motor rpm = N × joint rpm`), a sane reflected inertia ratio (`Jl/(N²·Jm) ≈ 1–10`), and headroom for peak torque. They conflict — higher N helps torque and inertia ratio but hurts efficiency and backdrivability and adds fatigue cycles. The answer is the negotiated middle, read against the motor's torque-speed curve. **Why does my harmonic drive feel inefficient on cold mornings?** Cold lubricant is much more viscous, which spikes the no-load running torque of a strain-wave drive. Combined with the fact that harmonic efficiency already drops steeply at low load, a cold drive at light load can dip well under 50% efficiency until it warms up. Size and budget battery for the cold-start operating point if you run outdoors. **Which gearbox handles shock loads best?** Cycloidal RV, decisively. Momentary overload ratings around 5× rated are typical because load is shared across many pins/rollers against a thick steel disc. Planetary tooth fracture and harmonic flexspline ratcheting both happen at lower multiples (~2–3.5×). 
If your robot collides or e-stops with significant payload inertia, that shock rating — not the continuous torque — is often the real sizing constraint. **Is zero backlash always worth paying for?** No. Zero backlash costs money, often costs efficiency, and is wasted if a load-side encoder can correct the position error. Spend the precision budget on the axes that actually set the tool point, and accept 3–6 arc-min planetary backlash elsewhere. Buying 1 arc-min everywhere is a common, expensive mistake. **How long do robot gearboxes last?** Harmonic drives are rated in hours computed from your average load torque and input speed — commonly several thousand to tens of thousands of hours of actual operation depending on duty. Planetary and cycloidal are governed by bearing L10 and gear fatigue. All of them last longer if you stay within the momentary peak ratings, keep them lubricated, and avoid contamination. A single hard overload can quietly halve the remaining life. **Do I need a separate joint bearing, or does the gearbox provide it?** Depends on the unit. Cycloidal RV drives and many integrated harmonic actuators include a large output bearing rated for the joint's moment and axial/radial loads, so they *are* the structural joint. Bare planetary gearheads and bare harmonic component sets usually do not — you must add a cross-roller or similar bearing to carry the link loads, or you'll overload the gearbox internals. **What about planetary for a robot — is it ever the precision choice?** Yes, for cost-sensitive joints, wheel/traction drives, and anywhere 1–6 arc-min is adequate (most positions, when closed-loop). Preloaded precision planetary from Wittenstein alpha, Neugart, or Apex Dynamics can reach ≤1 arc-min if you genuinely need it. Planetary is also the right base for QDD force-control actuators because of its good backdrivability at low ratio. ## Changelog - **2026-06-11** — Initial publication. 
--- # Robot Simulation & Digital Twins: The Ultimate Guide URL: https://blog.robo2u.com/posts/robot-simulation-digital-twin-ultimate-guide/ Published: 2026-06-10 Updated: 2026-06-20 Tags: robot-simulation, digital-twin, gazebo, isaac-sim, mujoco, sim-to-real, physics-engine, domain-randomization, guide Reading time: 39 min > A working roboticist's deep guide to robot simulation and digital twins in 2026: physics engines, Gazebo vs Isaac Sim vs MuJoCo vs PyBullet, GPU-parallel sim, sensor models, the reality gap, sim-to-real, and how to choose. Every robot you have ever shipped was simulated first, whether you admit it or not. The cheap version is a spreadsheet of torque-speed curves and a back-of-the-envelope battery estimate. The expensive version is a multi-body dynamics engine running a contact solver at 1 kHz, feeding synthetic lidar returns and camera frames into the exact same ROS 2 stack that will run on the robot. The gap between those two is the subject of this guide. This is about **robot simulation** — modeling a robot and its environment in software well enough to design, test, and *train* on it — and its overhyped cousin, the **digital twin**. We will start from why you simulate at all, go down into the physics engines (rigid-body dynamics, the contact problem, solvers, timestep), compare the simulators engineers actually run (Gazebo, NVIDIA Isaac Sim and Isaac Lab, MuJoCo, PyBullet, Webots, CoppeliaSim), look at fidelity-versus-speed and the real-time factor, work through sensor and rendering simulation, then the thing that changed robot learning — **GPU-accelerated massively-parallel sim** — and finally the hard part: the **reality gap**, sim-to-real, what a digital twin actually is versus what the marketing says, and when the simulator is quietly lying to you. **The take**: in 2026 simulation is not optional and it is not one tool. 
You will run *at least two* simulators — a high-throughput GPU sim (Isaac Lab or MuJoCo) to **train** policies on millions of trajectories, and a higher-fidelity, ROS-native sim (Gazebo or Isaac Sim) to **integrate and regression-test** the full software stack before it touches hardware. The single biggest source of sim-to-real failure is not the renderer and not the robot model; it is **contact and friction**, because that is the one part of the physics every engine approximates differently and none gets exactly right. Spend your fidelity budget there. And stop calling an offline simulation a "digital twin" — a twin is *synchronized with a real asset in real time*, and if yours is not, it is just a sim with a nicer dashboard. Companion reading: [reinforcement learning for robotics](/posts/reinforcement-learning-robotics-ultimate-guide/), [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/), [ROS 2](/posts/ros2-ultimate-guide/), [legged & quadruped robot hardware](/posts/legged-quadruped-robot-hardware-ultimate-guide/), [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/), and [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why simulate at all](#why-simulate) 3. [Physics engines: rigid-body dynamics](#physics-engines) 4. [The contact problem (why sims disagree)](#contact) 5. [The major simulators compared](#simulators) 6. [Fidelity vs speed and the real-time factor](#fidelity-speed) 7. [Rendering and sensor simulation](#sensors) 8. [GPU-accelerated massively-parallel sim](#gpu-parallel) 9. [The reality gap and sim-to-real](#sim-to-real) 10. [Digital twins: what the word actually means](#digital-twins) 11. [When the simulation lies](#sim-lies) 12. [Validation and CI in simulation](#validation) 13. [Selecting a simulation stack](#selecting) 14. 
[Frequently asked questions](#faq) ## Key takeaways - **Simulation buys you cost, safety, scale, data, and regression coverage.** A crash costs a render frame, not a 40 kg robot. You can run 4,096 environments in parallel, generate labeled data for free, and re-run the same nightly test suite forever. That portfolio is why every serious robotics program simulates. - **The physics engine is the heart, and the contact model is the heart of the physics engine.** Rigid-body dynamics is well understood and engines mostly agree on free-flight motion. They disagree — sometimes wildly — the instant bodies touch, because contact and friction are non-smooth, constraint-based, and solved approximately. - **Timestep and solver choice dominate stability and accuracy.** A stiff contact at a 10 ms step explodes; the same contact at 1 ms (or with an implicit solver and soft constraints) behaves. Smaller steps cost linearly in compute. There is no free fidelity. - **The big simulators split by job.** Gazebo (Harmonic/Ionic) is the ROS-native integration sim; Isaac Sim is the high-fidelity rendering + PhysX sim; Isaac Lab and MuJoCo are the GPU/learning workhorses; PyBullet is the fast, hackable research default; Webots and CoppeliaSim are batteries-included all-rounders. - **Real-time factor (RTF) is the number to watch.** `RTF = sim_time / wall_time`. RTF > 1 means faster than reality; < 1 means slower. A high-fidelity contact-heavy scene can drop below 0.1 RTF on a CPU; a GPU parallel sim can hit *thousands* of times real-time in aggregate. - **GPU massively-parallel sim changed robot learning.** Running thousands of environments on one GPU (Isaac Lab, MuJoCo MJX/Playground) collapsed quadruped and manipulation training from weeks on CPU clusters to hours on one workstation. That is the reason 2020s legged robots learned to walk in sim. 
- **Sensor simulation is a separate fidelity axis from dynamics.** Cameras (rasterized or ray-traced), depth, lidar (ray-cast with intensity/dropout), IMU (bias + noise), and contact sensors each need their own noise models. A perfect dynamics sim with noise-free sensors still won't transfer. - **The reality gap is the difference between sim and reality, and it is mostly unmodeled dynamics.** Friction, actuator lag, backlash, compliance, sensor latency, and contact stiffness are where it lives. You close it with **system identification**, **domain randomization**, and **domain adaptation** — usually all three. - **Domain randomization works because it makes reality look like one more random sample.** Randomize masses, frictions, latencies, textures, and lighting widely enough and the real world falls inside the training distribution. It trades peak sim performance for robustness, and that trade is almost always correct. - **A digital twin is *synchronized with a real asset*, not just a model of one.** The defining feature is a live data link from the physical robot/cell to the model. An offline simulation, however detailed, is a sim. Most "digital twin" products are sims with telemetry dashboards. - **Simulators lie about contact, deformables, friction, and sensor artifacts.** Rigid-body engines fake deformation, friction cones are linearized, glass/IR interactions are skipped, and rolling shutter is often ignored. Know which lies your sim tells before you trust a result. - **CI in simulation is the highest-leverage practice most teams skip.** Headless sim in a container, deterministic seeds, scripted scenarios, pass/fail metrics — run it on every merge. It catches the regression that would otherwise be found by a robot driving into a wall. - **Pick by job, not by hype.** Need ROS integration testing? Gazebo. Need photoreal sensors and a digital twin of a real cell? Isaac Sim. Need to train a locomotion policy this week? Isaac Lab or MuJoCo. 
Need a quick research prototype? PyBullet. Most real programs run two of these, not one. ## Why simulate at all Before the tools, the motivation. There are five reasons to simulate, and they are not equally important for every team. **Cost.** Robots are expensive and fragile. A 7-kg quadruped that falls off a ledge during a controller bug is a 5,000-USD repair and a week of downtime. In sim that same fall costs you a log file. The asymmetry is enormous on early-stage development where the controller *will* be buggy. **Safety.** Some failures you cannot afford to discover on hardware: a 30 kg industrial arm swinging through where a person stands, a humanoid losing balance near a workbench, a mobile robot at 2 m/s testing its emergency stop. You validate the dangerous envelope in sim first, then narrow the hardware test to the cases that passed. **Scale.** You cannot run 1,000 robots in a lab. You can run 1,000 — or 4,096, or 16,384 — simulated robots on one GPU. Scale matters for two things: statistical coverage of edge cases (run the docking maneuver 10,000 times with randomized start poses) and, more importantly, for learning. **Reinforcement-learning data.** This is the reason simulation went from "useful" to "indispensable" in the last several years. RL needs millions to billions of environment steps. You cannot collect that on hardware — it would take years and destroy the robot. GPU sim generates it in hours. See [reinforcement learning for robotics](/posts/reinforcement-learning-robotics-ultimate-guide/) for the policy side; this guide is the environment side. **Regression testing.** Once a system works, the job becomes *keeping* it working as the code changes. A simulation gives you a repeatable environment to re-run the same scenarios on every commit. This is the least glamorous reason and arguably the highest-value one for a shipping product. 
> **Rule of thumb:** if a test is dangerous, slow to set up, hard to repeat, or needs to run thousands of times, it belongs in simulation. If it depends on the exact physics your sim approximates worst — fine contact, deformables, real sensor noise — keep a hardware version too. What simulation does *not* do is replace hardware testing. It de-risks it, front-loads it, and amplifies it. The teams that get burned are the ones who treat a green sim run as a ship signal. Sim tells you the logic is right and the gross dynamics are plausible. Hardware tells you the truth. ## Physics engines: rigid-body dynamics A physics engine integrates the equations of motion of a system of bodies forward in time. For robots that system is almost always **articulated rigid bodies** — links connected by joints — plus contacts with the ground and objects. The core loop, every timestep `dt`: 1. Compute forces and torques (gravity, actuators, springs, external). 2. Resolve **constraints** (joints keep links connected; contacts keep bodies from interpenetrating). 3. Integrate accelerations to velocities and velocities to positions. The hard part is step 2. Joints are *equality* constraints — relatively easy. Contacts are *inequality* constraints (bodies may push apart but not pull together) plus friction (which is itself a constraint coupling normal and tangential forces). That makes the dynamics **non-smooth**: velocities jump discontinuously at impact, and the system switches between sticking and sliding. Two broad formulations: - **Maximal coordinates.** Each body has 6 degrees of freedom; joints are enforced as constraints. Simple to implement, used by ODE and Bullet historically. Drift in the joint constraints is a real issue and gets stabilized with hacks (Baumgarte stabilization, error-reduction parameters). - **Generalized (reduced) coordinates.** The system state is the joint angles directly; the kinematic tree is built in, so joints can never drift apart. 
MuJoCo, DART, and PhysX's articulation system use this. It is more accurate for articulated robots and is why MuJoCo feels so clean on arms and legs. The solver that resolves the constraints is where engines diverge: - **Projected Gauss-Seidel (PGS)** — iterative, fast, the classic ODE/Bullet approach. Cheap per iteration but converges slowly; under-iterated PGS makes contacts feel spongy and joints slightly loose. - **Sequential impulse** — Bullet's contact solver; impulse-based, robust, fast, the game-physics standard. - **TGS (Temporal Gauss-Seidel)** — PhysX's improved solver (sub-stepping the constraint solve), much better at stiff stacks and high mass ratios. - **Convex optimization / Newton solvers** — MuJoCo solves contact as a convex optimization problem each step, which is why it is stable at large timesteps and high stiffness where PGS would explode. Here is the comparison engineers actually need. | Engine | Coordinates | Contact solver | Strengths | Weaknesses | Used in | |---|---|---|---|---|---| | **ODE** | Maximal | PGS (LCP) | Mature, stable for simple scenes, ROS legacy | Slow, spongy contacts, dated | Gazebo (default historically) | | **Bullet** | Maximal (+ Featherstone) | Sequential impulse / PGS | Fast, broad adoption, soft-body option | Contact stiffness tuning is fiddly | PyBullet, Gazebo, Isaac (early) | | **PhysX 5** | Generalized articulations | TGS | GPU-accelerated, stiff stacks, scales | NVIDIA-centric, less transparent | Isaac Sim / Isaac Lab | | **MuJoCo** | Generalized | Convex (Newton/PGS option) | Best-in-class articulated accuracy & stability, large `dt`, soft contacts | Primitive geoms preferred, smaller sensor suite | DeepMind MuJoCo, MJX | | **DART** | Generalized | LCP / Featherstone | Accurate analytical dynamics, research-grade | Smaller community, slower | Gazebo (optional), research | > **Opinion with reason:** for *articulated-robot* dynamics — arms, legs, humanoids — MuJoCo and PhysX articulations are the right choice 
over ODE/Bullet, because generalized coordinates eliminate joint drift and the modern solvers stay stable at the large stiffness and mass ratios real robots have (a 0.1 kg foot pushing a 30 kg torso). ODE's age shows exactly here. The integration scheme matters too. **Explicit Euler** is cheap and unstable for stiff systems; **semi-implicit (symplectic) Euler** is the common default; **implicit / Runge-Kutta** variants buy stability at the cost of per-step compute. MuJoCo's implicit integration is a big part of why it tolerates a 5 ms step where Bullet wants 1 ms. ## The contact problem (why sims disagree) If you take one idea from this guide, take this: **simulators agree on flight and disagree on contact.** Throw a ball with no spin and every engine gives nearly the same parabola. Drop a stack of blocks, push a box across a floor, or close a gripper on a cylinder, and the engines diverge — sometimes the box slides differently, sometimes the stack topples in one engine and stands in another. Why? Three approximations that every engine makes differently. **1. Contact detection and penetration.** Engines detect contact by collision geometry, then must decide what to do about the small interpenetration that numerically always occurs. *Penalty methods* model contact as a stiff spring-damper (push proportional to penetration depth) — simple but requires tiny timesteps or it oscillates. *Constraint methods* solve for the impulse that exactly prevents penetration (an LCP or convex program) — stable but expensive and approximate when under-iterated. The choice changes how "hard" a floor feels. **2. The friction cone.** Coulomb friction says the tangential force magnitude is bounded by `μ` times the normal force, in *any* tangential direction — a cone. Solving the true cone is a nonlinear problem, so most engines **linearize** it into a pyramid (4 or 8 facets). 
A pyramidized cone makes friction slightly anisotropic: a box pushed at 45° behaves differently from one pushed along an axis. MuJoCo can use an elliptic (true-cone) model, which is one reason its sliding behaves better.

**3. Restitution and simultaneous contacts.** Multiple contacts resolved at once (a box on a floor has 4 corners) are order-dependent in iterative solvers, so the result depends on solver iterations and ordering. Bouncing (restitution) is even less consistent across engines.

The practical consequence:

```text
Same robot, same gripper, same 50 mm cylinder, μ = 0.6:

  Engine A: grasp holds, object stays put
  Engine B: object slowly rotates out of the fingers
  Engine C: object squirts out at contact (penetration recovery impulse)

None is "wrong" — they make different contact approximations.
The policy you train on B may fail on hardware AND on A.
```

This is why contact-rich manipulation has the worst sim-to-real transfer of any robotics task, and why legged locomotion — which is *also* contact-rich but more forgiving because feet are points and gaits self-stabilize — transfers better than you'd expect. It is also why you should never tune a grasp controller to a single engine's contact behavior and call it done.

> **Rule:** treat friction coefficients, contact stiffness, and restitution as **uncertain parameters to randomize**, not as physical constants you can measure once. The number you measure on one surface at one speed is not the number the solver wants.

## The major simulators compared

Seven tools cover almost the entire field. Here is the honest comparison, then notes on each.
| Simulator | Physics | Rendering | GPU parallel | ROS 2 | Best at | Weakness |
|---|---|---|---|---|---|---|
| **Gazebo** (Harmonic/Ionic) | DART (default), Bullet, ODE | OGRE 2 (raster) | No (multi-process) | First-class | ROS integration, system testing, sensors | Not built for massive parallel RL; rendering is functional, not photoreal |
| **Isaac Sim** | PhysX 5 | RTX ray-tracing | Yes | Bridge | Photoreal sensors, digital twins, USD pipelines | Heavy, NVIDIA RTX GPU required, steep setup |
| **Isaac Lab** | PhysX 5 (GPU) | RTX (optional) | Yes (thousands) | Via Isaac Sim | GPU-parallel RL training at scale | Learning-focused; not a general integration sim |
| **MuJoCo / MJX** | MuJoCo (CPU; GPU via MJX or MuJoCo Warp) | Built-in (basic) | Yes (MJX/JAX) | Community | Articulated dynamics accuracy, fast RL, research | Sparse sensor/rendering suite; primitive geoms preferred |
| **PyBullet** | Bullet | OpenGL / TinyRenderer | Limited | Community | Fast prototyping, free, hackable, huge tutorial base | Aging, contact tuning fiddly, no massive parallel |
| **Webots** | Fork of ODE (custom) | OpenGL | No | Bridge | Education, batteries-included robot library, cross-platform | Smaller ecosystem, less used in industry RL |
| **CoppeliaSim** (V-REP) | ODE/Bullet/Vortex/Newton (4 engines) | OpenGL | No | Bridge | Swappable physics, scripting, sensors, prototyping | Closed-core, smaller modern community |

**Gazebo (formerly Ignition), versions Harmonic and Ionic.** The default ROS simulator. If your robot runs ROS 2 and you want to test the *whole stack* — controllers, nav, perception, the lot — against simulated sensors and physics, this is the tool. It is modular (separate physics, rendering, sensor, GUI processes), DART is the default physics, and the sensor simulation is solid. It is *not* the tool for training a policy on 4,096 parallel environments; it was never designed for that. Strength: realism of the *software interface*.
Weakness: throughput and photorealism. **NVIDIA Isaac Sim.** Built on Omniverse and USD (Universal Scene Description), PhysX 5 physics, RTX ray-traced rendering. This is the high-fidelity end: photoreal cameras, physically-based materials, accurate-ish sensor models, and a real path to a digital twin of a physical cell because USD is a proper scene-description and data-interchange format. It is heavy — you need an RTX GPU and patience for setup — but nothing else gives you sensor realism at this level with this much physics behind it. **NVIDIA Isaac Lab** (the successor to Isaac Gym and the older Orbit/Isaac Sim RL workflows). This is the GPU-parallel **learning** framework that sits on Isaac Sim's physics. It runs thousands of environments on a single GPU and is the production path for training locomotion and manipulation policies. Think of Isaac Sim as the simulator and Isaac Lab as the training harness on top of it. **MuJoCo** (DeepMind, open-source since 2021/2022). The connoisseur's choice for articulated-robot dynamics: generalized coordinates, a convex contact solver, stable at large timesteps. **MJX** is the JAX reimplementation that runs on GPU/TPU for massively-parallel RL, and **MuJoCo Playground** is the curated suite of RL environments on top. If you are doing locomotion or whole-body control research, MuJoCo's dynamics fidelity per unit of compute is hard to beat. The trade is a thinner sensor and rendering story. **PyBullet.** The Python binding to Bullet. Free, fast enough, runs anywhere, and has the largest collection of tutorials and research code of any of these. It is the right tool for a quick prototype, a class, or reproducing a paper. It is showing its age against the GPU sims for training and against Isaac Sim for fidelity, but for "I need a robot in a sim by tonight," it still wins. **Webots** (open-source, Cyberbotics). Batteries-included: a big library of robot and sensor models, cross-platform, friendly. 
Heavily used in education and competitions. Custom physics (ODE-derived). A solid all-rounder; less common in industrial RL pipelines. **CoppeliaSim** (formerly V-REP). Notable for letting you swap among four physics engines (ODE, Bullet, Vortex, Newton) in the same scene, strong scripting, good sensor models. A capable prototyping and education tool with a smaller modern community than the others. > **Opinion with reason:** most serious 2026 programs run **two** of these — a GPU sim (Isaac Lab or MuJoCo/MJX) to train, and a ROS-native sim (Gazebo, or Isaac Sim if you need fidelity) to integrate and regression-test. One tool optimized for throughput and one optimized for stack realism. Trying to do both jobs in one simulator is where teams waste months. ## Fidelity vs speed and the real-time factor Every simulation choice is a trade between fidelity and speed, and the single number that captures it is the **real-time factor**. ```text RTF = simulated_time / wall_clock_time RTF = 1.0 → sim runs at real speed (1 sim-second per wall-second) RTF = 10 → 10x faster than reality (great for batch testing) RTF = 0.1 → 10x slower than reality (heavy contact / sensors) ``` Computing it from the timestep and per-step cost: ```text Let dt = physics timestep (e.g. 0.001 s = 1 kHz) t_step = wall time per step (e.g. 0.0002 s = 200 µs) steps_per_sim_second = 1 / dt = 1000 steps wall_time_per_sim_sec = steps * t_step = 1000 * 200e-6 = 0.2 s RTF = 1 / 0.2 = 5.0 → 5x real-time on one CPU core ``` Levers that change `t_step` (and thus RTF): - **Timestep `dt`.** Halving `dt` doubles steps per sim-second → halves RTF. But too large a `dt` and stiff contacts go unstable. This is the central tension. - **Solver iterations.** More PGS iterations = more accurate contacts = slower. Fewer = spongy but fast. - **Collision complexity.** Convex primitives (box, sphere, capsule) are cheap; full triangle meshes are expensive. Decompose meshes into convex hulls. 
- **Sensor rendering.** A 1080p RTX camera at 30 Hz can dominate the entire step budget. Lidar ray-casts scale with beam count. - **Number of bodies and contacts.** Contact count drives solver cost super-linearly in bad cases. A useful mental model of the fidelity-speed spectrum: | Use case | Typical `dt` | Fidelity priority | Target RTF | Tool | |---|---|---|---|---| | RL training (parallel) | 4–10 ms (substepped) | Throughput, "good enough" contact | thousands (aggregate) | Isaac Lab, MJX | | Controller-in-the-loop | 1 ms | Dynamics + actuator model | ~1 (real-time) | MuJoCo, Gazebo | | Full-stack integration | 1–4 ms | Sensor + ROS interface realism | 0.3–2 | Gazebo, Isaac Sim | | Photoreal perception | 1–4 ms | Rendering / sensor realism | 0.05–0.5 | Isaac Sim | | Contact-rich manipulation | 0.5–2 ms | Contact/friction fidelity | 0.1–1 | MuJoCo, Isaac Sim | Note the aggregate RTF for parallel training: a single environment might run at RTF 2, but 4,096 of them in lockstep on one GPU produce an *aggregate* throughput equivalent to thousands of times real-time. That aggregate number is what makes RL tractable, and it is the subject of the next-but-one section. > **Rule:** real-time (RTF ≈ 1) only matters when a *human or real hardware* is in the loop. For batch testing run as fast as you can; for training run as parallel as you can; for hardware-in-the-loop you are pinned to RTF = 1 and must drop fidelity to hit it. ## Rendering and sensor simulation A robot does not perceive ground-truth state; it perceives *sensors*. If your sim hands the policy perfect joint angles and noise-free depth, you have trained on a robot that does not exist. Sensor simulation is a fidelity axis entirely separate from dynamics, and for perception-driven robots it is the *more* important one. **Cameras.** Two rendering paths. **Rasterization** (OGRE in Gazebo, OpenGL in PyBullet/Webots) is fast and fine for geometry and basic appearance. 
**Ray-tracing** (Isaac Sim's RTX) gives physically-based lighting, reflections, soft shadows, and global illumination — which matters when your perception net was trained to expect realistic light. The gap between a rasterized and a ray-traced frame is exactly the gap a vision model notices. **Depth cameras.** Easy to simulate naively (read the depth buffer) and hard to simulate well. Real depth sensors have characteristic artifacts: missing returns on dark/shiny/transparent surfaces, edge fattening, quantization, and — for stereo and structured light — failure in low texture. A depth image without those artifacts is too clean and will not transfer. See [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/) for the real sensor physics you are trying to mimic. **Lidar.** Simulated by ray-casting against the collision/visual geometry: one ray per beam per angular step, returning range. Good lidar sim adds **intensity** (material- and angle-dependent return strength), **dropout** (no return on absorptive or specular surfaces), **range noise** (a few mm to cm), and motion distortion for spinning sensors. GPU ray-casting (Isaac Sim's RTX lidar) makes high-beam-count sensors affordable; CPU ray-casting a 128-beam lidar at 20 Hz is a real cost in Gazebo. **IMU.** The cheapest sensor to simulate badly and a common transfer killer. A real IMU has **bias** (slowly drifting offset), **random walk**, **white noise**, scale-factor error, and misalignment. Integrate a noise-free simulated IMU and your state estimator looks heroic; feed it a properly modeled one and you discover your filter tuning was fantasy. Model bias and noise, and randomize them. **Contact and force/torque sensors.** As accurate as the contact solver, which — per the contact section — means treat them with suspicion for absolute values and trust them more for *events* (contact made/broken) than magnitudes. 
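The IMU advice above (model bias and noise, and randomize them) can be sketched as follows. This is a minimal single-axis gyro model under stated assumptions: white noise plus a bias that follows a random walk, discretized the usual way (per-sample white-noise sigma scales with `1/sqrt(dt)`, per-step bias sigma with `sqrt(dt)`). The class name, parameter names, and numeric ranges are illustrative only, not taken from any datasheet or simulator API, and scale-factor error and misalignment are left out.

```python
import math
import random


class GyroNoiseModel:
    """Single-axis gyro corruption: truth + slowly drifting bias + white noise."""

    def __init__(self, noise_density, bias_random_walk, dt,
                 initial_bias=0.0, seed=None):
        # noise_density in rad/s/sqrt(Hz); bias_random_walk in rad/s^2/sqrt(Hz).
        self.white_sigma = noise_density / math.sqrt(dt)
        self.bias_step_sigma = bias_random_walk * math.sqrt(dt)
        self.bias = initial_bias
        self.rng = random.Random(seed)

    def corrupt(self, true_rate):
        """Return one noisy sample for a ground-truth angular rate (rad/s)."""
        self.bias += self.rng.gauss(0.0, self.bias_step_sigma)
        return true_rate + self.bias + self.rng.gauss(0.0, self.white_sigma)


def sample_randomized_gyro(dt, seed):
    """Draw a fresh noise model per episode (assumed randomization ranges)."""
    rng = random.Random(seed)
    return GyroNoiseModel(
        noise_density=rng.uniform(1e-4, 5e-4),
        bias_random_walk=rng.uniform(1e-6, 1e-5),
        dt=dt,
        initial_bias=rng.uniform(-0.01, 0.01),
        seed=rng.randrange(2**31),
    )


# A stationary robot: the true rate is zero, yet the integrated heading drifts.
dt = 0.005
gyro = sample_randomized_gyro(dt=dt, seed=7)
heading_error = 0.0
for _ in range(2000):  # 10 s of samples at 200 Hz
    heading_error += gyro.corrupt(0.0) * dt
print(f"heading drift after 10 s at rest: {heading_error:.5f} rad")
```

Drawing a new model per episode means the state estimator never sees the same bias twice, which is the sensor-side counterpart of randomizing friction and mass on the dynamics side.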
A compact view of what to model: | Sensor | Cheap to fake | Must model for transfer | |---|---|---| | RGB camera | Geometry, color | PBR lighting, exposure, motion blur, lens distortion, sensor noise | | Depth | Depth buffer | Dropouts on shiny/dark/clear, edge artifacts, quantization | | Lidar | Range via ray-cast | Intensity, dropout, range noise, motion distortion | | IMU | Ground-truth accel/gyro | Bias, random walk, white noise, scale/misalignment | | Wheel encoder | Joint angle | Quantization, slip, backlash | | Force/torque | Solver contact force | Solver-dependent magnitudes — trust events over values | > **Opinion with reason:** for perception-driven robots, spend your fidelity budget on **sensor noise models before renderer photorealism.** A perfectly ray-traced but noise-free depth image transfers worse than a rasterized one with realistic dropouts, because the policy learns to trust depth edges that the real sensor never produces. Noise models are cheap and high-leverage; photorealism is expensive and only pays off for appearance-based perception. ## GPU-accelerated massively-parallel sim This is the development that changed robot learning, so it gets its own section. The old way: one simulation per CPU core. A workstation with 32 cores runs 32 environments. To collect the ~10⁹ environment steps a locomotion policy needs, you rented a CPU cluster and waited days to weeks. Robot RL was a big-lab activity because the data collection was a big-lab cost. The new way (Isaac Gym → **Isaac Lab**, and **MuJoCo MJX**): put thousands of independent environments on a single GPU, stepping them all in lockstep as batched tensor operations, with observations and actions never leaving GPU memory. The simulation, the neural-network policy, and the gradient updates all live on the same device. No CPU-GPU transfer bottleneck. 
The throughput math is the whole story:

```text
Single CPU env:  ~1,000–5,000 steps/s per core
32 cores         ≈ 100k steps/s

GPU parallel (one modern data-center / high-end GPU):
  N = 4,096 environments
  per-env step rate ≈ 5,000 steps/s (substepped, simplified contact)
  aggregate ≈ N * 5,000 ≈ 20,000,000 steps/s
  → ~200x the 32-core CPU node, on one GPU.
```

```text
Wall-clock to collect 1e9 steps:
  One 32-core CPU node (100k steps/s):  1e9 / 1e5 = 10,000 s ≈ 2.8 hours
    of pure stepping per node (end-to-end runs still needed many nodes and days)
  GPU parallel (2e7 steps/s):           1e9 / 2e7 = 50 s

Quadruped locomotion that took days now trains in minutes-to-hours.
```

That collapse, from days to hours, is why the 2020s wave of legged robots (and now humanoids) learned to walk, run, and recover in simulation. The famous ANYmal and quadruped results, and the locomotion stacks behind today's commercial quads and humanoids, were trained this way: thousands of parallel environments, heavy domain randomization, then zero-shot transfer to hardware. See [legged & quadruped robot hardware](/posts/legged-quadruped-robot-hardware-ultimate-guide/) and [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/) for the machines, and [reinforcement learning for robotics](/posts/reinforcement-learning-robotics-ultimate-guide/) for the algorithms that consume this firehose of data.

The catch: GPU sim trades contact fidelity for throughput. To run thousands of environments fast you simplify collision geometry, substep the solver, and accept softer contacts. That is fine for locomotion (gaits are robust) and acceptable for many manipulation tasks with enough domain randomization, but it is *not* the tool for validating a delicate contact interaction. Train on the GPU sim, then validate the trickiest contacts on a higher-fidelity sim or hardware.
> **Rule:** use GPU-parallel sim to *train* (throughput is king, fidelity is "good enough + randomization"); use a higher-fidelity sim to *validate* the contact-critical cases the fast sim glosses over. They are different jobs. ## The reality gap and sim-to-real The **reality gap** is the difference between your simulation and the real world. A policy or controller that works in sim and fails on hardware fell into the gap. Closing it is the central engineering problem of simulation-based development. Where the gap actually lives — ranked by how often it bites: 1. **Contact and friction** (the contact section — this is #1 for a reason). 2. **Actuator dynamics.** Real motors have torque limits, current limits, electrical and mechanical lag, gearbox backlash, and friction. A sim that commands ideal torque instantly is modeling a motor that does not exist. Model the actuator (a first-order lag plus torque saturation is a cheap, high-value start). 3. **Latency.** Sensing-to-actuation delay in sim is often zero; on hardware it is 1–20 ms through the stack. A controller tuned with zero latency can be unstable with real latency. 4. **Compliance and flexibility.** Real links flex, real joints have series elasticity, real cables tug. Rigid-body sim assumes none of it. 5. **Sensor noise and artifacts** (the sensor section). 6. **Mass and inertia errors.** Your CAD-derived inertia is wrong by some percent; the real robot's mass distribution shifted when someone added a cable harness. Three families of technique close the gap, and mature programs use all three. **System identification (sysID).** Make the sim match *this* robot by measuring real parameters and fitting the model: run the real actuator through a chirp, fit the motor model; measure the real friction and inertia; calibrate sensor noise. SysID narrows the gap by making the sim center on reality. It is necessary but never sufficient — you cannot measure everything, and parameters drift. 
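The actuator and latency items in the list above can be sketched in a few lines. This is a minimal sketch, assuming a torque-controlled joint: the class name `LaggedActuator`, the 10 ms time constant, the 20 N·m limit, and the 5-step delay are illustrative values rather than measurements from any particular motor, and backlash and friction are left out. Fit the numbers to your hardware, for example from the chirp test described above.

```python
from collections import deque


class LaggedActuator:
    """Torque source with command delay, saturation, and a first-order lag."""

    def __init__(self, dt, time_constant, max_torque, delay_steps=0):
        self.dt = dt
        # Backward-Euler gain of a first-order low-pass with this time constant.
        self.alpha = dt / (time_constant + dt)
        self.max_torque = max_torque
        self.torque = 0.0
        # FIFO of pending commands models sensing-to-actuation latency.
        self.pending = deque([0.0] * delay_steps)

    def step(self, commanded_torque):
        # The command issued now only takes effect delay_steps later.
        self.pending.append(commanded_torque)
        delayed = self.pending.popleft()
        # The motor cannot exceed its torque limit, whatever was requested.
        clipped = max(-self.max_torque, min(self.max_torque, delayed))
        # The delivered torque approaches the clipped request exponentially.
        self.torque += self.alpha * (clipped - self.torque)
        return self.torque


# 1 kHz sim step, 10 ms motor lag, 20 N*m limit, 5 ms of stack latency.
motor = LaggedActuator(dt=0.001, time_constant=0.010, max_torque=20.0,
                       delay_steps=5)
# The policy asks for 50 N*m for 100 ms; an ideal actuator would deliver it instantly.
delivered = [motor.step(50.0) for _ in range(100)]
```

In this run the joint sees zero torque for the first 5 ms, then a ramp that settles near the 20 N·m limit instead of the 50 N·m requested. A controller tuned against the ideal version of this motor has never experienced either effect.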
**Domain randomization (DR).** Instead of one precise sim, train across a *distribution* of sims: randomize masses (±10–30%), friction coefficients (e.g. 0.4–1.2), actuator gains, latencies (0–20 ms), sensor noise, and — for vision — textures, lighting, and camera pose. The policy that survives all of them treats the real world as just one more sample from the training distribution. DR is the workhorse of modern sim-to-real and the reason zero-shot transfer works at all. **Dynamics randomization** is DR applied specifically to the physics parameters (mass, friction, damping, latency) as opposed to the visuals. **Visual domain randomization** randomizes appearance so a vision policy ignores texture and lighting it will never see again. Both matter; which dominates depends on whether your policy is proprioceptive (legs) or perceptive (vision-based manipulation). **Domain adaptation.** When randomization alone leaves a gap, adapt: fine-tune on a little real data, learn a model that maps sim observations to real ones (or vice versa), or use online system identification where the policy infers the real dynamics parameters from a short history and adjusts. "Rapid motor adaptation" and similar techniques — estimate the environment's latent parameters on the fly — are how the best legged policies handle terrain and payloads they never saw in training. ```text The sim-to-real recipe that actually works in 2026: 1. sysID the big things → center the sim on the real robot 2. model the actuator → lag + torque/current limits + backlash 3. add latency → match the real sensing→actuation delay 4. domain-randomize wide → mass, friction, gains, latency, noise, visuals 5. train at scale → GPU parallel, millions–billions of steps 6. adapt online (optional)→ infer latent dynamics, adjust on hardware 7. 
validate on hardware → narrow the gap on the cases that still fail ``` > **Opinion with reason:** if you can only do two things, do **actuator modeling** and **wide domain randomization.** Actuator modeling fixes the most common single cause of "works in sim, falls over on hardware," and wide DR buys robustness to everything you failed to model. Photorealistic rendering is a distant third for anything that isn't vision-dominated. ## Digital twins: what the word actually means "Digital twin" is the most abused term in the field, so let's be precise. A **digital twin** is a virtual model of a *specific* physical asset that is **kept synchronized with that asset in real time** via a live data link. The defining property is the synchronization: telemetry flows from the physical robot/cell into the model, and (often) commands or predictions flow back. The twin reflects the *current state* of *that one* machine — its wear, its calibration, its current payload — not a generic model of its type. Contrast with a plain **simulation**: a model of a robot or cell used offline for design, testing, or training. It might be extremely detailed. It is not a twin, because it is not synchronized with a specific live asset. 
The useful distinction is the data link: | | Offline simulation | Digital twin | |---|---|---| | Tied to a specific physical asset | No (a model of a *type*) | Yes (a model of *that unit*) | | Live data sync | No | Yes — continuous telemetry | | Reflects wear/calibration/state | No | Yes | | Primary use | Design, test, train | Monitor, predict, optimize *that asset* | | Runs when asset is off | Yes | Usually paired with the running asset | What a real digital twin is good for: **predictive maintenance** (the twin runs ahead of the real machine and flags an impending bearing failure), **what-if on the live system** (test a new cycle on the twin before pushing it to the running cell), **anomaly detection** (real telemetry diverges from twin prediction → something is wrong), and **operator training / monitoring** on the actual deployed configuration. The honest take: most products marketed as "digital twins" are **offline simulations with a telemetry dashboard.** That is still useful — a good sim of your cell plus a live data view is valuable — but if there is no real-time model running in step with the physical asset and being corrected by its data, it is not a twin in the meaningful sense. Isaac Sim with USD is one of the few stacks built to do the real thing, because USD is a proper bidirectional scene/data format and Omniverse is designed for live synchronization. Gazebo can be wired into a twin-like loop with ROS 2 telemetry, but you are building the sync layer yourself. > **Rule:** before you call something a digital twin, ask "what is the live data link, and does the model state change when the real asset's state changes?" No link, no twin. It's a sim — which is fine, just name it correctly. ## When the simulation lies Every simulator lies. The professional skill is knowing *which* lies yours tells so you don't trust a result it can't support. **Contact lies.** Already covered, and the biggest one. 
Stacking, grasping, pushing, and any task where the *exact* contact behavior matters is suspect. The friction your gripper relies on, the precise moment a foot slips, the way a peg jams in a hole — these are where rigid-body engines are weakest. **Deformables lie.** Cables, fabric, foam, food, skin, soft grippers — rigid-body engines either skip them or fake them with simplified models (mass-spring, position-based dynamics, or finite-element add-ons that are slow). If your task involves a deformable object and your sim is a rigid-body engine, the sim's behavior is decorative. Specialized FEM/soft-body sims exist but are slow and narrow. **Friction lies.** Coulomb friction with a single coefficient is a model, not reality. Real friction is velocity-dependent (static > kinetic), surface-dependent, contamination-dependent, and wears over time. The linearized friction cone (the pyramid) adds directional bias on top. Never trust a single friction number. **Sensor artifact lies.** Default sensors are too clean. Depth has no dropouts, cameras have no motion blur or rolling shutter, lidar has no intensity falloff, IMUs have no bias. Each missing artifact is a way the real sensor will surprise your perception stack. **Numerical lies.** Energy can leak or be injected by the integrator; under-iterated solvers make joints feel loose; large timesteps make stiff contacts bouncy or unstable; penetration-recovery impulses launch objects ("the object squirts out"). These are artifacts of *how* the sim computes, not of any physics. **The determinism trap.** A sim can be perfectly deterministic — same seed, same result — and perfectly wrong. Determinism is great for CI and debugging; it is not evidence of physical accuracy. A reproducible lie is still a lie. > **Rule:** maintain a written list of "things our sim does not model" (deformables, exact friction, sensor X's artifact, cable drag) and gate every sim-only claim against it. 
The result you should distrust most is the one that depends on the physics your engine approximates worst. ## Validation and CI in simulation Simulation's most underused superpower is **continuous integration.** A sim is a repeatable environment; a repeatable environment is testable; a testable system can be guarded against regressions automatically. Most teams build a sim and never wire it into CI. That is leaving the best value on the table. What a sim CI pipeline looks like: - **Headless, containerized sim.** No GUI, runs in a Docker container on a CI runner. Gazebo runs headless cleanly; Isaac Sim has headless modes; MuJoCo/PyBullet are trivial to run headless. - **Deterministic seeds.** Fix the random seed so a failure is reproducible. (Remember the determinism trap: this makes the test repeatable, not physically authoritative.) - **Scripted scenarios.** "Navigate from A to B avoiding the obstacle," "pick the part from this pose," "recover from this push." Each scenario is a test case. - **Quantitative pass/fail metrics.** Not "did it look right" but "final position error < 5 cm," "no collision events," "task completed within 12 s," "joint torque stayed under limit." Numbers, with units, and thresholds. - **Run on every merge.** The point is to catch the regression in the PR, not in the field. A staged validation ladder, cheapest to most expensive: 1. **Unit / logic tests** — no physics, just code. Milliseconds. 2. **Fast sim regression** — PyBullet/MuJoCo headless, scripted scenarios, deterministic. Seconds to minutes. Runs on every commit. 3. **Full-stack sim** — Gazebo or Isaac Sim with the real ROS 2 stack and realistic sensors. Minutes. Runs nightly or per-merge on key branches. See [ROS 2](/posts/ros2-ultimate-guide/) for the stack this exercises. 4. **Hardware-in-the-loop (HIL)** — real controller/compute, simulated plant, RTF pinned to 1. Catches timing and latency bugs sim misses. 5. **Hardware test** — the truth. 
Reserved for what passed everything above. The reason to invest here is the same as for any test suite: it converts "we think it still works" into "we know it still works, here's the green run." For robotics that conversion is worth more than usual, because the alternative way to discover a regression is a robot driving into a wall. > **Opinion with reason:** put a *fast deterministic sim regression suite* in CI before you build anything fancier. It is the cheapest tier and catches the most bugs per dollar — logic errors, broken interfaces, obvious controller breakage — long before you spend GPU time on a photoreal twin. ## Selecting a simulation stack Choose by the job in front of you. The honest decision tree: **"I need to test my ROS 2 stack against simulated sensors and physics."** → **Gazebo (Harmonic or Ionic).** First-class ROS 2 integration, good sensor sim, DART physics. The default for system and integration testing. **"I need to train a locomotion or manipulation policy with RL, fast."** → **Isaac Lab** (if you have NVIDIA RTX hardware and want the full Omniverse ecosystem) or **MuJoCo MJX / Playground** (if you want open-source, cleaner articulated dynamics, and JAX). Both give GPU-parallel throughput. See [reinforcement learning for robotics](/posts/reinforcement-learning-robotics-ultimate-guide/). **"I need photoreal sensors and/or a real digital twin of a physical cell."** → **Isaac Sim.** RTX rendering, PhysX 5, USD pipeline, the only one of these built for live synchronization at scale. Budget for the GPU and the setup time. **"I need a quick prototype, a teaching tool, or to reproduce a paper."** → **PyBullet.** Free, fast, hackable, enormous tutorial base. Or **MuJoCo** if the paper used it (much robotics RL research does). **"I want batteries-included with a big robot library for education or competition."** → **Webots** or **CoppeliaSim.** A selection matrix on the axes that actually decide it: | If your priority is... 
| Pick | |---|---| | ROS 2 integration & system testing | Gazebo | | GPU-parallel RL training | Isaac Lab or MuJoCo MJX | | Articulated-dynamics fidelity / research | MuJoCo | | Photoreal sensors & digital twins | Isaac Sim | | Fast free prototyping | PyBullet | | Education, batteries-included | Webots / CoppeliaSim | | Swappable physics engines in one scene | CoppeliaSim | And the meta-decision most teams get wrong: > **Opinion with reason:** do not try to make one simulator do every job. Run a GPU sim for training and a ROS-native sim for integration. The cost of running two tools is far lower than the cost of fighting a training framework to do integration testing, or an integration sim to do parallel RL. Specialize the tools; share the robot model (URDF/USD/MJCF) across them as much as you can, and budget for the fact that model formats and contact behavior will not perfectly match between them, which is itself a small reality gap to manage. The model-format reality: **URDF** is the ROS lingua franca (Gazebo, and importable elsewhere), **MJCF** is MuJoCo's native format, and **USD** is the Isaac/Omniverse format. Converters exist and mostly work for kinematics and visuals; they do *not* reliably carry contact parameters, friction, and actuator models across. Re-tune physics per simulator. Treat a clean cross-tool import as a bonus, not a guarantee. ## Frequently asked questions **Which simulator should a beginner start with?** PyBullet for the gentlest on-ramp (free, Python, huge tutorial base), or Gazebo if you are already in ROS 2. Move to MuJoCo or Isaac Lab once you hit RL and need throughput. Starting with Isaac Sim is a steep first climb unless photorealism or a digital twin is the actual goal. **Is Gazebo the same as Ignition?** Yes. The project formerly called Ignition Gazebo was renamed back to "Gazebo" (the original Gazebo Classic is now legacy). Current releases are named in alphabetical order: Harmonic and Ionic are the recent ones.
If a tutorial says "Ignition," it means modern Gazebo. **Why do my grasp results differ between PyBullet and Isaac Sim?** Different physics engines (Bullet vs PhysX), different contact and friction models, different solver settings, and likely different friction parameters after import. Contact-rich tasks are exactly where engines disagree most. Re-tune friction and contact stiffness per engine and never assume a grasp tuned in one transfers to another, let alone to hardware. **Do I really need a GPU for robot simulation?** Not for everything. Gazebo, PyBullet, MuJoCo (CPU), Webots, and CoppeliaSim run fine on CPU for single-environment integration and prototyping. You need a GPU for two things: photoreal rendering (Isaac Sim's RTX) and GPU-parallel RL training (Isaac Lab, MuJoCo MJX). If you're doing large-scale RL, the GPU is not optional. **What timestep should I use?** Start at 1 ms (1 kHz) for contact-rich or stiff systems; you can often go to 2–5 ms with MuJoCo's stable solver, or substep in PhysX/Isaac. If contacts get bouncy, joints feel loose, or the sim explodes, the timestep is too large or the solver is under-iterated. Compute cost scales inversely with `dt`: halving the timestep doubles the solver steps per simulated second and roughly halves RTF. **How do I actually close the reality gap?** In order: sysID the big parameters, model the actuator (lag + torque/current limits + backlash), add realistic sensing-to-actuation latency, run wide domain randomization over masses/frictions/gains/latency/noise, train at scale, and optionally adapt online. SysID centers the sim on your robot; randomization makes the policy robust to what you couldn't measure. Then validate on hardware. **Is domain randomization always the right move?** For sim-to-real transfer of learned policies, almost always yes: it trades a little peak sim performance for robustness, which is the correct trade for deployment. The exception is when you have a very accurate model and a precise, repeatable environment (some industrial cells), where tight sysID can beat wide randomization.
For anything operating in the messy real world, randomize. **Can a digital twin replace hardware testing?** No. Even a real, synchronized twin is a model corrected by data; it cannot discover physics it doesn't model. A twin reduces, predicts, and monitors — it does not eliminate the need to validate on the physical asset. Anyone selling a twin as a hardware-test replacement is overselling. **Why does MuJoCo feel more stable than ODE or Bullet?** Generalized coordinates (joints can't drift apart) plus a convex contact solver and implicit integration. That combination stays stable at larger timesteps and at the high stiffness and mass ratios real articulated robots have, where iterative PGS solvers in maximal coordinates struggle. It's a genuinely better fit for arms, legs, and humanoids. **What's the difference between Isaac Sim, Isaac Gym, and Isaac Lab?** Isaac Sim is the full simulator (PhysX + RTX + USD). Isaac Gym was the original standalone GPU-parallel RL environment (now deprecated). Isaac Lab is the current GPU-parallel learning framework, built on Isaac Sim's physics, that replaced Isaac Gym and the earlier Orbit workflow. For new RL work, use Isaac Lab. **How fast can simulation actually run?** A single contact-heavy, sensor-rich environment can run *below* real-time (RTF < 0.1). A simple environment runs many times real-time on one CPU core. GPU-parallel sim runs thousands of environments at once, for an aggregate throughput equivalent to thousands of times real-time — which is why RL data collection that used to take days now takes hours. **Should sensor noise be modeled even for non-learning controllers?** Yes, if perception feeds the controller. A state estimator or perception stack tuned against noise-free simulated sensors is tuned against a fantasy. At minimum model the noise and bias of the sensors your control loop depends on, so your filter tuning and failure handling face something resembling reality. 
## Changelog - **2026-06-10** — Initial publication. --- # Motor Controllers & Field-Oriented Control (FOC): The Ultimate Guide URL: https://blog.robo2u.com/posts/motor-controllers-foc-ultimate-guide/ Published: 2026-06-09 Updated: 2026-06-20 Tags: motor-controllers, foc, field-oriented-control, esc, vesc, odrive, svpwm, clarke-park, power-electronics, guide Reading time: 35 min > A deep, practical guide to motor controllers and Field-Oriented Control: the three-phase power stage, Clarke/Park math, the current-velocity-position cascade, sensorless observers, tuning, and picking ODrive vs Moteus vs VESC vs industrial servo drives. A motor by itself is a dumb electromagnet. It is the controller — the box of MOSFETs, the current sensors, and a few kilobytes of fast-loop firmware — that decides whether your three-phase machine behaves like a screaming RC drone motor or a precision servo that holds 0.01° under load. The motor sets the ceiling on torque and speed; the controller decides how much of that ceiling you actually reach, and how gracefully. This guide is about that controller, and specifically about Field-Oriented Control (FOC), the algorithm that turns a synchronous AC machine into something you can command like a DC motor. We will go through the power stage transistor by transistor, derive the Clarke and Park transforms (correctly, with the conventions stated), walk the current→velocity→position cascade, deal with the rotor-position problem and sensorless observers, and then get concrete about real hardware: ODrive, Moteus, VESC, SimpleFOC, and the industrial drives from Copley, Elmo, and Kollmorgen. **The take**: FOC is not exotic anymore — it is the default for any brushless machine where you care about torque quality, efficiency, or quiet operation, and a $50 board now runs the same dq-frame control loop that cost $3,000 a decade ago. 
What still separates a good drive from a bad one is not the math (everyone has the math) but the power stage, the current sensing, the loop rate, and the protection. Get those right and FOC is almost boring. Get them wrong and no amount of clever control hides a noisy current sensor or a 50 µs loop. Companion reading: [brushless DC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [servo motors](/posts/servo-motors-ultimate-guide/), [encoders](/posts/encoders-ultimate-guide/), [real-time control systems](/posts/real-time-control-systems-ultimate-guide/), and [robot actuators](/posts/robot-actuators-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What a motor controller actually does](#what-it-does) 3. [The power stage: inverter, transistors, gate drivers, sensing](#power-stage) 4. [Commutation methods: six-step vs sinusoidal vs FOC](#commutation) 5. [FOC explained properly: Clarke, Park, and the dq frame](#foc-math) 6. [The control cascade: current, velocity, position](#cascade) 7. [Rotor position and the sensor problem](#position) 8. [PWM, switching, dead-time, and field weakening](#pwm) 9. [Tuning a FOC drive](#tuning) 10. [The drive ecosystem: hobby vs industrial](#ecosystem) 11. [Communication and real-time interfaces](#comms) 12. [Protection and fault handling](#protection) 13. [Choosing a controller for your robot](#choosing) 14. [Frequently asked questions](#faq) ## Key takeaways - A motor controller turns a DC bus plus a torque/velocity/position command into three coordinated phase currents. The motor provides the torque capability; the controller provides the *control*. - **Six-step (trapezoidal) commutation** is cheap and fine for fans, pumps, and drones at speed. It produces ~14% torque ripple and is poor at low speed. **FOC** produces smooth torque from zero speed and is the right default for any servo-grade application. 
- FOC's whole trick is a coordinate transform: **Clarke** (3-phase → 2-axis stationary αβ) then **Park** (αβ → rotor-synchronous dq). In the dq frame the AC quantities become DC, so two ordinary PI loops can regulate torque (q-axis) and flux (d-axis). - For a non-salient PMSM you run **Id = 0** (all current makes torque) until you hit the voltage ceiling, then **field weakening** drives Id negative to go faster at the cost of torque. - The control structure is a **cascade**: an inner current/torque loop (kHz–tens of kHz), a middle velocity loop, and an outer position loop. Each outer loop should be roughly **5–10× slower** in bandwidth than the one inside it. - **Current-loop gains come straight from motor R and L.** With pole-zero cancellation, `Kp = L·ωc` and `Ki = R·ωc`, where `ωc` is your target current-loop bandwidth in rad/s. This is the single most useful equation in the guide. - Knowing the **rotor angle** is non-negotiable for FOC. Use an encoder or absolute sensor when you can; sensorless back-EMF/observer schemes work well at speed but struggle at zero and low speed. - **Switching frequency** (typically 8–60 kHz) trades switching loss against current ripple and control bandwidth. **Dead-time** (0.1–2 µs) prevents shoot-through but distorts the output and needs compensation. - Hobby/robotics drives (**ODrive, Moteus, VESC, SimpleFOC**) now deliver real FOC at $50–$250. Industrial drives (**Copley, Elmo, Kollmorgen**) add deterministic fieldbus, certified safety, and support — at 5–20× the price. - **Protection makes or breaks reliability**: hardware overcurrent trip, I²t thermal modeling, overtemp, and overvoltage/regen handling (brake resistor or regenerative bus) are not optional on a real machine. - Pick a controller by **bus voltage, continuous and peak phase current, sensor support, comms, and form factor** — in that order. Most failed selections are current or thermal mistakes, not feature mistakes. 
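The Clarke and Park takeaway is easiest to believe once you watch three sinusoids collapse into two constants. The sketch below assumes the amplitude-invariant Clarke convention, a balanced machine (`Ia + Ib + Ic = 0`, so two measured currents suffice), and a d-axis aligned with rotor flux. Other texts and drives use different scalings and sign conventions, so treat the function names and conventions here as assumptions rather than any particular drive's definitions.

```python
import math


def clarke(i_a, i_b):
    """Amplitude-invariant Clarke transform for a balanced three-phase set.

    Only two currents are needed; the third is implied by i_a + i_b + i_c = 0.
    """
    i_alpha = i_a
    i_beta = (i_a + 2.0 * i_b) / math.sqrt(3.0)
    return i_alpha, i_beta


def park(i_alpha, i_beta, theta_e):
    """Rotate stationary alpha-beta quantities into the rotor dq frame."""
    c, s = math.cos(theta_e), math.sin(theta_e)
    i_d = c * i_alpha + s * i_beta
    i_q = -s * i_alpha + c * i_beta
    return i_d, i_q


def phase_currents(amplitude, theta_e, lead=math.pi / 2.0):
    """Balanced sinusoidal currents whose vector leads the d-axis by `lead` rad."""
    angle = theta_e + lead
    i_a = amplitude * math.cos(angle)
    i_b = amplitude * math.cos(angle - 2.0 * math.pi / 3.0)
    return i_a, i_b


# Sweep one electrical revolution in 30-degree steps with 10 A phase amplitude.
dq_samples = []
for k in range(12):
    theta_e = k * math.pi / 6.0
    i_a, i_b = phase_currents(10.0, theta_e)
    i_alpha, i_beta = clarke(i_a, i_b)
    dq_samples.append(park(i_alpha, i_beta, theta_e))
```

The phase currents swing through a full sine cycle during the sweep, yet every `(i_d, i_q)` pair comes out as approximately `(0, 10)`: zero flux current and a constant torque current. That constant is what the two PI loops regulate.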
## What a motor controller actually does Strip away the marketing and a motor controller does one thing: it takes a **DC bus** (a battery, a rectified mains supply, a bench supply) and a **command** (torque, velocity, or position), and synthesizes the **phase currents** that make the motor follow that command. Everything else — the comms, the displays, the safety relays — is scaffolding around that core job. For a brushed DC motor the job is almost trivial: current is proportional to torque, so a single H-bridge with PWM duty controls torque directly. There is no commutation to do because the motor's mechanical commutator already does it. This is why brushed-motor "controllers" are so simple, and why brushed motors persist in low-cost gear. For a **brushless** machine — BLDC or PMSM, see the [brushless DC motors guide](/posts/brushless-dc-motors-bldc-ultimate-guide/) — there is no mechanical commutator. The controller *is* the commutator. It must continuously decide, based on rotor angle, which windings to energize and how hard, so the stator field stays roughly 90 electrical degrees ahead of the rotor field. That is the whole game: keep the produced torque maximal and smooth by keeping the field angle right. > **Rule of thumb**: the controller is what turns a motor into a *servo*. A motor has a torque constant; a servo has a torque *command you can trust*. ### Torque, current, and the role of the controller In a permanent-magnet machine, torque is (to first order) proportional to the current component that is orthogonal to the rotor flux. Control that current and you control torque. The controller closes a loop around current precisely so that when you ask for 10 N·m, the firmware drives whatever phase voltages are needed — accounting for back-EMF, resistance, and inductance — to make the torque-producing current equal to its target. 
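To make "closing a loop around current" concrete, here is a minimal simulation sketch of a PI current loop around a single R-L winding, using the pole-zero-cancellation gains from the key takeaways (`Kp = L·ωc`, `Ki = R·ωc`). The motor values (0.1 Ω, 100 µH), the 1 kHz bandwidth, and the function names are illustrative assumptions. The model is idealised: no voltage limit, no PWM, no sampling delay, and a constant back-EMF treated as a disturbance.

```python
import math


def pi_gains(resistance, inductance, bandwidth_hz):
    """Pole-zero-cancellation gains: Kp = L*wc, Ki = R*wc (wc in rad/s)."""
    wc = 2.0 * math.pi * bandwidth_hz
    return inductance * wc, resistance * wc, wc


def simulate_current_step(resistance, inductance, bandwidth_hz, target,
                          back_emf=0.0, duration=0.01, dt=1e-6):
    """Forward-Euler simulation of a PI current loop around one R-L winding.

    Returns a list of (time, current) samples.
    """
    kp, ki, _ = pi_gains(resistance, inductance, bandwidth_hz)
    current, integral, trace = 0.0, 0.0, []
    for n in range(int(round(duration / dt))):
        error = target - current
        integral += error * dt
        voltage = kp * error + ki * integral
        # Winding dynamics: L di/dt = v - R*i - back_emf
        current += dt * (voltage - resistance * current - back_emf) / inductance
        trace.append(((n + 1) * dt, current))
    return trace


R, L, BW, DT = 0.1, 100e-6, 1000.0, 1e-6
kp, ki, wc = pi_gains(R, L, BW)

# Step to 10 A with no back-EMF: read the current one time constant (1/wc) in.
trace = simulate_current_step(R, L, BW, target=10.0, dt=DT)
steps_to_tau = round(1.0 / (wc * DT))
current_at_tau = trace[steps_to_tau - 1][1]

# Same step against a constant 2 V back-EMF: the integrator must absorb it.
disturbed = simulate_current_step(R, L, BW, target=10.0, back_emf=2.0, dt=DT)
final_current = disturbed[-1][1]
```

With the gains chosen this way the loop behaves like a first-order lag, so the current reaches about 63% of its target after `1/ωc` (roughly 159 µs here), and the integrator removes the steady-state error that a constant back-EMF would otherwise cause. A real drive adds voltage saturation and sampling delay on top, which is why measured bandwidth usually falls short of the design value.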
This is the conceptual leap that separates a "driver" from a "controller": a driver applies voltage; a controller regulates current (and therefore torque) by closing a feedback loop hundreds or thousands of times per second. ## The power stage: inverter, transistors, gate drivers, sensing Before any math, there has to be hardware that can actually push amps into windings. For a three-phase machine that hardware is a **three-phase inverter**: three half-bridges, one per motor phase, six switches total. ### The three-phase inverter Each half-bridge (a "leg") has a **high-side** switch connecting the phase to V+ and a **low-side** switch connecting it to ground. By PWM-modulating each leg's duty cycle you set the average voltage on each phase. Three legs, three phase voltages, and the difference between them is what drives current through the motor's star- or delta-connected windings. The six switches are never all independent: in each leg, high and low must never be on simultaneously (that is a dead short across the bus — "shoot-through" — and it destroys transistors in microseconds). Hence dead-time, covered later. ### MOSFET vs IGBT vs GaN The switch technology you pick is mostly a function of bus voltage and switching frequency: - **Silicon MOSFETs** dominate from a few volts up to ~200 V (and increasingly to 650 V). Low on-resistance (`R_DS(on)` in the single-digit milliohms for good 40–100 V parts), fast switching, cheap. Nearly every hobby and robotics drive uses them. ODrive, VESC, and Moteus are all MOSFET designs. - **IGBTs** take over at high voltage and high power — think 600 V to 1700 V, tens to thousands of amps, industrial and traction drives. They have a fixed ~1–2 V saturation drop (bad at low current) but scale to power levels MOSFETs cannot. Switching is slower, so IGBT drives often run 4–16 kHz PWM. - **GaN** (gallium nitride) and **SiC** (silicon carbide) are the modern wide-bandgap options. 
GaN excels at lower voltages (≤650 V) with extremely fast switching and tiny losses, enabling >100 kHz PWM and very compact drives. SiC owns the 650 V–1200 V high-power space (EV traction inverters). Both cost more and demand careful layout because the fast `dV/dt` (tens of V/ns) makes EMI and gate-loop parasitics unforgiving. > **Rule of thumb**: under 60 V, use silicon MOSFETs unless you have a specific reason not to. GaN is worth it when size or switching loss dominates; SiC and IGBT belong above a few hundred volts. ### Gate drivers and bootstrap A logic-level microcontroller pin cannot switch a power MOSFET fast enough — gate charge is too large and the high-side gate needs to float above the bus. That is the **gate driver's** job: it takes a PWM logic signal and delivers several amps of gate current to switch the FET in tens of nanoseconds. The high-side switch is the tricky one. Its source floats at the phase voltage, which swings between 0 and V+. To turn it fully on, the gate must be driven *above* V+. Two common solutions: - **Bootstrap**: a capacitor charges through a diode to roughly the gate-drive rail (~12 V) while the low-side is on and the phase is near ground; that charge then floats the high-side gate supply when the high-side turns on. Cheap, but the bootstrap cap must be periodically refreshed, so you cannot hold a phase high indefinitely at zero speed without a charge-pump or isolated supply. - **Isolated supplies / charge pump**: an isolated DC-DC per high-side, or a charge pump, supplies the high-side gate continuously. More expensive, but mandatory for sustained DC output (e.g., a servo holding torque at zero speed). This bootstrap limitation is a real gotcha: some cheap ESCs visibly struggle to hold a stalled motor because the bootstrap caps droop. Servo-grade drives use isolated or charge-pump high-side supplies for exactly this reason. 
### Current sensing: shunt vs Hall

FOC needs **phase current measurements**, and the quality of those measurements sets a hard ceiling on control quality. You cannot regulate what you cannot see.

- **Low-side shunt resistors**: a small resistor (e.g., 0.5–2 mΩ) in series with each low-side FET, measured with a differential amplifier. Cheap and accurate, but you can only read current when the low-side is on, so sampling must be synchronized to the PWM (sample in the middle of the low-side-on window). At very high duty cycles the low-side window shrinks and measurement gets hard — three-shunt designs help, and many drives reconstruct the third phase from `Ia + Ib + Ic = 0`.
- **Inline (phase) shunts**: resistor directly in the phase wire with a high-side-capable or isolated amplifier. Measures continuously regardless of switch state, which is cleaner for FOC, at higher cost and complexity. Which topology a given drive uses varies by board and hardware revision, so check its schematic rather than assuming.
- **Hall-effect current sensors** (e.g., closed-loop or magnetoresistive): galvanically isolated, no insertion loss, good for high current and high voltage. More expensive, more board area, and bandwidth/offset can be limiting. Common in industrial and high-power drives.

> **Rule of thumb**: two phase-current measurements are enough (the third is `-(Ia+Ib)`), but three measurements give you redundancy, fault detection, and better performance near 100% duty.

### The DC bus

The **bus capacitor** is not a detail. The inverter draws pulsed current from the bus at the switching frequency, and the source (battery, supply) cannot respond that fast. Bus capacitance — bulk electrolytics plus ceramic decoupling close to the FETs — supplies the high-frequency ripple current and clamps voltage transients. Undersized bus caps cause voltage ripple, EMI, and in the worst case overvoltage trips during regen. A drive that ignores its bus capacitor will be noisy and unreliable no matter how good the firmware is.
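To see why an undersized bus capacitor can lead to overvoltage during regen, here is a rough energy-balance sketch. It assumes the worst case: the supply cannot sink current, there is no brake resistor, and a chosen fraction of the load's kinetic energy ends up in the capacitor. The function name and the example numbers are illustrative assumptions.

```python
import math


def bus_voltage_after_regen(v_bus, c_bus, inertia, omega, efficiency=1.0):
    """Estimate bus voltage after stopping a spinning inertia.

    Energy balance: 0.5*C*V_final^2 = 0.5*C*V_bus^2 + efficiency * 0.5*J*omega^2

    v_bus      : initial bus voltage, volts
    c_bus      : total bus capacitance, farads
    inertia    : reflected load inertia J, kg*m^2
    omega      : initial mechanical speed, rad/s
    efficiency : assumed fraction of kinetic energy that reaches the capacitor
    """
    e_kinetic = 0.5 * inertia * omega ** 2
    e_cap_initial = 0.5 * c_bus * v_bus ** 2
    return math.sqrt(2.0 * (e_cap_initial + efficiency * e_kinetic) / c_bus)


# Hypothetical case: 48 V bus, 1000 uF, J = 0.001 kg*m^2 spinning at 100 rad/s.
v_final = bus_voltage_after_regen(48.0, 1000e-6, 1e-3, 100.0)
```

In this assumed case only 5 J of kinetic energy is enough to push the estimate to more than double the starting bus voltage, so the energy needs somewhere else to go.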
## Commutation methods: six-step vs sinusoidal vs FOC

There are three families of commutation for a brushless machine, in increasing order of sophistication and torque quality.

### Six-step (trapezoidal) — the BLDC ESC

In **six-step** or **trapezoidal** commutation, at any instant exactly two of the three phases conduct and one floats. As the rotor turns, the controller switches through six conduction states (hence "six-step"), each spanning 60 electrical degrees, typically using **Hall sensors** or back-EMF zero-crossing on the floating phase to know when to commutate. It is simple and computationally trivial — a lookup table and a PWM duty. It is also what most RC/drone ESCs do.

The downside is **torque ripple**: because current is held flat across each 60° sector while the back-EMF and ideal current vary, torque pulses at six times the electrical frequency, with roughly **14% peak-to-peak** ripple in the ideal case and more in practice. At low speed this ripple is audible and felt as cogging-like roughness, and back-EMF sensing fails near zero speed.

### Sinusoidal commutation

**Sinusoidal** (or "sine") commutation drives all three phases continuously with sinusoidal currents phased 120° apart, tracking rotor angle from a position sensor. This eliminates the six-step torque ripple and is smooth and quiet. But classic sinusoidal control regulates the *phase* currents directly in the stationary frame, where the targets are time-varying sinusoids — and PI controllers have finite bandwidth, so they lag and lose accuracy as speed rises. It is smooth at low speed but degrades at high speed.

### FOC (vector control)

**FOC** keeps the smooth sinusoidal currents but transforms the control problem into the rotor's rotating frame, where the quantities become DC and the PI loops face a constant setpoint at any speed. It also explicitly decouples torque-producing current from flux-producing current.
The result is smooth torque from zero to top speed, optimal torque per amp, and the ability to do field weakening. The cost is more computation (the transforms) and a need for accurate, fast rotor-angle and current measurement.

| Method | Torque ripple | Low-speed quality | High-speed quality | Sensor need | Compute | Typical use |
|---|---|---|---|---|---|---|
| Six-step / trapezoidal | High (~14%+) | Poor | Good | Hall or sensorless BEMF | Trivial | Drones, fans, pumps, e-bikes (cheap) |
| Sinusoidal | Low | Good | Degrades with speed | Needs angle (encoder) | Moderate | Quiet appliance/HVAC, basic servo |
| FOC (vector) | Very low | Excellent | Excellent | Needs accurate angle | Higher (transforms) | Robotics, servos, EVs, anything precise |

> **Rule of thumb**: if it spins fast and roughness doesn't matter (a propeller, a pump), six-step is fine and cheaper. If you need controllable torque, smoothness, or motion at low/zero speed, use FOC.

## FOC explained properly: Clarke, Park, and the dq frame

Here is the part people get hand-wavy about. Let us do it correctly, stating conventions.

The problem: in the stator frame, phase currents `Ia, Ib, Ic` are sinusoids that vary with rotor position. Controlling sinusoids with PI loops is hard because the target keeps moving. The solution is two coordinate transforms that take us into a frame that rotates *with* the rotor, where the currents we care about are constant (DC) in steady state.

### Step 1 — Clarke transform: 3-phase → 2-axis stationary (αβ)

The three phase currents are not independent (they sum to zero in a star connection), so two orthogonal axes fully describe them. The **Clarke transform** maps `(Ia, Ib, Ic)` onto a stationary two-axis frame `(Iα, Iβ)` where α is aligned with phase A.
Using the amplitude-invariant (2/3) convention:

```text
Clarke transform (amplitude-invariant, assuming Ia + Ib + Ic = 0):
  Iα = Ia
  Iβ = (Ia + 2·Ib) / sqrt(3)

Full form (not assuming sum = 0):
  Iα = (2/3) · ( Ia - 0.5·Ib - 0.5·Ic )
  Iβ = (2/3) · ( (sqrt(3)/2)·Ib - (sqrt(3)/2)·Ic )
```

The αβ frame is still stationary — `Iα` and `Iβ` are still sinusoids as the rotor turns. We have just gone from three numbers to two. The real magic is next.

### Step 2 — Park transform: stationary αβ → rotating dq

The **Park transform** rotates the αβ vector by the rotor electrical angle `θe`, into a frame that spins synchronously with the rotor. The **d-axis** (direct) is aligned with the rotor's permanent-magnet flux; the **q-axis** (quadrature) is 90 electrical degrees ahead and is the torque-producing axis.

```text
Park transform (αβ -> dq), θe = rotor electrical angle:
  Id =  Iα·cos(θe) + Iβ·sin(θe)
  Iq = -Iα·sin(θe) + Iβ·cos(θe)
```

Because the frame rotates with the rotor, the sinusoidal αβ currents become **constant** Id and Iq in steady state. That is the whole point: **AC control becomes DC control.** A PI controller regulating a DC quantity has zero steady-state error and behaves beautifully — none of the lag problems of chasing a moving sinusoid.

The physical meaning:

- **Iq** is the current orthogonal to the rotor flux → it produces torque. Torque ≈ `(3/2)·(P/2)·λ_pm·Iq` for a surface-PM machine, where `P` is pole count and `λ_pm` is the magnet flux linkage.
- **Id** is the current aligned with the rotor flux → it produces no useful torque in a non-salient machine; it adds to or weakens the magnet flux.

### Step 3 — Id = 0 control

For a **surface-mount PMSM** (non-salient, `Ld ≈ Lq`), every amp of d-axis current is wasted heat that produces no torque. So the d-axis setpoint is **Id\* = 0**: put all your current into the q-axis, getting maximum torque per amp (MTPA).
For **interior PM** or salient machines, MTPA actually wants a small negative Id to exploit reluctance torque — but Id = 0 is the correct, simple default for the surface-PM motors most robots use.

### Step 4 — The two PI current loops

Now we have two clean DC control problems:

- A **q-axis PI loop** drives `Iq → Iq*` (the torque command from the outer loops). Its output is `Vq`, the q-axis voltage demand.
- A **d-axis PI loop** drives `Id → 0` (or the field-weakening setpoint). Its output is `Vd`.

The two axes are slightly coupled through speed (the `ω·L·I` cross terms and back-EMF). Good FOC adds **decoupling feedforward** terms so each PI loop sees an almost independent first-order plant:

```text
Decoupling feedforward (added to PI outputs):
  Vd_ff = -ωe · Lq · Iq
  Vq_ff = +ωe · (Ld · Id + λ_pm)
```

### Step 5 — Inverse Park, then SVPWM

The PI loops give us `(Vd, Vq)` in the rotating frame. To actually command the inverter we rotate back to the stationary frame with the **inverse Park transform**:

```text
Inverse Park (dq -> αβ):
  Vα = Vd·cos(θe) - Vq·sin(θe)
  Vβ = Vd·sin(θe) + Vq·cos(θe)
```

Then `(Vα, Vβ)` — a voltage vector in the stationary plane — is realized by the inverter using **Space Vector PWM (SVPWM)**. Conceptually SVPWM approximates the desired voltage vector as a time-weighted average of the eight discrete states the inverter can produce (six "active" vectors 60° apart, plus two "zero" vectors with all-high or all-low). It computes how long to spend in the two adjacent active vectors and the zero vectors over each PWM period.

The practical reason to use SVPWM rather than naive sinusoidal PWM: it uses the DC bus about **15.5% more effectively** (it can synthesize a fundamental amplitude up to `Vdc/√3` rather than `Vdc/2`), because it injects a third harmonic / common-mode offset that cancels across the line-to-line voltages. More bus utilization means more speed and more torque headroom from the same battery.
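The transforms in Steps 1, 2, and 5 translate almost line for line into code. The sketch below follows the amplitude-invariant convention used above. For the modulator it uses min-max common-mode injection, a common shortcut that matches classic sector-based SVPWM in the linear range, so treat it as one possible realization and not the only one. All function names are illustrative.

```python
import math

SQRT3 = math.sqrt(3.0)


def clarke(ia, ib):
    """Amplitude-invariant Clarke transform, assuming ia + ib + ic = 0."""
    return ia, (ia + 2.0 * ib) / SQRT3


def park(i_alpha, i_beta, theta_e):
    """Rotate the stationary (alpha, beta) vector into the rotor (d, q) frame."""
    c, s = math.cos(theta_e), math.sin(theta_e)
    return i_alpha * c + i_beta * s, -i_alpha * s + i_beta * c


def inverse_park(vd, vq, theta_e):
    """Rotate a (d, q) vector back into the stationary (alpha, beta) frame."""
    c, s = math.cos(theta_e), math.sin(theta_e)
    return vd * c - vq * s, vd * s + vq * c


def svpwm_duties(v_alpha, v_beta, v_dc):
    """Three leg duty cycles via min-max common-mode injection.

    The injected offset centers the three phase demands between the rails,
    which is what extends the linear range from Vdc/2 to Vdc/sqrt(3).
    """
    # Inverse Clarke: stationary vector -> three phase-to-neutral demands.
    va = v_alpha
    vb = -0.5 * v_alpha + (SQRT3 / 2.0) * v_beta
    vc = -0.5 * v_alpha - (SQRT3 / 2.0) * v_beta
    # Common-mode offset; it cancels in every line-to-line voltage.
    offset = -0.5 * (max(va, vb, vc) + min(va, vb, vc))
    return tuple(0.5 + (v + offset) / v_dc for v in (va, vb, vc))


# Example: command Vd = 0 V, Vq = 6 V at 30 electrical degrees on a 24 V bus.
v_alpha, v_beta = inverse_park(0.0, 6.0, math.radians(30.0))
duty_a, duty_b, duty_c = svpwm_duties(v_alpha, v_beta, 24.0)
```

A quick sanity check on any implementation is the round trip: build phase currents from a known `(Id, Iq)`, run them through `clarke` and `park` at the same angle, and confirm you get the same `(Id, Iq)` back.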
### The complete FOC loop, in order

Putting it together, every current-loop tick (typically every 25–125 µs):

```text
1. Sample phase currents Ia, Ib (Ic = -(Ia+Ib))   [synchronized to PWM]
2. Read rotor electrical angle θe from sensor/observer
3. Clarke:        (Ia, Ib)      -> (Iα, Iβ)
4. Park:          (Iα, Iβ, θe)  -> (Id, Iq)
5. PI loops:      Id->Id*=0 gives Vd ; Iq->Iq* gives Vq (+ decoupling)
6. Inverse Park:  (Vd, Vq, θe)  -> (Vα, Vβ)
7. SVPWM:         (Vα, Vβ)      -> three PWM duty cycles
8. Update inverter compare registers
```

That loop, run fast and fed accurate current and angle, is FOC. Everything in the rest of this guide is in service of running it well.

## The control cascade: current, velocity, position

FOC's current loop regulates torque. But you rarely command raw torque to a robot joint — you command a *position* or a *velocity*. So real drives stack three nested loops, the classic **cascade**:

```text
position* -> [POSITION PI/P] -> velocity* -> [VELOCITY PI] -> Iq* (torque)
          -> [CURRENT PI x2 = FOC] -> inverter
```

- **Inner: current (torque) loop.** The FOC dq loops. Fastest, runs at the PWM-synchronized rate (e.g., 10–40 kHz). It must be the fastest because everything outside it assumes torque is "instant."
- **Middle: velocity loop.** Takes a velocity command, compares to measured velocity (from encoder differentiation or observer), outputs a torque command. Runs at, say, 1–8 kHz.
- **Outer: position loop.** Takes a position command, compares to measured position, outputs a velocity command. Often just proportional, runs at hundreds of Hz to a few kHz.

### Bandwidth separation

The cascade only works if the loops are **separated in bandwidth**. Each loop must be fast enough that the loop *inside* it looks instantaneous, and slow enough that it doesn't fight the loop *outside* it.

> **Rule of thumb**: target roughly a **5–10× bandwidth ratio** between adjacent loops. If your current loop is ~1 kHz, velocity loop ~100–200 Hz, position loop ~10–30 Hz.
Violate this and the loops interact, you get oscillation, and tuning becomes a nightmare. ### Feedforward Pure cascaded feedback always lags — the error has to *exist* before the controller reacts. **Feedforward** injects a predicted command ahead of the error: - **Velocity feedforward** into the position loop: feed the commanded velocity directly, so the position loop only corrects the residual. - **Acceleration / torque feedforward** into the velocity loop: from a trajectory's known acceleration and the load inertia, compute the torque you *know* you'll need (`τ = J·α`) and add it directly to Iq\*. Done well, feedforward lets a drive track a smooth trajectory with tiny following error while keeping feedback gains modest. This is standard on industrial motion controllers and increasingly on robotics drives like Moteus and ODrive. For where these loops physically run and at what determinism, see [real-time control systems](/posts/real-time-control-systems-ultimate-guide/). ## Rotor position and the sensor problem FOC's Park transform needs the **rotor electrical angle θe** every tick. Get it wrong by even 10–20 electrical degrees and you lose torque and efficiency; get it 90° wrong and the motor produces no torque or runs away. So rotor position sensing is the most consequential decision in a FOC system after the power stage. ### Encoders and absolute sensors The clean answer is a **position sensor** on the shaft. See the [encoders guide](/posts/encoders-ultimate-guide/) for depth, but in brief: - **Magnetic absolute encoders** (e.g., on-axis Hall-array chips like the AS5047/AS5048, or the iC-Haus/MA-series parts) give 12–14 bit absolute angle over SPI/ABI, are cheap, and are the workhorse of robotics drives. ODrive and Moteus default to these. - **Optical incremental encoders** give high resolution and accuracy but are incremental — you need an index pulse or commutation hall sensors to find the absolute angle at startup. 
- **Resolvers** are rugged, analog, absolute, and standard in industrial/automotive servo motors; they need resolver-to-digital conversion. With a known **electrical angle offset** (the alignment between the encoder zero and the rotor's d-axis, found by a calibration routine at startup), the sensor gives θe directly and FOC just works from zero speed. ### Hall sensors Three Hall sensors give 60°-resolution commutation states — enough for six-step, and enough to bootstrap FOC at startup, but too coarse for high-quality FOC angle on their own. Some drives interpolate Hall transitions with velocity, or use Halls only to seed a sensorless observer. ### Sensorless: observers and back-EMF A sensor adds cost, wiring, and a failure point. **Sensorless** FOC estimates θe from the electrical signals alone: - **Back-EMF / flux observers**: the rotor's motion induces a back-EMF proportional to speed; by observing the motor's voltage and current and running a model (a flux-linkage observer, a Luenberger observer, or an extended Kalman filter), you can estimate the flux angle and thus θe. Texas Instruments' **InstaSPIN-FOC** packages exactly this (their "FAST" flux/angle/speed/torque estimator) in ROM. - **Sliding-mode observers (SMO)**: a robust nonlinear observer that estimates back-EMF and is popular for its disturbance rejection. These work well above some minimum speed (often a few hundred electrical RPM). The fundamental problem is **zero and low speed**: back-EMF is proportional to speed, so near standstill there is almost no signal to observe. The angle estimate becomes garbage exactly when you need to start. ### The startup / zero-speed problem Two common fixes: - **Open-loop / forced commutation start**: ramp a rotating voltage vector to drag the rotor up to a speed where the observer locks in, then switch to closed-loop. Crude, can cause a stutter or backward kick, but fine for fans and pumps. 
- **High-frequency injection (HFI)**: inject a small high-frequency signal and measure the inductance variation with rotor angle (it only works on **salient** machines, where `Ld ≠ Lq`). This gives true zero-speed sensorless position — it is how some appliance and traction drives start under load without a sensor. > **Rule of thumb**: if you need controllable torque at zero speed (a robot joint, a winch, a stalled actuator), use a position sensor. Sensorless is excellent for spinning loads but is a compromise at standstill. ## PWM, switching, dead-time, and field weakening ### Switching frequency The inverter chops the bus at the **PWM switching frequency**, typically: - **8–20 kHz** for industrial IGBT drives and many BLDC ESCs (often kept below ~20 kHz to limit switching loss; above 20 kHz also pushes it out of the audible band). - **20–60 kHz** for low-voltage MOSFET robotics drives (ODrive, Moteus, VESC commonly run 20–40 kHz). - **>100 kHz** possible with GaN. Higher switching frequency means **lower current ripple** (the inductor sees the chopping less), **higher achievable control bandwidth**, and quieter operation — but **more switching loss** (each transition burns energy in the FET). It is a direct trade. The current control loop usually runs at the PWM rate or half of it (sampling at the PWM peak/trough where current is at its average value). So switching frequency and control-loop rate are linked. ### Dead-time In each leg, you must insert a **dead-time** — a brief window where both high and low switches are off — during the transition, so they are never on together (shoot-through). Typical dead-time is **0.1–2 µs** depending on device speed and gate drive. Dead-time is necessary but harmful: during it, current flows through the body/freewheel diodes and the actual output voltage deviates from what you commanded, by an amount that depends on current direction. 
This **dead-time distortion** causes low-order harmonics, torque ripple, and current crossover distortion near zero current. Good drives apply **dead-time compensation** — predicting the error from current sign and adding it back to the duty command.

> **Rule of thumb**: minimize dead-time to the smallest value your gate drive and FETs can safely tolerate, then compensate the residual in firmware. Excess dead-time is pure distortion.

### Bus voltage and modulation index

The **modulation index** is how much of the available DC bus you're using. At 100% (full modulation) you've run out of voltage — the back-EMF plus the IR and L·di/dt drops have consumed the entire bus. Once you hit the voltage ceiling, you cannot push more current at that speed; the current loop **saturates**. With SVPWM the linear ceiling is `V_phase_peak ≤ Vdc/√3` (≈ 0.577·Vdc), vs 0.5·Vdc for sinusoidal PWM. Beyond linear, **overmodulation** squeezes a little more out at the cost of harmonic distortion.

### Field weakening

What happens when you want to spin **faster** than the bus voltage allows at Id = 0? The back-EMF grows with speed and eventually equals the bus, leaving no headroom to push current. **Field weakening** drives **Id negative**, creating a stator flux that opposes the rotor magnet flux, reducing the effective back-EMF and letting the motor spin faster — at the cost of torque (you're spending current on flux instead of torque, and the total current is limited).

```text
Field-weakening logic (simplified):
  - Run Id* = 0 until the q-axis voltage demand Vq approaches the bus limit.
  - As the voltage vector magnitude sqrt(Vd^2 + Vq^2) hits the SVPWM ceiling,
    command Id* < 0 to reduce back-EMF and free up voltage for Iq.
  - Respect total current limit: Id^2 + Iq^2 <= Imax^2.
```

Field weakening is how EVs and high-speed spindles get a wide constant-power speed range above base speed.
ODrive, VESC, and most industrial drives support it; it demands accurate motor parameters and careful current limiting because a field-weakening fault at speed (e.g., losing control while back-EMF exceeds the bus) can overvoltage the bus.

## Tuning a FOC drive

The single best thing about FOC is that the inner loop is **analytically tunable from motor parameters** — you don't have to guess.

### Current-loop gains from R and L

Model one axis of the motor as a first-order R–L plant: `V = R·I + L·(dI/dt)`. A PI controller `Kp + Ki/s` regulating this plant has a clean closed-form tuning if you place the PI zero to **cancel the plant pole** (`Ki/Kp = R/L`). Then the closed loop becomes a first-order system with bandwidth `ωc` (rad/s), and:

```text
Current-loop PI gains (pole-zero cancellation):
  Let ωc = desired current-loop bandwidth in rad/s
           (e.g., bandwidth_Hz * 2*pi; pick ~1/10 of switching freq)

  Kp = L · ωc      // proportional gain (volts per amp)
  Ki = R · ωc      // integral gain (volts per amp-second)

  Check: Ki/Kp = R/L  -> PI zero cancels the motor's electrical pole
```

So if you measure `R = 0.1 Ω`, `L = 50 µH`, and want a 1 kHz current loop (`ωc = 2π·1000 ≈ 6283 rad/s`): `Kp = 50e-6 · 6283 ≈ 0.31 V/A` and `Ki = 0.1 · 6283 ≈ 628 V/(A·s)`. No guessing. This is why ODrive and similar drives ask you to measure R and L first (their calibration routine injects current and identifies both) — then they compute current gains automatically.

> **Rule of thumb**: set current-loop bandwidth to roughly **1/10 of the PWM frequency**. At 20 kHz PWM, a ~2 kHz current loop is reasonable. Faster than ~1/5 and you risk instability from sampling and delay.

### Anti-windup

When the voltage demand saturates against the bus, the integrator keeps accumulating error it cannot act on — **integral windup** — and when the saturation clears, the wound-up integral causes a big overshoot.
Every real PI loop needs **anti-windup**: clamp or back-calculate the integrator so it doesn't accumulate during saturation. In a FOC voltage limiter, when `sqrt(Vd²+Vq²)` exceeds the ceiling, both axis integrators must be held/back-calculated, with the q-axis usually prioritized for torque.

### Autotuning

Modern drives automate most of this. **TI InstaSPIN** identifies motor parameters and sets up FOC with minimal user input. **ODrive** runs a motor-calibration sequence (resistance, inductance, encoder offset, pole pairs). Industrial drives from **Copley** and **Elmo** have one-button autotuning that identifies the mechanical plant (inertia, friction, resonances) and sets velocity/position gains, often with notch filters for mechanical resonances.

### The practical bring-up sequence

A safe order to bring up a new motor+drive combination:

```text
1. Power stage check at low bus voltage / current limit. Confirm no shoot-through.
2. Motor parameter ID: measure phase resistance R and inductance L.
3. Encoder/sensor calibration: find pole pairs and the electrical angle offset
   (align rotor to a known phase, record encoder reading).
4. Current (FOC) loop: set Kp/Ki from R, L. Command small Iq, verify smooth
   torque and correct direction. Watch current waveforms if you can.
5. Velocity loop: with current loop trusted, close velocity. Tune for ~1/5 to
   1/10 of current-loop bandwidth. Add inertia feedforward if known.
6. Position loop: close last, slowest. Add velocity feedforward.
7. Set protection limits (I2t, overtemp, overvoltage) BEFORE real loads.
8. Test under representative load, then under fault conditions (e-stop, stall).
```

> **Rule of thumb**: never close the velocity loop until the current loop is verified, and never close the position loop until velocity is solid. Tune from the inside out, always.

## The drive ecosystem: hobby vs industrial

The FOC algorithm is the same everywhere.
What differs across the market is hardware quality, interfaces, certification, ruggedness, and price. Three tiers:

### Hobby / robotics open ecosystem

- **ODrive** — open(-ish) high-performance FOC controllers (the dual-axis ODrive 3.6, and the newer single-axis ODrive Pro / S1 / Micro). Strong at high-torque robotics and direct-drive joints; encoder-based FOC, CAN, good docs and community. Typical: 12–56 V, tens of amps continuous.
- **Moteus (mjbots)** — compact single-axis FOC controller designed for legged/dynamic robots, integrated magnetic encoder, CAN-FD, very high loop rates, sold with matching actuators. Excellent for quadrupeds and dynamic legged machines.
- **VESC** — originally an e-skateboard/e-bike controller, now a huge open-source FOC ecosystem (hardware, firmware, and the VESC Tool configuration app). Wide voltage/current range, sensorless and sensored, enormous community, many clones (buyer beware on clone power-stage quality).
- **SimpleFOC** — an open-source Arduino/STM32 *library* plus reference driver boards. Not a product so much as a way to put real FOC on your own MCU. Great for learning and custom low/medium-power designs; performance depends entirely on your hardware.

### Industrial servo drives

- **Copley Controls** — high-end servo drives (Accelnet, Xenus families), EtherCAT/CANopen, excellent tuning tools, strong in semiconductor/medical/automation.
- **Elmo Motion Control** — famously tiny, high-power-density "Gold" line servo drives (Gold Solo Whistle, etc.), EtherCAT, aerospace/robotics.
- **Kollmorgen** — AKD servo drive family, tightly integrated with their servo motors, industrial automation and robotics.
- Others worth knowing: **Beckhoff** (drives + EtherCAT ecosystem), **Trinamic/ADI (TMC)** for integrated stepper/BLDC driver ICs with onboard FOC (e.g., TMC4671 hardware FOC), and **Texas Instruments InstaSPIN** as a chip-level FOC solution.

### Integrated motor + drive

A growing category: the drive lives *inside* the motor housing.
Examples include mjbots actuators, many collaborative-robot joint modules, and "smart" servo actuators (Dynamixel-class, though those are often simpler control). Benefits: no motor-to-drive wiring (huge for EMI and assembly), compact, calibrated as a unit. Costs: harder to service, thermal coupling between drive and motor, less flexibility. See [robot actuators](/posts/robot-actuators-ultimate-guide/) for the actuator-level view.

| Drive | Tier | Typical voltage | Comms | Sensor | FOC | Best for |
|---|---|---|---|---|---|---|
| ODrive Pro / S1 | Hobby/robotics | 12–56 V | CAN, USB | Encoder (mag/optical) | Yes | High-torque robotics, direct drive |
| Moteus (mjbots) | Hobby/robotics | up to ~44 V | CAN-FD | Onboard magnetic | Yes | Legged/dynamic robots |
| VESC | Hobby/robotics | ~12–60+ V (variant) | CAN, UART, USB | Sensored + sensorless | Yes | E-mobility, makers, wide range |
| SimpleFOC | Library/DIY | Your design | Your choice | Your choice | Yes | Learning, custom designs |
| TMC4671 (ADI/Trinamic) | IC | Chip-level | SPI/Step-Dir | Many | Hardware FOC | Embedding FOC in a product |
| Copley Accelnet/Xenus | Industrial | up to ~400 V+ | EtherCAT, CANopen | Encoder, resolver | Yes | Automation, semicon, medical |
| Elmo Gold | Industrial | wide | EtherCAT, CANopen | Encoder, resolver | Yes | Aerospace, compact high power |
| Kollmorgen AKD | Industrial | 120–480 VAC | EtherCAT, etc. | Encoder, resolver | Yes | Industrial servo systems |
| TI InstaSPIN (C2000) | IC/SDK | Your design | Your choice | Sensorless (FAST) | Yes | Sensorless products |

> **Rule of thumb**: if you're building a robot prototype or a small fleet, the open robotics drives give you 90% of the performance at 10–20% of the cost. If you need certified functional safety, deterministic EtherCAT motion across many axes, and a vendor to call at 2 a.m., pay for an industrial drive.

## Communication and real-time interfaces

How does the drive get its commands, and how fast?
This matters as much as the control loop, because a perfectly-tuned 20 kHz current loop is useless if commands arrive late or jittery. ### Command interfaces, from simple to deterministic - **Analog torque/velocity command** (±10 V): the old-school servo interface. The drive runs its own loops; an external motion controller feeds an analog setpoint. Simple, fast, but noise-prone and one wire per axis. - **Step/direction**: a pulse train sets position increments (inherited from stepper drives). Common on CNC and lower-end servo drives. Simple, but open-loop in the command path and limited in bandwidth by pulse rate. - **CAN / CAN-FD**: the robotics workhorse. ODrive, Moteus, and VESC all use CAN. Classic CAN tops out at 1 Mbit/s; **CAN-FD** pushes payloads and bitrates much higher (multi-Mbit/s data phase), which is why Moteus uses it for high-rate multi-joint robots. Multi-drop (one bus, many drives), robust, cheap. Not hard-real-time deterministic at the protocol level, but fine for many robots if you manage bus load. - **EtherCAT**: the industrial gold standard for multi-axis motion. Deterministic, sub-microsecond synchronization across dozens of axes via distributed clocks, with cycle times down to tens of microseconds. This is what Copley, Elmo, Kollmorgen, and Beckhoff drives speak. If you need 32 synchronized axes updating every 250 µs, this is the answer. See [industrial automation context in the real-time guide](/posts/real-time-control-systems-ultimate-guide/). - **Ethernet/IP, PROFINET, SERCOS, POWERLINK**: other industrial real-time buses, vendor-dependent. ### Where the loops run and at what rate A crucial architectural question: **which loops live in the drive, and which on the host?** - In most robotics and industrial setups, **all three loops (current, velocity, position) run inside the drive**, at the drive's high internal rate (current 10–40 kHz, velocity kHz, position sub-kHz). 
The host just streams setpoints (e.g., target position every 1 ms over EtherCAT/CAN). This keeps the fast loops local and deterministic regardless of host jitter. - In some advanced robots, the **outer loop (whole-body control, impedance) runs on the host** at 0.5–2 kHz, streaming torque commands to drives running only the current loop. This demands a low-latency, low-jitter bus (CAN-FD or EtherCAT) and a real-time host. Legged-robot stacks often do this. > **Rule of thumb**: keep the current loop in the drive, always. Push only the loop you can afford to run at the bus rate up to the host, and only if your bus and host are genuinely real-time. For loop timing and determinism, see [real-time control systems](/posts/real-time-control-systems-ultimate-guide/). ## Protection and fault handling A drive that can't protect itself and the motor is a fire and a destroyed gearbox waiting to happen. Protection is where "it works on the bench" becomes "it survives the field." ### Overcurrent - **Hardware overcurrent trip**: a comparator on the current-sense signal that shuts the gates off in *nanoseconds to a microsecond*, independent of firmware. This is the last line of defense against a short or a control fault. Non-negotiable on a real drive. - **Software current limit**: the current loop's setpoint is clamped to `Imax`, so under normal control you never command more than the FETs/motor can take. ### I²t (thermal current limiting) Motors and FETs tolerate **brief overcurrent** but not sustained. **I²t protection** models the heating: it allows, say, 2–3× rated current for a short window (a few seconds) for acceleration, then folds back to the continuous rating. This mirrors the physics — heating is proportional to `I²·t`. A drive without I²t either nuisance-trips on legitimate peaks or cooks the motor on sustained overload. Industrial drives model this carefully; good robotics drives (ODrive, Moteus) expose continuous and peak current limits that approximate it. 
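The I²t fold-back described above can be sketched as a small thermal accumulator. This is a minimal illustration, assuming a budget defined by "peak current for `t_peak` seconds" and a simple hysteresis so the limit does not chatter. The class name, the 50% release threshold, and the example ratings are all assumptions, not the behavior of any specific drive.

```python
class I2tLimiter:
    """Allow peak current until an I^2*t budget is spent, then fold back."""

    def __init__(self, i_cont, i_peak, t_peak):
        self.i_cont = i_cont          # continuous current rating, amps
        self.i_peak = i_peak          # short-term peak rating, amps
        # Excess heating allowed: running at i_peak for t_peak seconds.
        self.budget = (i_peak ** 2 - i_cont ** 2) * t_peak
        self.accumulated = 0.0        # excess heating so far, A^2*s
        self.folded_back = False

    def update(self, i_measured, dt):
        """Integrate excess heating and return the current limit to apply."""
        # Above i_cont the accumulator fills; below it, it drains (cooling).
        self.accumulated += (i_measured ** 2 - self.i_cont ** 2) * dt
        self.accumulated = min(max(self.accumulated, 0.0), self.budget)
        if self.accumulated >= self.budget:
            self.folded_back = True
        elif self.accumulated <= 0.5 * self.budget:   # assumed release point
            self.folded_back = False
        return self.i_cont if self.folded_back else self.i_peak


# Hypothetical ratings: 10 A continuous, 30 A peak for 2 s, updated at 1 kHz.
limiter = I2tLimiter(i_cont=10.0, i_peak=30.0, t_peak=2.0)
current_limit = limiter.update(i_measured=25.0, dt=0.001)
```

The returned value would feed the software clamp on the `Iq*` setpoint. A real drive would typically add temperature feedback on top of this purely current-based estimate.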
### Overtemperature Thermistors or onboard temp sensors on the FETs/heatsink, and ideally a motor thermistor, with foldback or shutdown thresholds. Power FETs derate hard with temperature; a drive that ignores temperature will silently lose capability or fail. ### Overvoltage, regen, and braking When a motor **decelerates** or is back-driven, it acts as a generator and pumps energy *back into the bus*. The bus voltage rises. If nothing absorbs that energy: - A **battery** can usually absorb it (it just charges) — within its charge-current limits. - A **mains-rectified supply** *cannot* sink current backward, so the bus capacitor voltage climbs until something trips or pops. Two solutions: - **Brake (dump) resistor**: a resistor switched across the bus by a "brake chopper" when voltage exceeds a threshold, burning the regen energy as heat. Standard on industrial drives and offered as an option on robotics drives. Size it for your worst-case deceleration energy. - **Regenerative drive**: feeds energy back to the mains or battery. Efficient, more expensive, used in high-power and energy-conscious systems. > **Rule of thumb**: any drive that can decelerate a significant inertia, or hold a back-driven load (a vertical axis, a winch), needs a defined path for regen energy — a battery that can take it, a brake resistor, or a regen front end. "We'll figure out braking later" is how bus capacitors explode. ### Fault handling Beyond trips, a mature drive has defined fault states: encoder loss, phase loss/open, communication timeout (watchdog — if commands stop, stop the motor safely), DC-bus undervoltage, gate-driver fault, and a clean **fault latch** that requires an explicit reset. Safe-torque-off (STO) is a hardware safety input on industrial drives that disables the gates independently of firmware for functional-safety compliance. ## Choosing a controller for your robot Selection is mostly arithmetic and honesty about your loads. 
Work through it in this order — most bad choices are current/thermal errors made because someone fixated on features first.

### 1. Bus voltage

Pick a controller whose voltage range comfortably brackets your supply, including **regen overshoot** headroom. A 48 V system with regen can transiently see 55–60 V; choose a drive (and FETs) rated above that. Higher voltage means lower current for the same power (less I²R loss, thinner wires) but more switching stress and safety concern.

### 2. Current — continuous AND peak

This is where selections fail. You need:

- **Continuous current** matching your motor's continuous torque demand at the worst sustained operating point (with margin and at realistic temperature, not the datasheet's optimistic figure).
- **Peak current** for acceleration and transient torque. A drive's "peak" rating is meaningless without its duration — check the I²t window.

### 3. Sensor support

Match the drive to your feedback: incremental/absolute encoder type and protocol (ABI, SPI, SSI, BiSS), Hall, resolver, or sensorless. If you need zero-speed torque, you need an absolute or properly-calibrated sensor (see [encoders](/posts/encoders-ultimate-guide/)).

### 4. Comms

CAN/CAN-FD for robotics multi-drop; EtherCAT for deterministic multi-axis industrial; step/dir or analog for simple retrofits; USB/UART for config. Make sure the protocol matches your host stack and update-rate needs.

### 5. Form factor and thermal

Board-level vs enclosed, integrated-in-motor vs separate, and crucially **how you'll cool it**. A 40 A drive on a 20 A heatsink is a 20 A drive. Account for ambient, airflow, and duty cycle.

| Application | Voltage | Continuous current | Sensor | Comms | Suggested tier/example |
|---|---|---|---|---|---|
| Quadruped/legged joint | 24–48 V | 10–40 A | Onboard magnetic absolute | CAN-FD | Moteus / ODrive S1 |
| Direct-drive robot arm joint | 24–56 V | 20–60 A | Absolute encoder | CAN / EtherCAT | ODrive Pro / Copley |
| Mobile-robot drive wheel | 24–48 V | 10–30 A | Hall + encoder | CAN | VESC / ODrive |
| E-bike / light EV | 36–72 V | 30–100 A | Hall + sensorless | CAN/UART | VESC (quality HW) |
| Industrial multi-axis machine | 230–480 VAC | per axis | Encoder/resolver | EtherCAT | Copley / Elmo / Kollmorgen |
| Drone propulsion | 12–52 V (LiPo) | per motor | Sensorless BEMF | DShot/CAN | BLDC ESC (six-step/FOC) |
| Embedding FOC in a product | Your design | Your design | Your choice | SPI/CAN | TMC4671 / TI C2000 InstaSPIN |
| Precision quiet actuator (low speed) | 12–48 V | 1–10 A | Absolute encoder | CAN | ODrive / TMC4671 |

> **Rule of thumb**: size for the *worst-case sustained thermal* operating point, not the catalog peak. Then verify peak/acceleration is covered by the I²t window. Comms and form factor are last — they're easy to get right once the power and sensing are correct.

## Frequently asked questions

**What is the difference between an ESC and a FOC controller?** "ESC" (Electronic Speed Controller) usually means a simple, often six-step/trapezoidal BLDC controller for drones and RC, optimized for cheap, high-speed open-loop-ish operation. A FOC controller runs Field-Oriented Control for smooth torque and precise closed-loop behavior. Many modern "ESCs" now run FOC (e.g., some drone ESCs, VESC), so the terms have blurred — the real question is whether the device does vector control with current feedback or simple six-step commutation.

**Do I really need FOC, or is six-step good enough?** If your load just needs to spin fast and a little torque ripple doesn't matter — a propeller, a fan, a pump — six-step is cheaper and perfectly fine. If you need smooth, controllable torque, low-speed or zero-speed operation, high efficiency, or quiet running — robot joints, servos, precision actuators — use FOC.
Roughness, low-speed control, and torque accuracy are the deciding factors.

**Why transform into the dq frame at all — why not just control the phase currents directly?** Because in the stationary frame the target currents are sinusoids that move with the rotor, and PI controllers lag a moving target, losing accuracy as speed rises. The Park transform rotates into the rotor frame where, in steady state, the currents are *constant DC* values. A PI loop nails a DC setpoint with zero steady-state error. That conversion of AC control into DC control is the entire reason FOC exists.

**What does Id = 0 mean and when should I not use it?** Id is the current aligned with the rotor's magnet flux; for a surface-PM (non-salient) motor it produces no torque, so you set Id = 0 to put all current into torque (maximum torque per amp). You deviate from Id = 0 in two cases: **field weakening** (Id negative to spin above base speed), and **salient/interior-PM motors** where a small negative Id exploits reluctance torque for true MTPA.

**How do I tune the current loop?** Measure motor phase resistance R and inductance L (most drives do this automatically). Then with pole-zero cancellation: `Kp = L·ωc` and `Ki = R·ωc`, where ωc is your target current-loop bandwidth in rad/s (`ωc = 2π·f_bw`, with `f_bw` roughly 1/10 of the PWM frequency in Hz). That gives a clean first-order closed loop with no guessing. Add anti-windup for when the voltage saturates.

**Can I run FOC without an encoder?** Yes — sensorless FOC estimates rotor angle from back-EMF using observers (sliding-mode, flux, Kalman) or TI's InstaSPIN FAST estimator. It works well above some minimum speed. The catch is zero and low speed, where back-EMF is too small to observe; you need open-loop forced start or high-frequency injection (HFI, salient motors only). If you need controllable torque at standstill, use a position sensor.
**What switching frequency should I use?** Common ranges: 8–20 kHz for IGBT/industrial, 20–40 kHz for low-voltage MOSFET robotics drives, >100 kHz for GaN. Higher means lower current ripple and more control bandwidth but more switching loss. A frequent default is to keep it ≥20 kHz (above audible) and set the current loop at ~1/10 of it. Match it to your motor inductance — low-inductance motors need higher PWM to keep ripple sane. **Why does my motor draw current and get hot but produce no torque?** Almost always a rotor-angle problem: a wrong electrical-angle offset, miscounted pole pairs, swapped encoder direction, or a sensorless observer that hasn't locked. If θe fed to the Park transform is wrong, current goes into the d-axis (or worse) and dissipates as heat without making torque. Re-run encoder/commutation calibration and verify pole pairs. **What is dead-time and why does it matter?** Dead-time is the brief interval (0.1–2 µs) where both switches in an inverter leg are off during a transition, preventing shoot-through (a destructive bus short). It's necessary, but it distorts the output voltage in a current-direction-dependent way, causing harmonics and torque ripple — especially near zero current. Good drives apply dead-time compensation in firmware. Use the minimum safe dead-time and compensate the rest. **What is SVPWM and why is it better than sine PWM?** Space Vector PWM realizes a desired voltage vector by time-averaging the inverter's discrete switching states. Versus naive sinusoidal PWM it uses the DC bus about 15.5% more effectively (peak phase voltage up to Vdc/√3 instead of Vdc/2) by adding a common-mode/third-harmonic offset that cancels in the line-to-line voltages. More usable bus voltage means more speed and torque headroom from the same battery, plus generally lower harmonic distortion. **How do I handle regen / braking energy?** When a motor decelerates or is back-driven it pumps energy into the DC bus, raising its voltage. 
A battery can usually absorb it within its charge limits; a rectified mains supply cannot, so you need a brake (dump) resistor with a chopper to burn the energy, or a regenerative front end to return it. Any drive moving significant inertia or holding a back-driven load needs a defined regen path, sized for worst-case deceleration energy, or the bus capacitor will overvoltage. **ODrive vs Moteus vs VESC — which should I pick?** Roughly: **Moteus** for legged/dynamic robots needing compact single-axis drives with onboard encoders and CAN-FD at high rates. **ODrive** for higher-torque robotics, direct-drive joints, and dual-axis applications with strong docs. **VESC** for e-mobility and the widest open-source community and voltage/current flexibility (but vet clone hardware quality). All three do real FOC; the choice is about form factor, current/voltage range, comms, and ecosystem fit rather than the control algorithm. ## Changelog - **2026-06-09** — Initial publication. --- # Reinforcement Learning for Robotics: The Ultimate Guide URL: https://blog.robo2u.com/posts/reinforcement-learning-robotics-ultimate-guide/ Published: 2026-06-08 Updated: 2026-06-20 Tags: reinforcement-learning, rl, sim-to-real, robot-learning, policy-optimization, domain-randomization, locomotion, manipulation, guide Reading time: 36 min > A 2026 engineering guide to reinforcement learning for robots — PPO/SAC/TD3, massively parallel sim in Isaac Lab, domain randomization, teacher-student sim-to-real, reward hacking, and deploying ONNX policies on real hardware. Around 2019 a quadruped from ETH Zurich learned to walk in a simulator and then walked, on the first try, on real grass. No one hand-tuned a gait. No one wrote a state machine for stance and swing. A neural network mapped joint angles and a body-velocity command to twelve joint targets, and the gait — the whole coordinated mess of contact, recovery, and balance — fell out of an optimization that ran for a few hours on one GPU. 
That result, and the dozens that followed it, is why every serious legged-robot and humanoid team in 2026 has an RL person, and why a lot of classical-controls people are nervously learning PyTorch. This guide is the long version for the engineers who actually build these systems: the controls person who wants to know why PPO beats their carefully tuned MPC on rough terrain, the ML person who can train a policy in sim but can't get it to survive contact with a real robot, and the advanced maker who has read the ANYmal papers and wants the recipe. We go end to end: why RL suits contact-rich robotics at all, the MDP fundamentals, the three or four algorithms that actually work on hardware, the massively-parallel sim-to-real pipeline, domain randomization, the reward-hacking trap, imitation learning, the teacher-student recipe that made legged RL reliable, the landmark results, when RL beats classical control and when it absolutely does not, and how you get a trained policy running at 50 Hz on onboard compute without it blowing up. **The take**: RL is not a replacement for control theory — it is a *compiler* that turns a reward function and a good simulator into a reactive feedback policy for problems where you can't write the controller by hand. It wins decisively on contact-rich, hard-to-model, high-dimensional tasks (legged locomotion, dexterous manipulation, whole-body humanoid control) and loses to MPC and trajectory optimization on well-modeled, accuracy-critical, low-dimensional tasks (a 6-axis arm tracing a weld seam). The 2026 frontier is not the algorithm — PPO has barely changed since 2017 — it is the simulator, the randomization, and the sim-to-real bridge. Get those wrong and the fanciest algorithm gives you a policy that walks beautifully in sim and falls over on the floor. 
Companion reading: [robot simulation & digital twins](/posts/robot-simulation-digital-twin-ultimate-guide/), [legged & quadruped robot hardware](/posts/legged-quadruped-robot-hardware-ultimate-guide/), [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/), and [real-time control systems](/posts/real-time-control-systems-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why RL for robots at all](#why-rl) 3. [RL fundamentals: the MDP, reward, policy, value, return](#fundamentals) 4. [The algorithm landscape: model-free vs model-based, on- vs off-policy](#landscape) 5. [The algorithms that actually work on robots](#algorithms) 6. [The sim-to-real pipeline](#sim-to-real) 7. [Domain & dynamics randomization](#randomization) 8. [Reward shaping and the reward-hacking trap](#reward) 9. [Imitation learning: BC, DAgger, and how it complements RL](#imitation) 10. [Teacher-student & privileged learning](#teacher-student) 11. [Landmark results: legged, dexterous, humanoid](#landmarks) 12. [Learned vs classical control](#learned-vs-classical) 13. [Deploying a policy](#deploy) 14. [On-robot fine-tuning, safety & limitations](#safety) 15. [Data & compute budget](#budget) 16. [Frequently asked questions](#faq) ## Key takeaways - **RL earns its keep on contact-rich, hard-to-model, high-dimensional problems.** Legged locomotion, dexterous in-hand manipulation, and whole-body humanoid control all involve discontinuous contact dynamics that are painful to model and to control analytically. RL learns a reactive policy directly from simulated experience and sidesteps the modeling problem. - **PPO dominates parallel-sim locomotion** not because it is the most sample-efficient algorithm but because it is the most *robust* one. It tolerates bad hyperparameters, scales cleanly to tens of thousands of parallel environments, and rarely diverges. 
On a single GPU running Isaac Lab you can collect billions of simulated steps in hours, so PPO's sample-hunger stops mattering.
- **SAC and TD3 are the sample-efficient off-policy alternatives** for continuous control. Use them when environment steps are expensive — single-environment sim, or real-robot fine-tuning — where PPO's appetite for fresh on-policy data is fatal. SAC's entropy regularization makes it the safer default of the two.
- **The simulator is the product.** Sim-to-real transfer succeeds or fails on simulator fidelity, randomization, and the observation design — not on the RL algorithm. Teams obsess over PPO clip ratios when the real bug is an actuator model that ignores motor delay. See [robot simulation & digital twins](/posts/robot-simulation-digital-twin-ultimate-guide/).
- **Massively parallel simulation changed the economics.** Isaac Gym and now Isaac Lab run thousands of robot instances on a single GPU at hundreds of thousands of steps per second. A legged-locomotion policy that took days on CPU clusters in 2018 trains in a few hours or less on one GPU in 2026.
- **Domain randomization is the bridge, not a trick.** Randomize masses, friction, latency, motor gains, terrain, and sensor noise during training and the policy learns a controller robust to the *distribution* of plausible real robots — which includes the actual one. Randomize too little and it overfits sim; too much and it learns nothing.
- **Reward hacking is the default failure mode, not an edge case.** Any exploitable gap between what you reward and what you want, the optimizer will find. Budget more time for reward debugging than for algorithm tuning.
- **Teacher-student / privileged learning is the legged-robot recipe.** Train a teacher with access to privileged state (true friction, contact forces, terrain height) it could never measure on hardware, then distill it into a student that uses only onboard sensors and a short history. This decouples "learn the skill" from "learn to perceive."
- **Imitation learning complements RL; it rarely replaces it.** Behavior cloning gives you a warm start or a reference style; DAgger fixes the compounding-error problem of pure BC; RL then optimizes for robustness and performance the demonstrations never showed. - **RL beats MPC when the model is bad or the contacts are many; MPC beats RL when the model is good and accuracy is contractual.** Don't put a learned policy on a robot tracing a weld seam to 0.1 mm. Don't put MPC alone on a quadruped sprinting over rubble. See [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/). - **Deployment is an embedded-systems problem.** Export to ONNX, optionally compile with TensorRT, run inference at the control rate (50–100 Hz for locomotion, up to 1 kHz for some manipulation), and respect the [real-time control loop](/posts/real-time-control-systems-ultimate-guide/). A 2-layer MLP policy runs in tens of microseconds on a modern CPU; you do not need a GPU on the robot for most locomotion policies. - **On-robot fine-tuning is risky and usually unnecessary.** Most 2026 production stacks train fully in sim and deploy frozen. If you must fine-tune on hardware, fence it with hard safety limits and an outer classical controller that can take over. - **Compute budgets are modest by LLM standards.** A flagship legged-locomotion policy is a few-million-parameter MLP trained for a few GPU-hours; a dexterous-manipulation policy might be a few-GPU-days. The expensive resource is engineer time spent on reward and sim fidelity, not FLOPs. ## Why RL for robots at all Classical control is extraordinary at what it does well. Give a controls engineer a well-modeled, low-dimensional, smooth system — a motor, a drone, a 6-axis arm in free space — and they will give you a controller with provable stability, predictable behavior, and microsecond latency. For those problems, reaching for RL is engineering malpractice. 
It is slower to develop, harder to certify, and worse on the metrics that matter. RL earns its place where three conditions stack up. **Contact is everywhere and it's discontinuous.** A foot striking the ground, a finger rolling a cube, a hand wedging a peg into a hole — contact makes the dynamics hybrid and non-smooth. The equations of motion switch as contacts make and break, friction cones clip forces, and small state changes flip the system between regimes. Gradient-based controllers built on a single smooth model struggle; the model is right only between contact events. RL doesn't need a unified analytic model — it learns from rollouts that already contain all the contact transitions. **The dynamics are hard to model accurately.** Series-elastic and quasi-direct-drive actuators have their own dynamics. Cables stretch, gears have backlash and friction, soft feet deform, payloads shift. You can spend a year identifying a model that's still wrong by 20%. RL with randomization learns a policy robust to a *family* of models, which is more honest about how poorly you actually know the robot. **The dimensionality and the desired behavior are high.** A humanoid has 25-50 actuated joints. The behavior you want — walk, recover from a shove, climb stairs, carry a box — is not a setpoint; it's an emergent, context-dependent coordination of all those joints. Writing that by hand is a state-machine nightmare. RL produces *emergent gaits*: behaviors no one specified, discovered because they maximize reward. The ANYmal trot, gallop, and recovery behaviors were never coded — they emerged. > **Rule:** Choose RL when you cannot write the controller by hand *and* you can write a reward and build a good-enough simulator. If either of those is false, RL is the wrong tool. The flip side: RL gives up the things classical control gives you for free — stability guarantees, interpretability, sample-free design, and tight accuracy. 
You trade analyzability for the ability to solve problems analysis can't reach. On a quadruped scrambling over rubble that's a great trade. On a precision arm it's a terrible one.

## RL fundamentals: the MDP, reward, policy, value, return

Strip away the deep-learning machinery and RL is the theory of sequential decision-making under uncertainty. The frame is the **Markov Decision Process (MDP)**: states `s`, actions `a`, a transition model `P(s'|s,a)`, a reward function `r(s,a)`, and a discount factor `γ ∈ [0,1)`.

For a robot, the **state** is whatever the policy gets to see — joint positions and velocities, base orientation and angular velocity from the IMU, the velocity command, maybe a history of past observations and actions, maybe an exteroceptive terrain map. The **action** is the policy output — almost always target joint positions for a downstream PD controller, sometimes torques directly (rarely, because it's harder to make safe). The **transition** is the simulator's physics step. The **reward** is your scalar encoding of "good behavior."

The agent's goal is to maximize **expected return** — the discounted sum of future reward:

```
G_t = Σ_{k=0}^{∞} γ^k · r_{t+k}

Objective:  J(π) = E_{τ~π} [ Σ_t γ^t · r(s_t, a_t) ]
```

The discount `γ` (typically 0.99 for locomotion, meaning a horizon of roughly 1/(1−γ) = 100 steps) trades immediate against future reward and keeps the infinite sum finite.

Three functions carry all the weight:

- **Policy `π(a|s)`** — the controller. Maps state to a distribution over actions. For continuous control it's usually a Gaussian whose mean is a neural network output and whose variance is a learned (often state-independent) parameter. At deployment you take the mean — deterministic.
- **State-value `V^π(s)`** — expected return from state `s` under policy `π`. "How good is it to be here?"
- **Action-value `Q^π(s,a)`** — expected return from taking action `a` in state `s`, then following `π`. "How good is this move?"
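As a small worked example of the return definition above, the sketch below computes `G_t` for every step of one finite episode and the rough horizon `1/(1−γ)`. The function names are illustrative, not from any RL library, and the episode is assumed to terminate (so the infinite sum is truncated at the last reward).

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step of one finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def effective_horizon(gamma):
    """Rough number of steps over which reward still matters: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
    print(round(effective_horizon(0.99)))                  # 100
```

The backward recursion is the same trick value-based and advantage estimators rely on: each step's return is the immediate reward plus the discounted return of the next step.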
The **advantage** `A(s,a) = Q(s,a) − V(s)` measures how much better an action is than the policy's average — the single most useful quantity in policy-gradient methods, because it tells you which actions to make more or less likely:

```
A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t)

# In practice, estimated with Generalized Advantage Estimation (GAE):
δ_t = r_t + γ·V(s_{t+1}) − V(s_t)        # TD residual
Â_t = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}       # λ ∈ [0,1] trades bias vs variance
```

GAE's `λ` (commonly 0.95) is the bias-variance knob: `λ=0` is low-variance, high-bias one-step TD; `λ=1` is high-variance, unbiased Monte Carlo. Most locomotion recipes sit at 0.95.

> **Rule:** If your robot's observation isn't Markov — if the optimal action depends on history the policy can't see — either add a short observation history (stack the last N frames) or use a recurrent policy. A purely reactive MLP on a non-Markov observation will plateau and you'll blame the algorithm.

**The Markov assumption is where real robots bite you.** Motor delay, sensor latency, and unobserved terrain all break the clean MDP. The standard fixes — stacking observation history, adding an actuator-delay model in sim, or using the teacher-student trick below — are really about restoring enough state that the problem becomes Markov again.

## The algorithm landscape: model-free vs model-based, on- vs off-policy

Two axes organize the whole field, and knowing where an algorithm sits tells you most of what you need.

**Model-free vs model-based.** Model-free methods (PPO, SAC, TD3, DDPG) learn a policy and/or value function directly from experience without ever learning the transition dynamics. Model-based methods (Dreamer, MBPO, TD-MPC2) learn a model of the world and plan or generate imagined rollouts inside it. Model-based methods are far more sample-efficient — they squeeze more learning from each real interaction — which matters enormously if your data comes from a real robot.
But when you have a fast simulator, the data is nearly free, and the extra complexity and instability of learning a model often isn't worth it. **In 2026, simulated robotics is overwhelmingly model-free; real-world-only learning is where model-based methods shine.** **On-policy vs off-policy.** On-policy methods (PPO, A2C, TRPO) can only learn from data collected by the *current* policy; after each update the old data is stale and discarded. Off-policy methods (SAC, TD3, DDPG, Q-learning) learn from a replay buffer of past experience, including data from old policies. Off-policy is dramatically more sample-efficient because every transition can be reused many times. On-policy is more stable and parallelizes beautifully. The practical consequence is the central trade of the field: - **Cheap data (massively parallel sim):** use on-policy PPO. Sample inefficiency is irrelevant when you generate 200,000 steps/second. - **Expensive data (single sim, real robot, slow sim):** use off-policy SAC or TD3. You can't afford to throw experience away. This is why nearly every published legged-locomotion result uses PPO and nearly every sample-efficiency benchmark and real-robot-learning paper uses SAC. They're solving the same RL problem under opposite data economics. ## The algorithms that actually work on robots Four model-free continuous-control algorithms cover ~95% of real robotics RL. Their lineage matters: DDPG begat TD3 begat the off-policy family; TRPO begat PPO. Here's the comparison that I'd tape to the wall. 
| Algorithm | Type | Sample efficiency | Stability / robustness | Parallelism | Best use on robots |
|---|---|---|---|---|---|
| **PPO** | On-policy, model-free | Low (needs lots of steps) | High — very forgiving | Excellent (10k+ envs) | Locomotion, humanoid, anything in massively parallel sim |
| **SAC** | Off-policy, model-free | High | High (entropy-regularized) | Moderate | Sample-limited continuous control, real-robot fine-tune, manipulation |
| **TD3** | Off-policy, model-free | High | Medium (tuning-sensitive) | Moderate | Sample-limited deterministic control where SAC's entropy isn't wanted |
| **DDPG** | Off-policy, model-free | Medium | Low — brittle | Moderate | Mostly historical; use TD3 or SAC instead |

### PPO — why it dominates parallel-sim locomotion

Proximal Policy Optimization is a policy-gradient method that improves the policy while preventing each update from changing it too much. The "proximal" part is a clipped surrogate objective: it computes the ratio between the new and old policy probabilities for each action and clips it to `[1−ε, 1+ε]` (ε ≈ 0.2), so a single update can't lurch the policy into a region where the advantage estimates are no longer valid.

```
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)          # probability ratio

L_CLIP(θ) = E_t [ min( r_t(θ)·Â_t,  clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ]
```

That clipping is the whole reason PPO dominates. It makes the algorithm robust to bad hyperparameters and large, noisy advantage estimates — exactly the conditions you get when you run 4,096 parallel environments and dump a giant heterogeneous batch into one update. PPO almost never diverges, which on a project where a training run costs real wall-clock time is worth more than theoretical sample efficiency.

The pairing with massively parallel sim is the key insight. PPO is on-policy and sample-hungry, which sounds disqualifying — until you note that a single RTX-class GPU running Isaac Lab can step tens of thousands of robots simultaneously.
The sample-inefficiency that kills PPO on a real robot is a non-issue when the simulator hands you billions of steps for free. The ETH legged-locomotion line of work, the Isaac Gym ANYmal results, and most Unitree and humanoid locomotion policies are PPO. It is, frankly, boring and reliable, and that's the point. ### SAC and TD3 — sample-efficient continuous control When environment steps are expensive, you switch to off-policy methods that learn from a replay buffer. **Soft Actor-Critic (SAC)** adds an entropy bonus to the objective — it maximizes reward *and* policy randomness — which drives systematic exploration and makes the algorithm remarkably robust to hyperparameters. It learns two Q-functions (taking the min to fight overestimation), a stochastic policy, and auto-tunes the entropy temperature. SAC is my default for anything sample-limited: manipulation in a single sim, real-robot learning, or fine-tuning. It tolerates a wide range of reward scales and tends to "just work." **Twin Delayed DDPG (TD3)** is the deterministic-policy counterpart. It fixes DDPG's notorious Q-value overestimation with three tricks: twin critics (take the min), delayed policy updates (update the policy less often than the critics), and target-policy smoothing (add noise to target actions). TD3 is excellent and slightly more sample-efficient than SAC on some tasks, but it's more sensitive to exploration-noise tuning because its policy is deterministic. Choose TD3 over SAC when you specifically want a deterministic policy and you're willing to tune the exploration noise. **DDPG** is the ancestor. It works, but it's brittle and easy to destabilize; in 2026 there's no reason to start a project on DDPG when TD3 and SAC exist. > **Rule of thumb:** Parallel sim → PPO. Sample-limited → SAC (default) or TD3. Real-robot-only with no sim → consider model-based (Dreamer/TD-MPC2) or SAC with a very small step budget. 
If you're unsure, start with PPO in sim; it's the one most likely to give you a working policy on the first serious attempt.

## The sim-to-real pipeline

Almost no successful robot RL in 2026 learns on the real robot. The data is too slow, too expensive, and too dangerous to collect. The dominant paradigm is **train in simulation, transfer to reality** — and the engineering is mostly in the transfer.

The pipeline, end to end:

1. **Build the digital twin.** Accurate URDF/MJCF, mass and inertia from CAD, joint limits, and — critically — an *actuator model* that captures the real motor's torque-speed curve, delay, and PD behavior. This is the single highest-leverage step. See [robot simulation & digital twins](/posts/robot-simulation-digital-twin-ultimate-guide/).
2. **Massively parallel rollout.** Spin up thousands of randomized environment instances on GPU (Isaac Lab / Isaac Gym, MuJoCo MJX, or Brax). Collect experience at hundreds of thousands of steps per second.
3. **Train the policy** with PPO (typically), with domain randomization active from step one.
4. **Validate in sim** across held-out randomization ranges and edge cases the training distribution didn't emphasize.
5. **Export and deploy** the frozen policy (ONNX → optionally TensorRT) onto onboard compute, running at the control rate.
6. **Close the loop on hardware** — log everything, compare sim vs real trajectories, and feed the gap back into the simulator (the "real-to-sim" correction).

```
# Wall-time intuition for a legged-locomotion PPO run.
# Target total experience:        ~2e9 simulation steps (2 billion)
# Throughput (Isaac Lab, 1 GPU):  ~2e5 steps / second (4096 envs)
#
#   wall_time = 2e9 / 2e5 = 1e4 seconds ≈ 2.8 hours
#
# Manipulation with a smaller env count (~512) and heavier sim
# might run 1e4 steps/s  ->  2e9 / 1e4 = 2e5 s ≈ 2.3 days.
# Throughput, not algorithm choice, sets your wall clock.
```

The numbers are the headline: **the same locomotion task that took days on 2018-era CPU clusters now trains in a couple of hours on one GPU**, because Isaac Gym/Lab moved the entire RL loop — physics, observation assembly, reward, and policy inference — onto the GPU and eliminated the CPU-GPU transfer bottleneck that capped earlier frameworks.

> **Rule:** Spend your first week on the actuator model and the observation design, not on the algorithm. A policy trained against an actuator model that ignores motor delay will oscillate or fall on the real robot no matter how good your PPO config is.

## Domain & dynamics randomization

The reason a sim-trained policy survives reality is that you never trained it on *the* simulation — you trained it on a *distribution* of simulations. **Domain randomization (DR)** perturbs the simulator's parameters every episode so the policy must work across a range of conditions. If the real robot's true parameters fall inside that range, the policy treats reality as just another sample it has already seen.

There are two flavors. **Dynamics randomization** perturbs physics — masses, friction, motor gains, latency. **Visual domain randomization** perturbs the appearance for vision-based policies — textures, lighting, camera pose. Legged locomotion leans on the former; vision-based manipulation needs both.
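As a rough illustration of dynamics randomization, the sketch below draws one set of physics parameters at the start of each episode. The `EpisodeParams` fields, the uniform distributions, and the exact bounds are assumptions for illustration (the bounds sit inside the typical ranges in the table that follows); this is not the API of Isaac Lab or any other simulator, and a real setup would write these values into the simulator's own randomization hooks.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeParams:
    """One sampled 'robot' for a single training episode (hypothetical)."""
    mass_scale: float        # multiplier on nominal link masses
    foot_friction: float     # foot-ground friction coefficient
    motor_gain_scale: float  # multiplier on nominal PD gains
    latency_ms: float        # observation/action delay, milliseconds


def sample_episode_params(rng: random.Random) -> EpisodeParams:
    """Draw one randomized set of dynamics parameters for a new episode."""
    return EpisodeParams(
        mass_scale=rng.uniform(0.8, 1.2),        # about ±20% of nominal
        foot_friction=rng.uniform(0.4, 1.25),
        motor_gain_scale=rng.uniform(0.8, 1.2),  # about ±20% of nominal
        latency_ms=rng.uniform(0.0, 40.0),
    )


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a training run is reproducible
    for _ in range(3):
        print(sample_episode_params(rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible per environment, which helps when a specific randomized episode needs to be replayed for debugging.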
| Technique | What it randomizes | Why it bridges the gap | Typical range |
|---|---|---|---|
| **Mass / inertia DR** | Link masses, payload, CoM offset | Robot's real mass is never exactly the CAD value; payloads vary | ±10–30% of nominal |
| **Friction DR** | Ground & joint friction coefficients | Surfaces and joints differ wildly; the biggest sim-real gap for feet | 0.4–1.25 (foot-ground μ) |
| **Actuator / motor-gain DR** | PD gains, torque limits, motor strength | Real gains drift; gearboxes lose efficiency over time | ±10–25% |
| **Latency / delay DR** | Observation and action delay | Real control loops have 1–20 ms latency sim ignores by default | 0–40 ms |
| **Sensor-noise DR** | IMU drift/noise, joint-encoder noise | Real sensors are noisy and biased | Gaussian, robot-specific σ |
| **Push / disturbance injection** | Random external forces on the base | Teaches recovery; produces robust balance | impulses every few seconds |
| **Terrain randomization** | Slopes, stairs, gaps, roughness (curriculum) | Generalizes locomotion beyond flat ground | progressive difficulty |
| **Visual DR** | Textures, lighting, distractors, camera pose | Closes the appearance gap for vision policies | wide, task-dependent |

The failure modes sit at both extremes. **Too little randomization** and the policy overfits to the simulator's quirks — it exploits a friction value or contact model that doesn't exist in reality and falls over on the real floor. **Too much randomization** and the policy can't find any behavior that works across the whole insane range, so it learns a timid, conservative, low-performance controller — or nothing at all. Tuning the ranges is the real art, and **automatic domain randomization (ADR)**, where the ranges expand only as the policy masters the current ones, was a major piece of OpenAI's dexterous-hand result.

> **Rule:** Randomize the parameters you're *uncertain* about, proportional to your uncertainty.
You know your link lengths to a millimeter — don't randomize them much. You barely know your foot-ground friction — randomize it hard. DR is a way of injecting your honest model uncertainty into training. A complementary technique is **system identification**: measure the real robot to narrow the randomization ranges around the truth, then randomize around *that*. The best pipelines do both — identify what you can measure, randomize what you can't. ## Reward shaping and the reward-hacking trap The reward function is where you specify *what* you want; the policy decides *how*. This separation is RL's superpower and its sharpest knife. The optimizer is a literal genie: it maximizes exactly what you wrote, not what you meant. A locomotion reward is typically a weighted sum of many terms — a "task" term (track the commanded velocity) plus a pile of "regularization" terms (penalize energy, joint-limit violations, body height deviation, foot slip, action rate, orientation tilt). Each term has a weight you tune. Getting the *relative* weights right is most of the work. **Reward hacking** is when the policy finds a high-reward behavior that satisfies your function but violates your intent. Real examples from real projects: - A locomotion policy that **vibrates a foot rapidly against the ground** because the reward credited "contact" without penalizing wasteful motion. - A policy that **exploits a simulator bug** — sticking a foot through the floor, or harvesting energy from a contact-impulse glitch — because the sim physics permitted free reward. - A reaching policy that **knocks the target off the table** so it can never fail to "not be far from it," or learns to **hover near a sparse-reward trigger** without completing the task. - A walking policy that **falls forward in a controlled way** to maximize forward velocity for a moment, because the episode-termination penalty was too small to discourage it. 
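The weighted-sum reward described earlier in this section can be sketched as follows. This is a minimal illustration: the term names, the weights, and the `StepInfo` fields are assumptions chosen for the example, not values from any published controller.

```python
import math
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Quantities a simulator might expose per control step (hypothetical fields)."""
    cmd_vel: float      # commanded forward velocity, m/s
    base_vel: float     # measured forward velocity, m/s
    joint_power: float  # sum of |torque * joint velocity|, W
    action_rate: float  # squared change in action since the last step
    foot_slip: float    # foot speed while in contact, m/s


# Illustrative weights (assumptions). Positive = task term, negative = regularizer.
WEIGHTS = {
    "track_vel": 1.0,
    "energy": -2e-4,
    "action_rate": -0.01,
    "foot_slip": -0.1,
}


def reward_terms(s: StepInfo) -> dict:
    """Return each unweighted term so they can be logged and inspected separately."""
    vel_err = s.cmd_vel - s.base_vel
    return {
        "track_vel": math.exp(-(vel_err ** 2) / 0.25),  # equals 1.0 at perfect tracking
        "energy": s.joint_power,
        "action_rate": s.action_rate,
        "foot_slip": s.foot_slip,
    }


def total_reward(s: StepInfo) -> float:
    """Weighted sum of all terms."""
    return sum(WEIGHTS[name] * value for name, value in reward_terms(s).items())


if __name__ == "__main__":
    smooth = StepInfo(cmd_vel=1.0, base_vel=1.0, joint_power=100.0, action_rate=0.1, foot_slip=0.0)
    jittery = StepInfo(cmd_vel=1.0, base_vel=1.0, joint_power=400.0, action_rate=5.0, foot_slip=0.5)
    print(total_reward(smooth), total_reward(jittery))
```

Keeping the unweighted terms in their own function makes it easy to plot each one per episode; a term that suddenly dominates the sum is often the first visible sign of a hacked reward.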
The defenses: - **Penalize the means, not just reward the ends.** Add energy, smoothness (action-rate), and joint-limit penalties. Most "natural-looking" gait reward is really these regularizers doing their job. - **Use termination conditions as hard constraints.** Falling, self-collision, or limit violation should end the episode with a penalty — much more reliable than trying to express "don't fall" as a soft reward term. - **Watch the rendered rollouts, every time.** Numbers lie; video doesn't. Half of reward bugs are obvious the instant you watch the policy. - **Curriculum and command sampling.** Start easy (low velocities, flat ground) and increase difficulty so the policy doesn't find a degenerate early solution and lock in. > **Rule:** Budget more time for reward design and debugging than for picking and tuning the algorithm. The algorithm is a solved commodity; your reward is a bespoke specification with bugs in it. Assume reward hacking is happening and go looking for it. A note on **sparse vs dense reward.** Sparse reward (1 for success, 0 otherwise) is honest — it can't be gamed by definition — but it's nearly impossible to learn from on hard tasks because the policy rarely stumbles onto success. Dense (shaped) reward learns fast but invites hacking. The pragmatic answer is dense reward built carefully, plus sparse success metrics you track separately to detect when dense-reward optimization has drifted from what you actually want. ## Imitation learning: BC, DAgger, and how it complements RL Sometimes you have demonstrations — teleoperated grasps, motion-capture of a human walking, an existing MPC controller's trajectories. **Imitation learning** turns demonstrations into a policy, and it's a powerful complement to RL. **Behavior cloning (BC)** is supervised learning: collect (state, expert-action) pairs and train the policy to predict the expert's action. It's simple, stable, and fast. 
Its fatal flaw is **compounding error / covariate shift**: the policy makes a small mistake, drifts into a state the expert never visited, has no idea what to do there, makes a bigger mistake, and spirals. A BC policy is only as good as its coverage of the states it will actually encounter. **DAgger (Dataset Aggregation)** fixes covariate shift by iterating: run the current policy, collect the states *it* visits, ask the expert to label the correct action in those states, add them to the dataset, retrain. Over rounds the dataset comes to cover the policy's own state distribution and the compounding-error problem largely goes away. The catch is you need an expert you can query on-demand — easy if the expert is an MPC controller or a privileged-state teacher, harder if it's a human. How they complement RL: - **Warm-starting.** BC the policy from demonstrations, then refine with RL. The policy starts in a reasonable region instead of flailing randomly, which is huge on tasks where random exploration almost never finds reward. - **Style and reference.** Motion-capture clips give a humanoid a human-like gait reference; RL then makes it robust. (Adversarial-motion-priors and similar methods reward the policy for looking like the reference distribution.) - **The teacher-student recipe (next section) is itself a form of imitation** — the student is DAgger-distilled from the teacher. > **Rule:** Use imitation to get into the right neighborhood; use RL to make it robust. Pure BC rarely survives contact with a real robot's distribution shift; pure RL from scratch wastes enormous compute exploring states demonstrations could have handed you for free. ## Teacher-student & privileged learning This is the single most important practical recipe in legged RL, and it's worth understanding precisely because it solves the perception problem that naïve sim-to-real ignores. 
The problem: in simulation you know *everything* — the exact friction under each foot, the true contact forces, the terrain height around the robot, the disturbance pushing the base. On the real robot you know almost none of that; you have noisy joint encoders, an IMU, and maybe a depth camera. A policy trained on privileged simulator state will be brilliant in sim and useless in reality because its inputs don't exist on the hardware. The solution is a two-stage **teacher-student** pipeline (the ETH Zurich / Hutter lab "learning by cheating" recipe): **Stage 1 — train the teacher.** Train a policy with RL (PPO) that gets full privileged state as input: true friction, contact states, terrain map, external forces. Because its inputs are clean and complete, the teacher learns an excellent policy fast. It could never run on the real robot — that's fine, it's not meant to. **Stage 2 — distill the student.** Train a student policy that uses *only* deployable observations — proprioception (joint angles/velocities, IMU) plus a short history of past observations and actions — to imitate the teacher's actions via supervised learning / DAgger. The history is the key: the student learns to *infer* the privileged information (am I on ice? did something just push me?) from the recent time series of what it can actually measure. This is implicit state estimation, learned end-to-end. The result is a student that matches teacher performance using only onboard sensors. ANYmal's robust blind locomotion over rough terrain (Lee et al., *Science Robotics*, 2020) was exactly this: a teacher with terrain knowledge, distilled into a proprioception-only student that walked over rubble, mud, snow, and stairs it couldn't see, by feeling the terrain through its legs. > **Rule:** When the gap between sim-available and robot-available information is large, don't try to train one policy to do everything. 
Split it: a teacher that learns the skill with cheating inputs, and a student that learns to perceive well enough to execute it. Decoupling "learn the skill" from "learn to perceive" is why this works. Variants add an explicit **belief encoder** or a recurrent student, and a related family — RMA (Rapid Motor Adaptation) — trains an adaptation module that estimates a latent "environment embedding" online, achieving the same robustness with a slightly different architecture. The common thread is: learn online estimation of the unobservable, using a history of the observable. ## Landmark results: legged, dexterous, humanoid Three lines of work define what RL can do on real robots, and they're the case studies every practitioner should know. ### Legged locomotion (ETH Zurich, Hutter lab; ANYmal) The ANYmal program turned legged RL from a curiosity into a deployable technology. The 2019 *Science Robotics* result (Hwangbo et al.) trained control policies in sim with a learned actuator model — a neural net mapping commanded to realized torque, capturing the series-elastic actuators' dynamics — and transferred them to the real ANYmal, achieving faster, more robust locomotion and a dynamic recovery-from-fall behavior that classical methods struggled with. The 2020 follow-up (Lee et al.) added the teacher-student recipe for **blind** rough-terrain locomotion. The throughline: a learned actuator model plus randomization plus teacher-student made sim-to-real reliable, and it's now the standard recipe across the industry. See [legged & quadruped robot hardware](/posts/legged-quadruped-robot-hardware-ultimate-guide/). The Isaac Gym era (2021 onward) collapsed training time: the "Legged Gym" / RSL-RL stack trains an ANYmal or Unitree-class quadruped locomotion policy in minutes to a couple of hours on one GPU. This is what made RL locomotion accessible to small teams. 
### Dexterous manipulation (OpenAI; Dactyl / Rubik's Cube) OpenAI's Dactyl trained a Shadow Hand to reorient a block, and later to manipulate a Rubik's Cube one-handed, entirely in sim with PPO and massive domain randomization. The 2019 Rubik's Cube result introduced **automatic domain randomization (ADR)** — automatically expanding the randomization ranges as the policy improved — which produced a policy robust enough to handle a real hand wearing a rubber glove, with fingers tied together, and other perturbations it never saw in training. The lesson: extreme randomization + ADR can bridge a very hard manipulation gap, but it cost enormous compute (thousands of years of simulated experience). Dexterous manipulation remains far less sample-friendly than locomotion because contact-rich finger-object interaction is harder to simulate accurately. ### Humanoid walking (Unitree, and the 2024-2026 wave) The humanoid surge brought the legged recipe to bipeds. Unitree's H1/G1 and a wave of humanoid programs use PPO-trained locomotion policies, often with motion-capture references (adversarial motion priors / DeepMimic-style style rewards) to get human-like gaits, plus the teacher-student and randomization machinery from the quadruped world. Bipedal balance is less forgiving than quadrupedal — smaller support polygon, higher CoM — so the disturbance-rejection and recovery behaviors matter more, and the sim actuator and contact fidelity bar is higher. The 2024-2026 humanoid demos walking, climbing stairs, and recovering from shoves are overwhelmingly RL locomotion stacks. See [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/). > **Pattern across all three:** the algorithm (PPO) is the *least* interesting part. The wins came from the actuator model, the randomization strategy, and the teacher-student / privileged-learning structure. Copy those, not just the optimizer. 
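The ADR idea from the Dactyl work above, ranges that widen only as the policy masters the current ones, can be sketched in a few lines. This is a deliberately simplified illustration, not OpenAI's implementation: the published method is more involved, and the `AdrRange` class, its thresholds, and the single `success_rate` signal are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AdrRange:
    """One randomized parameter whose bounds grow as the policy improves."""
    low: float
    high: float
    step: float      # how much to widen or narrow per update
    min_low: float   # hard limits so the range cannot grow without bound
    max_high: float

    def update(self, success_rate: float, expand_at: float = 0.8, shrink_at: float = 0.4) -> None:
        """Widen the range on good performance, narrow it on poor performance."""
        if success_rate >= expand_at:
            self.low = max(self.min_low, self.low - self.step)
            self.high = min(self.max_high, self.high + self.step)
        elif success_rate <= shrink_at:
            mid = 0.5 * (self.low + self.high)
            # Never let the bounds cross the midpoint.
            self.low = min(mid, self.low + self.step)
            self.high = max(mid, self.high - self.step)
        # Between the two thresholds the range is left unchanged.


if __name__ == "__main__":
    friction = AdrRange(low=0.9, high=1.1, step=0.05, min_low=0.4, max_high=1.25)
    for rate in (0.9, 0.9, 0.6, 0.3):
        friction.update(rate)
        print(f"success={rate:.1f} -> friction in [{friction.low:.2f}, {friction.high:.2f}]")
```

Starting narrow and expanding on success avoids the "too much randomization" failure mode at the start of training, while still ending with wide ranges once the policy can handle them.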
## Learned vs classical control

This is the question every team argues about, so let's be concrete. Classical control here means model-based methods — PID, LQR, and especially **Model Predictive Control (MPC)**, which optimizes a control sequence over a receding horizon against a dynamics model in real time. RL means a policy trained offline and run as a fast feedforward map.

| Dimension | RL policy | MPC / classical |
|---|---|---|
| **Model requirement** | Needs a good *simulator*, not an analytic model | Needs an accurate *online* dynamics model |
| **Contact-rich dynamics** | Excellent — learns through contact | Hard — contact makes online optimization expensive/brittle |
| **Online compute** | Tiny — one forward pass (10s of µs) | Heavy — solve an optimization every control step |
| **Reactivity / latency** | Constant, low latency | Depends on solver convergence; can spike |
| **Accuracy / precision** | Approximate; no guarantees | High; can hit tight tolerances |
| **Stability guarantees** | None (empirical robustness only) | Provable (within model validity) |
| **Interpretability** | Low — a black-box net | High — you can read the cost and constraints |
| **Constraint handling** | Soft, via reward (can be violated) | Hard, explicit constraints respected |
| **Adaptation to new task** | Retrain | Re-tune cost/constraints (often faster) |
| **Development cost** | High up front (sim + reward + training) | High expertise, but well-trodden |

When **RL wins**: the dynamics are hard to model online, contacts are numerous and discontinuous, the state/action space is high-dimensional, and you want a reactive policy with constant tiny latency. Legged locomotion over unknown terrain, dexterous in-hand manipulation, whole-body humanoid control, recovery from disturbances. MPC struggles here because solving a contact-rich optimization at 1 kHz is brutal and the model is wrong anyway.
When **MPC/classical wins**: the model is good, the task is accuracy-critical, constraints are hard and must never be violated, and you need stability guarantees or certification. A 6-axis arm tracing a weld seam to 0.1 mm, a CNC-like motion, a drone trajectory in free space, anything safety-rated. RL's lack of guarantees and its soft constraints are disqualifying here. See [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/) for the classical manipulation stack. The honest 2026 answer is **hybrid**. The strongest legged systems use RL for the reactive low-level policy and classical methods for high-level planning, footstep selection, or as a safety supervisor. MPC can generate references that RL tracks robustly; RL can warm-start or replace the parts of an MPC stack that the model handles badly. Treating it as RL-vs-MPC religious war misses that they're tools for different layers. > **Rule:** If you can write a good dynamics model and the task demands accuracy or guarantees, use MPC. If the dynamics are dominated by hard-to-model contact and you need robustness over precision, use RL. Most real robots want both, at different layers of the stack. ## Deploying a policy A trained policy is a pile of weights in a checkpoint. Getting it onto a robot, running reliably in the [real-time control loop](/posts/real-time-control-systems-ultimate-guide/), is an embedded-systems job that ML people routinely underestimate. **Inference rate.** The policy runs inside the control loop, so it must produce an action every control period. Typical rates: - **Locomotion:** policy at 50-100 Hz, outputting target joint positions, with a downstream PD controller running faster (200 Hz-1 kHz) to track them. This two-rate structure is standard — the policy sets targets, a stiff joint-level controller does the fast tracking. 
- **Manipulation:** 30-60 Hz for vision-conditioned policies (camera-bound), up to several hundred Hz for proprioceptive contact-rich control.

**Export path.** Train in PyTorch, then export to **ONNX** for a framework-independent, dependency-light artifact. On NVIDIA onboard compute (Jetson Orin), compile the ONNX to **TensorRT** for lower latency and FP16/INT8 if you need it. For CPU deployment, ONNX Runtime is plenty fast for small MLPs.

**Onboard compute reality check.** This surprises people: **most locomotion policies do not need a GPU on the robot.** A typical policy is a 2-3 layer MLP with a few hundred to ~1024 units per layer — on the order of 0.1-2 million parameters. A forward pass is a handful of small matrix multiplies that run in **tens of microseconds on a modern CPU core**. You add a GPU onboard only when the policy consumes images (vision-based manipulation, exteroceptive locomotion with a learned terrain encoder).

```
# Locomotion policy inference cost (rough)
# Net: MLP [obs=48] -> 512 -> 256 -> 128 -> [act=12]
# FLOPs per forward pass ≈ 2 * (48*512 + 512*256 + 256*128 + 128*12)
#                        ≈ 2 * 190k ≈ 0.4 MFLOP
#
# At 50 Hz that's 20 MFLOP/s — utterly trivial.
# A single CPU core (~10s of GFLOP/s) runs this in ~tens of µs.
# => No onboard GPU needed for proprioceptive locomotion.
```

**Control-loop integration.** The policy is one block in a hard-real-time loop. It must: read the latest observation (assembled to *exactly* match the sim observation — same order, same scaling, same history length), run inference deterministically (no dynamic allocation, no GC pauses), and write target positions to the joint controllers, all within the period. A jitter spike that misses the deadline can destabilize a balancing robot. Run the policy thread at real-time priority, preallocate everything, and never let it touch the network or filesystem in the hot path.
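One way to keep the sim and robot observations identical is to define the layout in a single module that both sides import. The sketch below illustrates that idea under stated assumptions: the field names, sizes, scales, history length, and action scaling are all hypothetical, and a real deployment would port the same logic to preallocated buffers in a compiled language.

```python
from collections import deque

# Hypothetical layout: (field name, size, scale). Order here IS the observation order.
FIELDS = (("base_ang_vel", 3, 0.25), ("joint_pos", 12, 1.0), ("joint_vel", 12, 0.05))
HISTORY_LEN = 3      # number of stacked frames (assumption)
ACTION_SCALE = 0.25  # radians per unit of policy output (assumption)
ACTION_CLIP = 1.0    # policy outputs are clipped to [-1, 1] before scaling


class ObservationBuilder:
    """Assembles the stacked observation; used unchanged in sim and on the robot."""

    def __init__(self) -> None:
        frame_size = sum(size for _, size, _ in FIELDS)
        # Start with zero frames so the output length is constant from step one.
        self._history = deque(
            [[0.0] * frame_size for _ in range(HISTORY_LEN)], maxlen=HISTORY_LEN
        )

    def push(self, raw: dict) -> list:
        """Scale one raw sensor reading, append it, and return the stacked observation."""
        frame = []
        for name, size, scale in FIELDS:
            values = raw[name]
            if len(values) != size:
                raise ValueError(f"{name}: expected {size} values, got {len(values)}")
            frame.extend(v * scale for v in values)
        self._history.append(frame)
        # Oldest frame first, newest frame last.
        return [x for f in self._history for x in f]


def action_to_joint_targets(action, default_pose):
    """Clip, scale, and offset the policy output into joint position targets."""
    clipped = [max(-ACTION_CLIP, min(ACTION_CLIP, a)) for a in action]
    return [q0 + ACTION_SCALE * a for q0, a in zip(default_pose, clipped)]


if __name__ == "__main__":
    builder = ObservationBuilder()
    obs = builder.push({"base_ang_vel": [1, 2, 3], "joint_pos": [0.1] * 12, "joint_vel": [2.0] * 12})
    print(len(obs), action_to_joint_targets([2.0, -0.5], [0.0, 1.0]))
```

The length check on each field is cheap insurance: a swapped or truncated sensor array fails loudly at startup instead of silently feeding the policy garbage.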
> **Rule:** Your deployment observation must be *byte-for-byte equivalent in meaning* to your training observation — same fields, same units, same normalization, same history stacking, same action scaling and clipping. The most common deployment bug isn't the network; it's a mismatched observation or action transform between sim and robot. Write the observation-assembly code once and share it between sim and hardware. ## On-robot fine-tuning, safety & limitations **On-robot fine-tuning** sounds appealing — close the last bit of the sim-to-real gap by learning on the real machine — and it's mostly a trap. Real data is slow (one robot, real-time), exploration is dangerous (a half-trained policy flails), and the sample-hungry algorithms that work in sim (PPO) are exactly wrong here. If you must, use an off-policy method (SAC) with a tiny step budget, initialize from the sim policy, constrain exploration noise hard, and run an outer classical safety controller that overrides anything dangerous. In practice, **most 2026 production stacks deploy a frozen sim-trained policy** and improve it by improving the simulator, not by learning on hardware. **Safety** is the hard limitation that keeps RL out of certified, high-consequence applications. A learned policy has: - **No stability guarantees.** Robustness is empirical — it worked across your randomization and test cases — not proven. Out-of-distribution inputs can produce arbitrary outputs. - **Soft constraints.** "Don't exceed joint limits" lives in the reward and can be violated, unlike MPC's hard constraints. - **No interpretability.** When it fails, you can't read off *why* from the weights. 
The mitigations are architectural, not algorithmic: **action clamping and rate limiting** at the joint level (a learned policy should never be able to command beyond hardware limits), a **classical safety supervisor / runtime monitor** that detects bad states (excessive tilt, limit approach) and triggers a safe fallback (damping-to-stop, sit-down), **extensive out-of-distribution testing**, and **conservative deployment** (don't run the policy in regimes far from its training distribution). For functional-safety context this is the same defense-in-depth philosophy as any robot — the learned policy is treated as an untrusted component wrapped in trusted guards. **Other limitations worth stating plainly:** - **Sim-to-real gap never fully closes.** You manage it; you don't eliminate it. Some tasks (precise force control, deformable objects, complex friction) have gaps too large for current sim. - **Reward specification is hard.** As covered, the reward is a buggy spec and the optimizer exploits it. - **Generalization is narrow.** A policy trained for one robot and one task transfers poorly to others. There's no free lunch across embodiments yet (large robot-foundation-model efforts are early). - **Reproducibility is rough.** RL training is seed-sensitive; "it worked once" is not the same as "it works." > **Rule:** Treat a learned policy as an untrusted component. Wrap it in hard joint-level limits and a classical safety monitor that can take over. Never let the network be the only thing standing between your robot and a hardware-damaging command. ## Data & compute budget The good news for robotics RL: by the standards of large language models, the compute is small. The expensive resource is *engineer time*, not FLOPs. **Policy size.** Locomotion policies are tiny: 2-3 hidden layers, a few hundred K to ~2M parameters. Manipulation and vision-conditioned policies are larger (CNN/transformer front-ends) but still modest. These are not big models. 
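To make "tiny" concrete, the parameter and FLOP counts of a fully connected policy follow directly from its layer sizes. The sketch below uses the example network from the deployment section (48 inputs, hidden layers of 512, 256, and 128, 12 outputs) plus a wider hypothetical variant; the helper names are illustrative.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


def mlp_flops_per_forward(layer_sizes):
    """Approximate FLOPs per forward pass: one multiply and one add per weight."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return 2 * sum(n_in * n_out for n_in, n_out in pairs)


if __name__ == "__main__":
    small = [48, 512, 256, 128, 12]     # the example locomotion net
    wide = [48, 1024, 1024, 1024, 12]   # hypothetical upper-end variant
    for name, sizes in (("small", small), ("wide", wide)):
        params = mlp_param_count(sizes)
        flops = mlp_flops_per_forward(sizes)
        print(f"{name}: {params:,} parameters, {flops / 1e6:.2f} MFLOP per forward pass")
```

The small network comes out just under 0.2 million parameters and the wide one a little over 2 million, which brackets the range quoted above.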
**Training experience.** Locomotion needs roughly 1-5 billion simulation steps. Dexterous manipulation with heavy randomization can need far more (OpenAI's Rubik's Cube consumed the equivalent of thousands of simulated years). Most tasks land in the billions-of-steps range. **Wall-clock and hardware.** With massively parallel GPU sim: - **Quadruped locomotion (flat + rough terrain):** ~10 minutes to ~3 hours on a single modern GPU. - **Humanoid locomotion:** a few hours to ~1 day on one GPU, more if vision-conditioned. - **Dexterous manipulation:** GPU-days, sometimes a small cluster, because the sim is heavier and the randomization wider. **The cost reality:** a flagship locomotion policy costs single-digit to low-tens of dollars of GPU time. The real budget is the weeks of engineer time spent on the simulator's actuator model, the reward function, the observation design, and the sim-to-real debugging. **Optimize for engineer iteration speed, not GPU cost.** A faster sim that lets you run ten experiments a day is worth more than a marginally better algorithm. > **Rule:** Don't buy a cluster for robot RL; buy one good GPU and a fast simulator, and spend the saved money on the engineer who designs the reward and the actuator model. That's where the actual difficulty — and the actual cost — lives. ## Frequently asked questions **Do I need to learn on the real robot?** Almost never in 2026. The dominant paradigm is train-in-sim, deploy-frozen. Real-robot learning is slow, dangerous, and sample-limited. Spend the effort on simulator fidelity and domain randomization instead. On-robot fine-tuning is a niche, last-resort technique fenced by heavy safety guards. **PPO or SAC — which should I start with?** If you have a massively parallel simulator (Isaac Lab), start with PPO; it's the most likely to give you a working policy on the first serious attempt and it scales to thousands of environments. 
If your data is expensive (single sim, real robot, slow sim), use SAC for its sample efficiency. TD3 is a deterministic-policy alternative to SAC; DDPG is obsolete — skip it. **Why does PPO dominate locomotion if it's sample-inefficient?** Because with massively parallel sim, samples are nearly free — you generate hundreds of thousands of steps per second. PPO's robustness and stability then matter far more than its sample efficiency. Sample-inefficiency only hurts when data is scarce, which sim isn't. **What's the single most important factor for sim-to-real success?** Simulator fidelity, especially the actuator model, plus appropriate domain randomization. The RL algorithm is rarely the bottleneck. A learned or carefully identified actuator model that captures motor delay and torque limits is the highest-leverage thing you can build. **What is teacher-student / privileged learning and why does everyone use it?** You train a teacher policy with access to information available only in sim (true friction, contact forces, terrain map), which lets it learn the skill quickly. Then you distill it into a student that uses only onboard sensors plus a short observation history, so the student learns to *infer* the privileged information online. It decouples learning the skill from learning to perceive, and it's the standard recipe for robust legged locomotion. **Is my reward function going to get hacked?** Yes, assume it will. The optimizer maximizes exactly what you wrote, not what you meant. Penalize the means (energy, smoothness, limits), use hard termination conditions for failures, and *watch the rendered rollouts* — most reward bugs are obvious on video and invisible in the reward curve. **Can RL replace MPC and classical control?** No, and you shouldn't want it to. RL wins on contact-rich, hard-to-model, high-dimensional tasks; MPC and classical control win on well-modeled, accuracy-critical, constraint-hard, certification-needing tasks. 
The best systems are hybrids that use each where it's strong. Don't put a learned policy on a precision weld seam.

**How much compute do I need?** Less than you think. A quadruped locomotion policy trains in minutes to hours on a single modern GPU; the policy itself is a small MLP, typically a few hundred thousand to about two million parameters. Dexterous manipulation is heavier (GPU-days). The expensive resource is engineer time on reward and sim design, not GPU hours.

**Do I need a GPU on the robot?** For proprioceptive locomotion, no — the policy is a small MLP that runs in tens of microseconds on a CPU core. You need onboard GPU only when the policy consumes images (vision-based manipulation, learned terrain encoders from depth/camera). See [robot sensors](/posts/robot-sensors-ultimate-guide/) for what those inputs look like.

**What framework should I use in 2026?** Isaac Lab (NVIDIA) is the dominant massively-parallel framework, built on Isaac Sim, succeeding the original Isaac Gym. MuJoCo (now with the GPU-accelerated MJX) and Brax are strong alternatives, especially for research and lighter-weight setups. For the RL algorithm code, RSL-RL (PPO, from ETH) and Stable-Baselines3 / CleanRL are common. See [robot simulation & digital twins](/posts/robot-simulation-digital-twin-ultimate-guide/).

**Why does my policy work in sim but fall on the real robot?** The usual suspects, in order: (1) observation/action mismatch between sim and hardware — wrong order, scaling, units, or history length; (2) actuator model in sim doesn't capture real motor delay/limits; (3) insufficient or wrong domain randomization, so the policy overfit sim; (4) control-loop latency or jitter on the robot the policy never saw. Check the observation pipeline first — it's the most common bug.
**How do imitation learning and RL fit together?** Use imitation (behavior cloning, DAgger) to get the policy into a sensible region or to provide a style reference (e.g., human motion-capture for humanoid gait), then use RL to make it robust and high-performance. Pure BC suffers compounding error and rarely survives the real distribution; pure from-scratch RL wastes compute exploring states demonstrations could have provided. ## Changelog - **2026-06-08** — Initial publication. --- # Drone & UAV Hardware: The Ultimate Guide URL: https://blog.robo2u.com/posts/drone-uav-hardware-ultimate-guide/ Published: 2026-06-07 Updated: 2026-06-20 Tags: drones, uav, quadcopter, flight-controller, esc, propellers, multirotor, fpv, robotics-hardware, guide Reading time: 38 min > A UAV engineer's 2026 deep dive into drone hardware: airframes, BLDC motors and Kv, props, BLHeli_32/AM32 ESCs with DShot, Betaflight vs PX4 vs ArduPilot, sensor fusion, LiPo packs, thrust-to-weight, and how to size a multirotor. A multirotor is the purest control problem in robotics dressed up as a toy. Four spinning props, no moving control surfaces, no steering linkage — just four numbers (the throttle to each motor) and a control loop fast enough to keep an inherently unstable object hovering in the air. Everything you bolt to the frame exists to serve that loop: the IMU that tells it which way is down, the ESCs that turn its commands into phase currents, the battery that has to deliver 100+ amps without sagging the bus voltage into a brownout. Get the loop and its sensors right and a 250 g quad will hold position in gusts. Get them wrong and the same hardware oscillates itself into the ground in two seconds. This guide is about the hardware underneath that loop, from the perspective of someone who has built, flown, and crashed a lot of these. 
We will treat the multirotor as the underactuated robot it is, then work outward: airframe and size classes, the BLDC motors and how to pick Kv, propellers and prop-motor-ESC matching, ESCs and DShot, flight controllers and the three firmware camps, the sensor suite and why EKF fusion is non-negotiable, power and voltage sag, the sizing math for thrust-to-weight and flight time, payloads and gimbals, control modes, the major drone classes, and where Remote ID leaves you in 2026. **The take**: A multirotor has four actuators and six degrees of freedom, so it is underactuated — it cannot move sideways without first tilting, and it controls attitude entirely through differential thrust between props. That means the whole machine is a thrust-vectoring exercise running on a control loop, and the two things that decide whether it flies well are (1) a thrust-to-weight ratio of at least 2:1 so the controller has authority to spare, and (2) a clean, well-isolated IMU feeding a loop fast enough (1–8 kHz on the gyro) to catch the airframe before it diverges. Pick the motor-prop-ESC trio together against your target voltage and all-up weight; never pick them one at a time. If you remember nothing else: size for thrust-to-weight first, match the prop to the motor and the ESC to the prop's current draw, and treat the IMU mount as a control component, not a screw hole. Companion reading: [brushless DC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), [robot sensors](/posts/robot-sensors-ultimate-guide/), [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/), [real-time control systems](/posts/real-time-control-systems-ultimate-guide/), and [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [The multirotor as a robot](#multirotor-as-robot) 3. 
[The airframe: size classes, layouts, materials](#airframe) 4. [BLDC motors for props: Kv, stator size, sizing](#motors) 5. [Propellers: diameter, pitch, thrust, efficiency](#propellers) 6. [ESCs: BLHeli_32, AM32, DShot, current rating](#escs) 7. [Flight controllers: MCU, sensors, firmware, the loop](#flight-controllers) 8. [The sensor suite and sensor fusion](#sensors) 9. [Power: LiPo chemistry, C-rating, voltage sag, packs](#power) 10. [Thrust-to-weight and hover throttle](#thrust-weight) 11. [Flight-time estimation](#flight-time) 12. [Payloads and gimbals](#payloads) 13. [Control modes: acro, angle, position hold](#control-modes) 14. [Drone classes and use cases](#classes) 15. [Regulatory note: Remote ID and weight categories](#regulatory) 16. [Selecting a UAV platform](#selection) 17. [Frequently asked questions](#faq) ## Key takeaways - A quadcopter is **underactuated**: four rotor thrusts control six degrees of freedom. It produces only a single body-frame thrust vector (up) plus three torques (roll, pitch, yaw), so all horizontal motion comes from *tilting the thrust vector* — pitch forward, then accelerate. There is no direct sideways force. - Attitude is controlled by **differential thrust**. Roll/pitch come from speeding up one side and slowing the other; yaw comes from the reaction torque difference between clockwise and counter-clockwise props (which is why props alternate spin direction and why a quad needs both CW and CCW props). - **Thrust-to-weight ratio (TWR) is the master spec.** Aim for ≥ 2:1 for stable flight with control margin, 4:1–8:1 for FPV freestyle/racing, ~1.5:1 minimum for a heavy cinematic or mapping platform that you will fly gently. Hover throttle should land near 50% or lower. - **Motor Kv must match prop and voltage.** Low Kv (e.g. 900–1100 Kv on 6S) swings big props slowly and efficiently; high Kv (e.g. 1700–2400 Kv on 6S) spins small props fast for snappy response. 
The product Kv × battery voltage sets unloaded RPM; the prop sets how much of that RPM you actually reach.
- **The prop-motor-ESC trio is one decision.** The prop sets thrust and current draw; the motor must have the stator size and Kv to swing it; the ESC must be rated above the peak phase current that combination pulls. Pick them together against all-up weight (AUW) and pack voltage.
- **DShot replaced analog PWM.** DShot300/600 is a digital, checksummed, bidirectional ESC protocol — no calibration, telemetry (eRPM) back to the FC for RPM-based filtering, and immunity to signal-level drift. Use DShot600 on 8 kHz loops, DShot300 on longer wires or for margin.
- **BLHeli_32 and AM32** are the dominant 32-bit ESC firmwares. BLHeli_32 is closed and was effectively frozen after the 2024 ownership/export issues; **AM32 is the open-source successor** and is what new designs ship with in 2026. Both run trapezoidal/six-step commutation, not FOC — props always spin fast, so FOC's low-speed smoothness buys nothing here.
- **Three flight-controller firmwares own the field**: Betaflight (FPV/acro, blisteringly tuned rate loops on STM32 F4/F7/H7), PX4 and ArduPilot (autonomous/enterprise, full position control, mission planning, on H7-class MCUs and Pixhawk-standard hardware). They solve different problems; don't put PX4 on a 5-inch race quad or Betaflight on a survey drone.
- The control stack is a **nested loop**: an inner *rate* loop (gyro → angular velocity, 1–8 kHz) inside an *attitude* loop (IMU fusion → angle) inside an outer *position* loop (GPS/optical flow → velocity/position, 10–100 Hz). The fast loop lives closest to the gyro; position is slow and tolerant.
- **The IMU is a control component.** A 6-axis gyro/accel (Bosch BMI270, InvenSense ICM-42688-P) feeds the rate loop. Soft-mounting it and filtering motor-frequency vibration (gyro notch filters, RPM filtering from DShot telemetry) is the difference between a clean tune and a hot, oscillating, inefficient mess.
- **Sensor fusion via an EKF** turns noisy gyro + accel + mag + baro + GPS into a single state estimate. The gyro is fast but drifts; the accel gives gravity (long-term level) but is noisy under acceleration; GPS gives absolute position slowly. The EKF weighs each by its trust and fuses them. No fusion, no position hold. - **LiPo packs rule multirotors** for energy density per gram and high discharge (C-rating); Li-ion (21700 cells) wins for long-endurance/efficiency builds where you trade peak current for Wh/kg. Voltage sag under load is the real-world killer — a "100C" rating is mostly marketing; size for measured sag, not the label. - **Flight time is dominated by hover power and pack energy**, roughly `t ≈ (capacity_Wh × usable_fraction) / hover_power_W`. Bigger props at lower disc loading and higher TWR margin (so you cruise at low throttle) both extend it; carrying a heavier pack has diminishing returns once the pack's own weight dominates. - **Remote ID is mandatory** in the US and EU for most drones above the smallest class as of 2026. Sub-250 g matters as a regulatory threshold (lighter registration/RID burden in many jurisdictions), which is exactly why the 249 g "sub-250" class exploded. ## The multirotor as a robot Start here, because every hardware choice downstream follows from it. A quadcopter is a rigid body floating in 3D space. A rigid body has six degrees of freedom: three translations (x, y, z) and three rotations (roll, pitch, yaw). To fully command six DOF independently you would need at least six independent control inputs. A quad has four — the four motor thrusts. Four inputs, six DOF. That mismatch is the definition of an **underactuated** system, and it is the single most important fact about the machine. What can four upward-pointing rotors actually produce? Sum the four thrusts and you get one force, pointing straight up out of the airframe's belly — there is no propeller anywhere that can push the quad sideways. 
Difference the thrusts and you get three torques: - **Roll**: more thrust on the left props than the right (or vice versa) tilts the body about its forward axis. - **Pitch**: more thrust front vs. back tilts it nose-up or nose-down. - **Yaw**: this one is sneakier. Each spinning prop applies a reaction torque on the airframe equal and opposite to the torque it puts into the air. If all four props spun the same way, the airframe would slowly spin the opposite way and you could never stop it. So two props spin clockwise and two counter-clockwise, their reaction torques cancel in the hover, and you *yaw* by deliberately unbalancing them — speed up the two CW props and slow the two CCW props and the net reaction torque rotates the airframe. So the control authority of a quad is: **one thrust magnitude + three body torques = four independent quantities**, exactly matching the four motors. That is the "X" mixer at the heart of every flight controller — a 4×4 matrix that turns (throttle, roll, pitch, yaw) commands into four motor outputs. The consequence for flight: to move horizontally, the quad **must first tilt**. Want to fly forward? Pitch the nose down a few degrees so the thrust vector points slightly forward, and the horizontal component accelerates you. To stop, pitch back. This coupling of attitude and translation is why the loop is nested — you cannot control position without controlling attitude first, and you cannot control attitude without controlling angular rate first. And the body is unstable. Left alone, a hovering quad does not self-right like a fixed-wing aircraft with dihedral; tiny asymmetries (a slightly heavier arm, a prop nick, a gust) make it tip, and once tilted the thrust vector points partly sideways, which accelerates the tilt. Without active stabilization it falls over in a fraction of a second. The flight controller is not a convenience — it is what makes the vehicle a vehicle. 
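To make the mixer concrete, here is a minimal sketch of the 4×4 mapping from (throttle, roll, pitch, yaw) commands to four motor outputs. The motor numbering, spin directions, sign conventions, and the `MIX` and `mix` names are assumptions chosen for illustration; every firmware defines its own layout.

```python
# Hypothetical X-quad mixer. Motor order, spin directions, and sign
# conventions are assumptions for illustration; real firmwares define their own.

# One row per motor: (throttle, roll, pitch, yaw) mixing weights.
# Roll > 0 rolls right (left side lifts), pitch > 0 lifts the nose,
# yaw > 0 speeds up the CW props and slows the CCW props.
MIX = (
    (1.0, -1.0, +1.0, -1.0),  # motor 0: front-right, CCW prop
    (1.0, -1.0, -1.0, +1.0),  # motor 1: rear-right,  CW prop
    (1.0, +1.0, -1.0, -1.0),  # motor 2: rear-left,   CCW prop
    (1.0, +1.0, +1.0, +1.0),  # motor 3: front-left,  CW prop
)


def mix(throttle, roll, pitch, yaw):
    """Map (throttle, roll, pitch, yaw) commands to four motor outputs in [0, 1]."""
    cmd = (throttle, roll, pitch, yaw)
    raw = [sum(w * c for w, c in zip(row, cmd)) for row in MIX]
    # Simple clamp; a real mixer handles saturation more carefully,
    # for example by giving attitude priority over throttle.
    return [min(1.0, max(0.0, m)) for m in raw]


if __name__ == "__main__":
    print(mix(0.5, 0.0, 0.0, 0.0))  # pure throttle: all four motors equal
    print(mix(0.5, 0.1, 0.0, 0.0))  # roll right: left motors up, right motors down
    print(mix(0.5, 0.0, 0.0, 0.1))  # yaw: CW pair up, CCW pair down
```

Each of the roll, pitch, and yaw columns sums to zero, so an attitude command redistributes thrust between motors without changing the total, at least until an output hits a clamp.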
> **Rule**: A multirotor is an unstable, underactuated rigid body stabilized entirely in software. The hardware's job is to give that software fast, clean sensing and enough thrust margin to win. Spec the IMU and the thrust-to-weight before you spec anything pretty. This is also why a quad differs from the legged and wheeled robots covered elsewhere on this blog: a [mobile robot](/posts/mobile-robots-amr-agv-ultimate-guide/) can simply stop and sit there stably; a multirotor that stops controlling falls. The control loop never gets to rest. ## The airframe: size classes, layouts, materials The airframe is the skeleton — it sets the prop size, the spacing, the stiffness, and how much it weighs before you add a single gram of electronics. Multirotors are classed by **propeller diameter** and the matching frame **wheelbase** (the motor-to-motor diagonal), measured in inches by tradition even in metric shops, because props are sold in inches. ### Size classes | Class | Prop dia. | Wheelbase | Typical AUW | Typical use | |---|---|---|---|---| | Tinywhoop / micro | 31–40 mm (1.2–1.6") | 65–75 mm | 20–60 g | Indoor, sub-250 g toys | | Toothpick / 2–3" | 2–3" | 100–140 mm | 50–150 g | Indoor/outdoor light FPV | | 5" (the standard) | 5" | 210–250 mm | 350–700 g | FPV freestyle & racing | | 7" long-range | 7" | 300–320 mm | 600 g–1.2 kg | Long-range FPV, cruise | | 10" | 10" | ~450 mm | 1.5–2.5 kg | Cinematic, light mapping | | Cinelifter / heavy | 13–17" | 600–900 mm | 3–10 kg | Camera lifting, payload | | Enterprise / survey | 15–22"+ | 900 mm–1.5 m | 5–25+ kg | Mapping, agriculture, delivery | The 5-inch class is the de facto reference for FPV: a 5" prop, 2207-ish motor, 4S–6S pack, ~250 mm wheelbase, ~500–650 g AUW. Most parts, props, and tribal knowledge orbit this size. ### Layouts - **X (true X / wide-X / stretch-X)**: motors at the four corners, arms equidistant from center (true X) or stretched front-back for camera clearance. 
This is the standard for FPV and most quads. Symmetric, predictable, the camera sees forward over the props. - **+ (plus)**: one arm forward, one back, two sides. Largely obsolete on quads — the forward arm sits in the camera view and the dynamics are no better. You still see it on some research and legacy frames. - **H**: two parallel side rails connected by a center bridge. Common on cinematic and longer-range builds because the long center deck has room for a big camera/gimbal and the battery, and the rear is clear for an HD camera. Slightly heavier for a given stiffness than a clean X. - **Hex / octo**: six or eight motors, for redundancy (survive a motor/ESC failure) and lift. Heavy-lift and professional cinema/survey rigs go hexa- or octocopter so a single propulsion failure does not mean a crash. ### Materials and stiffness The arms and main plates on serious quads are **carbon fiber** — high stiffness-to-weight, and crucially, a stiff frame keeps the motors' vibration frequencies high and away from the control loop. A floppy arm resonates at low frequency, couples into the gyro, and wrecks your tune. Typical FPV frame plate thickness runs 2.5–4 mm for arms, 1.5–2 mm for top/bottom plates. Bigger frames go thicker or use carbon tube arms. > **Rule**: Frame stiffness is a control-loop spec, not a cosmetic one. A flexy or cracked arm shifts vibration into the gyro band and forces you to over-filter, which adds latency and softens your tune. Replace cracked arms; don't fly them. Cheaper or toy frames use injection-molded nylon/PA12 or glass-filled plastic — flexible (good for crash survival on micros) but too compliant for a tightly tuned larger quad. Aluminum shows up as standoffs and motor mounts, rarely as primary structure (heavy, and it rings). The trade is always the same: stiffer and lighter costs money (carbon, good layup), and flex buys crash resilience at the cost of tune quality. 
## BLDC motors for props: Kv, stator size, sizing

Drone propulsion motors are **outrunner BLDCs** — the can (with the magnets) spins around a fixed internal stator, the prop bolts to the can. Outrunner topology gives high torque at low-ish RPM in a short, flat package, which is exactly what swinging a prop wants. For the full theory of how these machines work — Kv vs Kt, pole counts, why continuous current is a thermal limit — read the [brushless DC motors guide](/posts/brushless-dc-motors-bldc-ultimate-guide/); here we focus on the prop-specific choices.

### Stator size: the displacement number

Drone motors are named by stator dimensions, not the can: a **2207** motor has a stator 22 mm in diameter and 7 mm tall. That four-digit number is the engine displacement of the drone world — bigger stator means more torque and more thermal mass (it can dump more heat before overheating). Common FPV sizes:

| Motor (stator) | Class | Typical Kv (6S unless noted) | Role |
|---|---|---|---|
| 0802–1103 | Tinywhoop/2" | 8000–19000 (1S–2S) | Micro |
| 1404–1507 | Toothpick/3" | 2700–4500 (4S) | Sub-250 g |
| 2004–2205 | 4–5" light | 1700–2750 | Light freestyle |
| 2207 | 5" standard | 1700–1950 | Freestyle/race |
| 2306–2406 | 5" | 1700–2400 | Race/freestyle |
| 2806–3110 | 7" | 850–1300 | Long-range |
| 4006–5010+ | 10"+ / heavy | 200–700 | Cinelifter, cargo |

Real motors in this space: **T-Motor** (F-series, Velox — the benchmark for FPV), **iFlight** (Xing, Xing2), and **Hobbywing** (XRotor) for the propulsion end; on big enterprise rigs T-Motor's MN/U-series dominate.

### Kv and voltage

Kv is unloaded RPM per volt. The unloaded top RPM is `Kv × V_pack`. A 1950 Kv motor on a fully charged 6S pack (25.2 V) spins ~49,000 RPM unloaded; bolt a 5" prop on and aerodynamic load pulls the actual top RPM down to perhaps 28,000–32,000 RPM. The selection logic:

- **High Kv + small prop**: spins fast, accelerates the prop quickly, snappy and responsive.
Pulls more current, runs hotter, less efficient. FPV racing/freestyle territory.
- **Low Kv + big prop**: spins slower, moves more air per rev, lower disc loading, far more efficient and quieter. Long-range, cinematic, heavy-lift territory.

The industry shifted FPV from 4S to **6S** around 2020 because higher voltage at the same power means lower current, so thinner wires, cooler ESCs, and less voltage sag. To keep the same prop RPM at 6S you simply drop Kv proportionally — a 2400 Kv/4S motor and a 1600 Kv/6S motor land at similar RPM (4 cells × 2400 ≈ 6 cells × 1600).

### Sizing a propulsion motor

Work from thrust, not from Kv. You need a per-motor max thrust such that all motors together give your target TWR:

```
thrust_per_motor_max = (AUW × TWR_target) / n_motors

Example: 600 g AUW quad, target TWR 4:1, 4 motors:
  thrust_per_motor_max = (0.600 kg × 4) / 4 = 0.600 kg = 600 g
  So each motor+prop combo must produce ≥ 600 g static thrust at full throttle.
```

Then pick a motor-prop combo whose **thrust-test data** (manufacturers publish these tables — thrust, current, power, efficiency per prop at each voltage) shows ≥ that thrust at your pack voltage, and check that the motor's continuous thermal rating tolerates your cruise current. Hover sits near `AUW/n_motors` (here 150 g/motor), so the motor spends most of its life at a small fraction of full throttle — which is good, because full-throttle current on a 2207 can be 30–40 A per motor.

> **Rule**: Pick motors from published thrust/current tables at *your* pack voltage and *your* prop, never from Kv alone. Kv tells you nothing about thrust until you specify the prop and the volts.

## Propellers: diameter, pitch, thrust, efficiency

The propeller is where electrical power becomes thrust, and it is the most under-respected component on the aircraft. A prop is specified by two numbers and a blade count, e.g. **5×4.3×3** = 5" diameter, 4.3" pitch, 3 blades.

- **Diameter** is how much air the disc sweeps.
Bigger diameter moves more air at lower velocity, which is fundamentally more efficient (lower *disc loading*, thrust per unit disc area). This is why a big slow prop sips power and a small fast prop guzzles it. - **Pitch** is the theoretical forward travel per revolution — how aggressively the blade bites the air. Higher pitch = more speed potential and more current draw per RPM; lower pitch = more responsive, easier on the motor, better low-speed thrust. - **Blade count**: 2 blades are most efficient (least induced drag, highest top speed); 3 blades are the FPV standard (more thrust and grip in maneuvers, smoother, slightly less efficient); 4–6 blades trade still more efficiency for grip and noise reduction in tight cinematic/indoor flying. Thrust scales roughly with diameter to the 4th power and pitch to the 1st, and with RPM squared — so diameter dominates. Doubling RPM quadruples thrust but raises power by roughly the cube of RPM, which is why throttle response feels so nonlinear and why hover sits low on the stick. ### Prop-motor-ESC matching This is the core integration problem of the whole aircraft. The three parts form a chain: 1. The **prop** sets how much torque the motor must produce at a given RPM, and therefore how much current it draws. 2. The **motor** must have the stator size (torque and thermal mass) and Kv to drive that prop at your voltage without overheating. 3. The **ESC** must be current-rated above the peak the motor pulls swinging that prop at full throttle. Mismatch any link and something fails: an over-pitched prop on an undersized motor cooks the motor and browns out the ESC; an under-pitched prop on a hot motor leaves performance on the table. Manufacturers' thrust-test tables are the source of truth — they list, for each prop, the thrust, current, electrical power, and efficiency (g/W) at each throttle step and voltage. The number to optimize is **efficiency in g/W** at your hover point. 
A well-matched 5" combo hovers around 7–10 g/W; a big low-disc-loading rig (15" props at low loading) can hit 12–18 g/W; a small overworked micro might be 4–6 g/W. More g/W at hover directly means more flight time. > **Rule**: Match the prop to the motor's torque, then size the ESC above the prop+motor's measured peak current with margin (typically pick an ESC rated ~1.25–1.5× the peak you expect to see). Verify with a thrust stand or trusted published data before maiden flight. ## ESCs: BLHeli_32, AM32, DShot, current rating The Electronic Speed Controller is the BLDC's three-phase inverter — it takes a throttle command from the flight controller and turns it into the commutated phase currents that spin the motor. Each motor needs one ESC; on a quad these are usually combined onto a single **4-in-1** board that stacks under the flight controller. For the inverter and commutation theory, see [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/). ### Trapezoidal, not FOC — and why Drone ESCs run **six-step (trapezoidal) commutation**, sensorless, using back-EMF zero-crossing to estimate rotor position. They do *not* run field-oriented control. This surprises people coming from robot joints, where FOC is the gold standard. The reason is simple: FOC's advantages — smooth torque at very low and zero speed, full torque while stalled, silence — are exactly the regime a prop never operates in. A prop is always spinning fast; back-EMF is strong and easy to track; and the load is a smooth aerodynamic torque, not a precise position hold. Six-step is simpler, cheaper, lower-latency, and entirely adequate. Spending silicon on FOC for a prop is solving a problem you don't have. ### Firmware: BLHeli_32 and AM32 The motor-side firmware running on the ESC's own MCU matters as much as the hardware: - **BLHeli_S** — older 8-bit firmware on simpler ESCs; supports DShot but limited; being phased out. 
Note: many "BLHeli_S" boards now run **Bluejay**, an open community firmware that adds bidirectional DShot/RPM telemetry to 8-bit hardware.
- **BLHeli_32** — the 32-bit standard for years, feature-rich (telemetry, configurable timing, current limiting). It is **closed-source and was effectively frozen** after the 2024 ownership and export-control disruption. Still flying everywhere, but no longer the future.
- **AM32** — the **open-source 32-bit firmware** that has become the default for new ESC designs in 2026. Runs on common STM32/AT32-class ESC MCUs, supports bidirectional DShot and telemetry, and is actively developed. If you are buying ESCs today, AM32 is the safe bet.

### DShot: the digital protocol

DShot replaced the old analog throttle signals (standard PWM, Oneshot, Multishot) and is the standard in 2026. It is a **digital, packetized** protocol — each frame is 16 bits (11 throttle + 1 telemetry request + 4-bit CRC checksum) sent at a fixed bit rate:

- **DShot150 / 300 / 600 / 1200** — the number is the bitrate in kbit/s. DShot600 is the common choice; DShot300 for longer signal wires or extra margin.
- **No calibration** — because it is digital, there is no min/max throttle endpoint to calibrate; the values are absolute.
- **Checksummed** — a corrupted frame is rejected, not acted on. Far more robust than analog levels that drift with noise.
- **Bidirectional DShot (DShot telemetry)** — the ESC sends **eRPM back to the flight controller** over the same wire. This feeds **RPM filtering**: the FC knows each motor's exact rotation frequency and places dynamic notch filters precisely on the motor's vibration harmonics in the gyro signal. This single feature transformed FPV tuning — it lets you filter the noise without the blanket low-pass filtering that used to add latency and softness.

### Current rating

ESCs are rated in continuous and burst amps **per motor** (a "4-in-1 50A" means 50 A per channel).
The rating is a thermal limit on the MOSFETs and is honest only with adequate cooling and airflow. For a 5" 6S build, 45–60 A per channel is typical; cinelifters and big rigs use 80 A+ ESCs or single ESCs per motor. Always rate the ESC above the peak current your prop-motor combo draws at full throttle, with margin — see the matching rule above. Undersized ESCs are a top cause of in-flight desyncs and burnouts. > **Rule**: For drone propulsion, trapezoidal/six-step ESCs with bidirectional DShot are correct; FOC is the wrong tool. Spend your engineering on filtering, current headroom, and cooling — not on commutation cleverness. ## Flight controllers: MCU, sensors, firmware, the loop The flight controller (FC) is the brain — the board running the stabilization loop. Physically it is an MCU plus an IMU plus a barometer plus a pile of UARTs, on a 20×20 mm, 25.5×25.5 mm, or 30.5×30.5 mm stack-standard board. ### The MCU FCs run **STM32** microcontrollers almost universally: - **F4 (STM32F405)** — the long-time workhorse, 168 MHz Cortex-M4F. Fine for 5" FPV at 4–8 kHz loops. Being superseded. - **F7 (STM32F722/745)** — 216 MHz, more headroom for filters and peripherals. - **H7 (STM32H743/H750)** — 400–480 MHz Cortex-M7. The current high-end for FPV (room for every filter and OSD feature) and the standard floor for serious PX4/ArduPilot autonomy boards, which need the compute for EKF, logging, and multiple sensor streams. Autonomy platforms standardize on the **Pixhawk** open hardware standard (the FMUv5/v6 spec), built by **Holybro** and others, pairing an H7 with redundant IMUs and a clean connector standard. The compute-heavy perception and planning usually run on a *companion computer* (an NVIDIA Jetson or similar SBC) alongside the FC, which sticks to the hard real-time stabilization — the classic MCU/SBC split discussed in [real-time control systems](/posts/real-time-control-systems-ultimate-guide/). 
### The control loop: rate → attitude → position The FC runs a nested cascade, fastest loop innermost: 1. **Rate (inner) loop** — reads the **gyro** (angular velocity), runs a PID to drive measured rate to commanded rate, outputs motor mix. This is the hard real-time loop, run at **1–8 kHz** (gyro sampled up to 8–32 kHz). It is what actually stabilizes the airframe. In acro mode, your sticks command rate directly — this loop is the whole flight experience. 2. **Attitude (middle) loop** — fuses gyro + accelerometer (the IMU) into an estimated **angle**, runs a PID to drive angle to commanded angle, and outputs a rate setpoint to the inner loop. Runs at hundreds of Hz. This is "angle/self-level" mode. 3. **Position (outer) loop** — fuses GPS, baro, optical flow, etc. into estimated **position and velocity**, and outputs an attitude setpoint. Runs at **10–100 Hz**. This is GPS position hold, altitude hold, return-to-home, waypoint missions. Each loop's output is the next loop's setpoint. The pattern — fast/simple/critical inside, slow/complex/tolerant outside — is the universal robot control hierarchy. 
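The cascade above can be sketched for a single axis as follows. This is a simplified illustration, not any firmware's implementation: it uses P-only controllers where real stacks run full PID with filtering, and the `CascadeAxis` class, its gains, its sign conventions, and the loop dividers are hypothetical values picked so that an 8 kHz rate tick yields a 500 Hz attitude loop and a 50 Hz position loop.

```python
from dataclasses import dataclass


def clamp(x, lo, hi):
    """Limit x to the range [lo, hi]."""
    return max(lo, min(hi, x))


@dataclass
class CascadeAxis:
    """Single-axis cascade: position -> angle setpoint -> rate setpoint -> torque.

    All gains, limits, and dividers are hypothetical placeholders.
    P-only for brevity; real firmwares run full PID with filtering.
    """
    kp_pos: float = 0.5     # m of position error -> rad of tilt setpoint
    kp_att: float = 6.0     # rad of angle error -> rad/s rate setpoint
    kp_rate: float = 0.05   # rad/s of rate error -> normalized torque command
    max_tilt: float = 0.35  # rad, limit on the commanded tilt
    att_div: int = 16       # attitude loop runs every 16th rate tick
    pos_div: int = 160      # position loop runs every 160th rate tick
    tick: int = 0
    angle_sp: float = 0.0
    rate_sp: float = 0.0

    def step(self, pos_err, angle, gyro_rate):
        """Run one rate-loop tick; slower loops update only on their divider."""
        if self.tick % self.pos_div == 0:   # outer loop: position -> angle setpoint
            self.angle_sp = clamp(self.kp_pos * pos_err, -self.max_tilt, self.max_tilt)
        if self.tick % self.att_div == 0:   # middle loop: angle -> rate setpoint
            self.rate_sp = self.kp_att * (self.angle_sp - angle)
        self.tick += 1
        # Inner loop runs every tick: rate error -> torque command for the mixer.
        return self.kp_rate * (self.rate_sp - gyro_rate)


if __name__ == "__main__":
    axis = CascadeAxis()
    print(axis.step(pos_err=1.0, angle=0.0, gyro_rate=0.0))
```

Between outer-loop updates the setpoints are simply held, which is why the slow loops can tolerate jitter while the inner rate loop cannot.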
### The three firmware camps

| | Betaflight | PX4 | ArduPilot |
|---|---|---|---|
| Primary use | FPV racing/freestyle/acro | Autonomous, research, commercial | Autonomous, commercial, all-vehicle |
| Control focus | Razor-tuned rate loop, lowest latency | Full position/mission control | Full position/mission control |
| Position hold / GPS | Basic (GPS rescue, position hold) | Yes, full | Yes, full, very mature |
| Mission planning | No (it's a manual-flight firmware) | Yes (QGroundControl) | Yes (Mission Planner / QGC) |
| Vehicle types | Multirotor (some wing) | Multi, VTOL, fixed-wing, rover | Multi, VTOL, plane, rover, boat, sub |
| Typical MCU | F4/F7/H7 | H7 (Pixhawk standard) | H7 (Pixhawk standard) |
| License | GPL, open | BSD, open | GPL, open |
| Tuning vibe | Hands-on, latency-obsessed | Engineered, modular (uORB/EKF2) | Mature, feature-dense, huge param set |

Choose by mission, not by fashion. **Betaflight** for anything you fly line-of-sight or FPV by hand where stick-to-prop latency and snap are everything. **PX4** for autonomous and research work, VTOL, and a clean modular codebase. **ArduPilot** for the most mature autonomy feature set across the widest vehicle range — it will fly a quad, a plane, a VTOL, a boat, and a submarine off variations of the same stack. PX4 vs ArduPilot is largely a culture/tooling preference; both are excellent and both run on Pixhawk-class hardware.

> **Rule**: Match firmware to mission. Don't run PX4 on a 5" race quad (you'll fight latency and complexity) and don't run Betaflight on a survey drone (it has no mission planner). The hardware can be similar; the firmware encodes the intent.

## The sensor suite and sensor fusion

A multirotor knows where it is and which way is up only because of its sensors and the math that fuses them. For the broader treatment of each sensor type, see [robot sensors](/posts/robot-sensors-ultimate-guide/); here's the drone-specific suite.
### The IMU (gyro + accelerometer) The **gyroscope** measures angular velocity on three axes; the **accelerometer** measures linear acceleration (including gravity) on three axes. Together they're a 6-axis IMU, and they are the heart of the FC. Common parts in 2026: **InvenSense/TDK ICM-42688-P** and **Bosch BMI270** — both low-noise, high-rate MEMS 6-axis IMUs. High-end Pixhawk boards carry *redundant* IMUs (two or three) for fault tolerance and voting. The gyro feeds the rate loop and is fast and low-latency but **drifts** (integrating it gives a slowly wandering angle). The accelerometer gives a long-term gravity reference (it knows where "down" is when the vehicle isn't accelerating) but is **noisy** and wrong during maneuvers. Each covers the other's weakness — that's the whole point of fusion. **IMU mounting is a control spec.** Motor and prop vibration at hundreds to thousands of Hz couples into the gyro and corrupts the rate loop. Mitigations: soft-mount the FC on rubber gummies, keep the frame stiff, and apply **RPM filtering** (dynamic notch filters placed on each motor's exact eRPM, fed by bidirectional DShot telemetry). Get this wrong and you over-filter, adding latency, hot motors, and a mushy tune. ### Barometer, magnetometer - **Barometer** (e.g. DPS310, BMP388/390) measures air pressure → altitude. Resolution is tens of centimeters; it drifts with weather and is disturbed by prop wash and canopy pressure, so it's fused, not trusted alone. It's the primary altitude source when GPS altitude is poor. - **Magnetometer** (compass, e.g. QMC5883/IST8310) measures the Earth's magnetic field → heading. Essential for absolute yaw on GPS aircraft. Notoriously corrupted by motor currents and ferrous metal, so it's mounted away from power wiring (often up on the GPS mast) and must be calibrated. FPV quads in acro often skip it entirely — gyro yaw is enough when you're flying manually. 
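The gyro/accelerometer trade-off described in the IMU subsection above can be illustrated with a complementary filter, a much simpler relative of the EKF that full autopilots use. This is a sketch under stated assumptions: a single axis (pitch), an arbitrary sign convention for the accelerometer axes, and a hypothetical blend factor `alpha`.

```python
import math


def complementary_pitch(pitch_prev, gyro_rate, ax, az, dt, alpha=0.98):
    """Blend integrated gyro rate (fast, drifts) with accel tilt (noisy, no drift).

    Axis and sign conventions are assumptions; alpha is a hypothetical tuning value.
    """
    gyro_pitch = pitch_prev + gyro_rate * dt   # short-term: integrate the gyro
    accel_pitch = math.atan2(-ax, az)          # long-term: direction of gravity
    return alpha * gyro_pitch + (1.0 - alpha) * accel_pitch


if __name__ == "__main__":
    # Level, stationary vehicle whose gyro reports a constant 0.01 rad/s bias.
    dt, bias = 0.001, 0.01
    fused = integrated = 0.0
    for _ in range(10_000):                    # 10 s of samples at 1 kHz
        integrated += bias * dt                # gyro alone: the error keeps growing
        fused = complementary_pitch(fused, bias, 0.0, 9.81, dt)
    print(f"gyro only: {integrated:.4f} rad, fused: {fused:.5f} rad")
```

With the gyro alone the bias integrates into an ever-growing angle error; the small accelerometer correction on every step keeps the fused estimate bounded. An EKF does the same job with weights derived from modeled sensor noise instead of a fixed `alpha`.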
### GPS and RTK - **GNSS/GPS** (**u-blox M8/M9/M10** modules are the standard) gives absolute position to roughly 1–3 m horizontally with a good fix. Needed for position hold, return-to-home, and waypoint missions. - **RTK (Real-Time Kinematic)** uses carrier-phase measurements plus corrections from a base station (or a network) to reach **centimeter-level** positioning — u-blox **F9P**-class receivers are the workhorse. RTK is what mapping and survey drones use to get sub-decimeter geolocation accuracy without dense ground control points. Two RTK receivers on one airframe also give a precise GPS-derived heading (moving-baseline), avoiding compass trouble entirely on big rigs. ### Optical flow, lidar/ToF - **Optical flow** — a downward camera tracks ground texture motion to estimate horizontal velocity, enabling **position hold indoors or anywhere GPS is denied**. Needs a textured surface and adequate light. - **Lidar / Time-of-Flight rangefinders** — a downward laser/ToF gives precise altitude above ground (centimeter-class, GPS-independent) for low-altitude work, terrain following, and precision landing. Forward-facing ToF/radar/stereo enable obstacle avoidance. For the depth-sensing side, see [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/). ### Sensor fusion: the EKF No single sensor gives a clean state. The gyro is fast but drifts; the accel knows down but is noisy; GPS is absolute but slow and jumpy; the baro drifts; the mag is noisy. The **Extended Kalman Filter (EKF)** — PX4's EKF2, ArduPilot's EKF3, Betaflight's lighter complementary/Kalman blend — fuses all of them into one continuously updated estimate of attitude, velocity, and position, weighting each measurement by its modeled trust (its covariance). The gyro propagates the state forward at high rate; the accel/mag/GPS/baro/flow corrections pull it back toward truth. > **Rule**: Position hold is not a sensor; it is a fused state estimate. 
If the EKF's inputs disagree (a bad compass, a GPS glitch, a vibrating IMU), the estimate is wrong and the aircraft will fight you or fly away — "toilet bowling" on a bad compass is the classic symptom. Trust the fusion only as much as you trust its worst input. ## Power: LiPo chemistry, C-rating, voltage sag, packs The power system has to deliver brutal peak current — a 5" quad can pull 100+ A in a hard punch-out — without sagging the bus voltage into a brownout that resets the FC. For battery fundamentals, see [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/). ### LiPo vs Li-ion - **LiPo (lithium polymer)** is the multirotor default. High discharge rate, high power density per gram, flat-ish discharge curve, cheap. Nominal **3.7 V/cell**, 4.2 V full, ~3.5 V the practical floor under load. The cost is cycle life (a few hundred cycles), fragility, and fire risk if punctured or overcharged. - **Li-ion** (cylindrical 18650/21700 cells, e.g. Molicel P42A/P45B, Samsung 50S) wins on **energy density (Wh/kg)** but has a lower continuous discharge rate. You build Li-ion packs for **long-endurance** flight — 7" long-range cruisers, mapping, survey — where you cruise at modest current and want maximum Wh per gram, not for hard acro. ### S and C ratings - **S = cells in series**, setting voltage. **4S** = 14.8 V nominal, **6S** = 22.2 V nominal (the FPV standard now), big rigs run **12S** and up. Parallel cells (**P**) multiply capacity/current: a "6S2P" Li-ion pack is 6 in series, 2 in parallel. - **C-rating** is the claimed max continuous discharge as a multiple of capacity. A 1300 mAh **100C** pack claims 130 A continuous (1.3 Ah × 100). Treat published C-ratings as **optimistic marketing** — the honest test is measured voltage sag under your actual load. ### Voltage sag: the real-world spec Every pack has internal resistance. Under load, terminal voltage drops by `I × R_internal` — that's **sag**. 
A tired or under-rated pack sags so much under a punch-out that the bus drops below the FC's brownout threshold and it resets mid-air — instant crash. Sag also means your "6S" pack delivers far less than 25.2 V when it matters. Symptoms of an undersized pack: heavy sag, hot pack after landing, "puffed" cells.

> **Rule**: Size the pack by measured voltage sag under your worst-case current, not by the C-rating on the label. If the pack is hot or puffed after a flight, it's over-stressed — go up in C-rating or capacity, or down in current draw. A pack that sags below your FC's brownout voltage is a crash waiting to happen.

Pick capacity to balance energy against weight: more mAh means more flight time *until* the pack's own weight dominates and TWR drops, at which point you're carrying battery to carry battery. For a 5" quad, 1100–1500 mAh 6S is the freestyle sweet spot; long-range 7" runs 2500–6000 mAh Li-ion. Always land at ~3.5 V/cell under load (≈3.7–3.8 V resting) — running a LiPo flat kills it fast.

## Thrust-to-weight and hover throttle

This is the sizing math that decides whether a build flies well. Two numbers: thrust-to-weight ratio (TWR) and hover throttle.

**Thrust-to-weight ratio** is total max static thrust (all motors at full throttle) divided by all-up weight:

```
TWR = total_max_thrust / AUW

Example: 4 motors × 1200 g max thrust each = 4800 g total thrust.
         AUW (frame + electronics + battery + camera) = 650 g.
         TWR = 4800 / 650 = 7.4 : 1
```

What TWR you want:

- **< 1.5 : 1** — barely flies; sluggish; no control authority margin; only acceptable on heavy-lift rigs you fly gently and never need to fight a gust.
- **2 : 1** — minimum for stable, controllable flight with margin. A good target for cinematic and enterprise platforms.
- **4 : 1 to 8 : 1** — FPV freestyle and racing. The huge margin gives instant response and the ability to recover from any attitude.
- **> 10 : 1** — race-tuned screamers; uncontrollable for beginners, pure speed.
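The TWR arithmetic above is easy to script when comparing motor and prop options. A minimal sketch, with hypothetical function names, taking thrust and weight in grams:

```python
def thrust_to_weight(max_thrust_per_motor_g, n_motors, auw_g):
    """TWR = total max static thrust / all-up weight (both in grams)."""
    return (max_thrust_per_motor_g * n_motors) / auw_g


def required_thrust_per_motor(auw_g, twr_target, n_motors):
    """Per-motor max thrust (grams) needed to reach a target TWR."""
    return auw_g * twr_target / n_motors


if __name__ == "__main__":
    # Worked example from the text: 4 x 1200 g of thrust on a 650 g quad.
    twr = thrust_to_weight(1200, 4, 650)
    print(f"TWR = {twr:.1f}:1")
    # Inverse question: what does each motor need for a 2:1 floor?
    need = required_thrust_per_motor(650, 2.0, 4)
    print(f"2:1 needs {need:.0f} g per motor")
```

The inputs should come from thrust-stand or published thrust-table data at your pack voltage; the script only does the bookkeeping.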
**Hover throttle** is where the throttle stick sits to hold a stable hover. It follows from the fraction of full thrust needed just to cancel gravity:

```
hover_thrust_fraction ≈ AUW / total_max_thrust = 1 / TWR

For TWR 7.4:1:  hover ≈ 1/7.4 ≈ 0.135  → ~14% of max thrust
For TWR 2:1:    hover ≈ 1/2   = 0.50   → ~50% of max thrust
```

Thrust fraction and stick position are not the same number. Because thrust scales roughly with the square of RPM (and RPM roughly with throttle on these systems), thrust is very nonlinear in throttle, so a TWR-7 quad doesn't hover at 14% of *stick*. Under a pure square-law model it hovers near √0.135 ≈ 37% stick: higher than 14%, but still well under half. The principle holds: **higher TWR → lower hover throttle → more control authority above hover.**

> **Rule**: Target hover at or below ~50% throttle (TWR ≥ 2:1). If you hover near full throttle, you have almost no authority left to fight wind or maneuver: the control loop saturates and the aircraft falls. Add thrust margin before you add anything else.

## Flight-time estimation

Flight time is set by how much energy you carry and how fast you burn it in hover (where most flights spend most of their time):

```
1) Pack energy:
   E_Wh = capacity_Ah × pack_nominal_voltage
   e.g. 1.3 Ah × 22.2 V (6S) = 28.9 Wh

2) Hover power (the dominant term):
   Read it off the motor/prop thrust table at hover thrust, OR estimate:
   P_hover_W ≈ hover_thrust_grams / (g_per_W at hover)
   where hover_thrust_grams equals AUW in grams.
   e.g. 650 g hover thrust at 8 g/W → 650/8 ≈ 81 W

3) Flight time (with usable fraction, since you don't fly to 0%):
   t_min = (E_Wh × usable_fraction) / P_hover_W × 60
   e.g. (28.9 Wh × 0.80) / 81 W × 60 ≈ 17 minutes hovering
```

Reality is lower than the hover estimate for FPV (you're rarely hovering, and acro burns far more) and close to it for a steady cinematic platform. Key levers, in order of impact:

- **Lower disc loading** (bigger props, lower Kv, more efficient g/W at hover) is the biggest sustainable win. Long-range 7" builds fly 20–40+ minutes precisely because they hover at high g/W.
- **Higher TWR margin** so you cruise at low throttle, in the prop's efficient regime. - **More pack energy** — but with diminishing returns: past the point where pack weight dominates AUW, adding capacity adds weight that needs more power to lift, and flight time plateaus then falls. - **Lower AUW** everywhere else. Typical numbers: 5" freestyle 4–6 min hard / 7–9 min cruise; 7" long-range 20–40 min; cinematic 10" 15–25 min; large enterprise survey 30–55 min on Li-ion. ## Payloads and gimbals Anything you carry — camera, gimbal, lidar, sprayer, delivery box — is payload, and it eats directly into your thrust margin and flight time. Budget it into AUW from the start, not as an afterthought. A **gimbal** is a motorized 2- or 3-axis (pitch/roll/yaw) stabilized mount that isolates the camera from the airframe's vibration and attitude changes, giving smooth footage. It uses low-Kv **gimbal BLDC motors** run in **FOC** (here FOC *is* the right tool — these motors hold precise position at near-zero speed, exactly the regime where FOC shines, unlike props) with high-resolution encoders, driven by a dedicated gimbal controller with its own IMU. A 3-axis gimbal plus camera on a cinematic rig is a meaningful payload (hundreds of grams to a kilo-plus), which is why camera drones run big low-disc-loading props and 2:1-ish TWR rather than the 7:1 of a featherweight racer. For enterprise work the payload is often a survey camera, multispectral sensor, lidar unit, or RTK-tagged mapping camera — heavy, power-hungry, and the entire reason the aircraft exists. The propulsion is sized around the payload, not the other way around. > **Rule**: Payload is a TWR and endurance tax. Add it to AUW, re-check that you still hover ≤ 50% throttle, and re-run the flight-time math. A camera that drops your TWR below 2:1 means you need a bigger aircraft, not a braver pilot. 
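That re-check can be sketched in a few lines of C using the flight-time formulas from the previous section. The struct and function names are invented for this example. It also assumes hover efficiency (g/W) stays constant as weight goes up, which is optimistic: for a real build, read the g/W figure off a thrust table at the new hover thrust.

```c
/* One build, described by the numbers the sizing math needs. */
typedef struct {
    double auw_g;            /* all-up weight, grams */
    double max_thrust_g;     /* total static thrust of all motors, grams */
    double pack_ah;          /* pack capacity, Ah */
    double pack_v;           /* pack nominal voltage, V */
    double hover_g_per_w;    /* hover efficiency, g/W (assumed constant) */
    double usable_fraction;  /* share of pack energy you actually use, e.g. 0.80 */
} Build;

double build_twr(const Build *b)      { return b->max_thrust_g / b->auw_g; }
double hover_fraction(const Build *b) { return b->auw_g / b->max_thrust_g; }

/* t_min = (E_Wh x usable_fraction) / P_hover_W x 60 */
double hover_minutes(const Build *b) {
    double energy_wh = b->pack_ah * b->pack_v;
    double hover_w   = b->auw_g / b->hover_g_per_w;
    return energy_wh * b->usable_fraction / hover_w * 60.0;
}

/* Return a copy of the build with the payload added to AUW. */
Build with_payload(Build b, double payload_g) {
    b.auw_g += payload_g;
    return b;
}

/* The rule above: still hovering at or below 50% of max thrust (TWR >= 2:1)? */
int payload_ok(const Build *b) {
    return hover_fraction(b) <= 0.5;
}
```

For the 650 g example quad (4800 g thrust, 1.3 Ah 6S, 8 g/W, 0.80 usable), `with_payload(b, 150.0)` drops TWR from about 7.4 to 6.0 and the hover estimate from about 17 minutes to about 14 under these assumptions.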
## Control modes: acro, angle, position hold

The three flight modes map exactly to the three control loops, in order of how much of the stack is active:

- **Acro / rate mode**: only the **inner rate loop** runs. Your sticks command angular *velocity*; release the sticks and the quad holds its current attitude (it does *not* self-level). This is what FPV freestyle and racing fly: maximum agility, no limits, full inversions, and it depends only on the gyro. It is also the hardest to fly and the purest expression of the machine.
- **Angle / self-level / horizon mode**: the **attitude loop** is active on top. Sticks command a target *angle*; center the sticks and the quad levels itself. Uses the fused IMU (gyro + accel). This is "stabilized" mode, what beginners and most camera flying use. In angle mode there's a max tilt limit, so you can't flip; horizon mode self-levels near center stick but still lets you flip at full deflection.
- **Position / GPS hold (loiter, altitude hold)**: the full **position loop** is active. Release the sticks and the aircraft holds its 3D position against wind, using fused GPS/baro/flow. This is the foundation of autonomous flight: position hold, return-to-home, waypoint missions, follow-me. It needs a good fused state estimate; a bad compass or GPS makes it dangerous.

The progression is the loop hierarchy made visible: acro is the bare rate loop, angle adds attitude, position adds the outer loop. More automation = more sensors trusted = more ways to fail if a sensor lies, which is the trade you accept for hands-off flight.
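The loop hierarchy can be sketched as a single-axis cascade in which the flight mode decides how many outer loops run. This is an illustration only: the gains, limits, and names are invented, each loop is proportional-only, and real firmware (Betaflight, PX4, ArduPilot) uses full PID with filtering and runs the loops at different rates.

```c
typedef enum { MODE_ACRO, MODE_ANGLE, MODE_POSITION } FlightMode;

typedef struct {
    double pos_m;      /* fused position estimate along one axis, m */
    double angle_deg;  /* fused attitude estimate, deg */
    double rate_dps;   /* gyro rate, deg/s */
} State;

/* Hypothetical limits and P gains, chosen only for illustration. */
#define MAX_RATE_DPS  600.0
#define MAX_ANGLE_DEG  45.0
#define KP_POS         10.0   /* deg of tilt per metre of position error */
#define KP_ANGLE        5.0   /* deg/s per deg of angle error */
#define KP_RATE         0.002 /* axis command per deg/s of rate error */

static double clampd(double x, double lo, double hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* stick is -1..+1; pos_target_m is only used in position mode.
 * Returns the axis command that would go to the motor mixer. */
double control_step(FlightMode mode, double stick, double pos_target_m,
                    const State *s) {
    double angle_sp, rate_sp;

    if (mode == MODE_POSITION) {
        /* Outer loop: position error -> attitude setpoint. */
        angle_sp = clampd(KP_POS * (pos_target_m - s->pos_m),
                          -MAX_ANGLE_DEG, MAX_ANGLE_DEG);
    } else {
        /* Angle mode: the stick commands a target angle. */
        angle_sp = stick * MAX_ANGLE_DEG;
    }

    if (mode == MODE_ACRO) {
        /* Acro: the stick commands angular velocity directly. */
        rate_sp = stick * MAX_RATE_DPS;
    } else {
        /* Middle loop: attitude error -> rate setpoint. */
        rate_sp = clampd(KP_ANGLE * (angle_sp - s->angle_deg),
                         -MAX_RATE_DPS, MAX_RATE_DPS);
    }

    /* Inner rate loop: runs in every mode. */
    return KP_RATE * (rate_sp - s->rate_dps);
}
```

With the quad tilted 10° and sticks centered, acro mode returns zero (it holds the attitude), while angle mode returns a negative command that levels it. That difference is the whole distinction between the two modes.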
## Drone classes and use cases

| Class | Frame/props | Firmware | Power | Endurance | Notes |
|---|---|---|---|---|---|
| FPV racing | 5", X, ultralight | Betaflight | 6S LiPo 1100–1300 mAh | 3–5 min | TWR 8–12:1, latency-obsessed |
| FPV freestyle | 5", X | Betaflight | 6S LiPo 1300–1500 mAh | 5–8 min | TWR 4–7:1, durable |
| Cinematic FPV | 5–8", X/H + gimbal | Betaflight | 6S LiPo | 6–12 min | HD cam/gimbal payload |
| Long-range FPV | 7", X | Betaflight/iNav | 6S Li-ion | 20–40 min | Low disc loading, GPS rescue |
| Camera/prosumer | 8–13", X/H | proprietary/PX4 | 6S+ Li-ion | 20–45 min | 3-axis gimbal, obstacle avoidance |
| Enterprise mapping | 15–22", hex/octo | PX4/ArduPilot | 12S+ Li-ion | 30–55 min | RTK GPS, survey payload |
| Heavy-lift/cargo | 17"+, hex/octo | PX4/ArduPilot | 12–14S+ | varies w/ load | Redundancy, big payload |
| Fixed-wing/VTOL | wing + lift rotors | PX4/ArduPilot | Li-ion | 45 min–hours | Cruise efficiency of a wing |

Two classes deserve a note beyond multirotors:

- **Fixed-wing** UAVs trade hover for efficiency: a wing generates lift aerodynamically, so it cruises at a fraction of a multirotor's power and flies for hours. The cost is it can't hover or take off vertically. ArduPilot and PX4 fly these with the same FC hardware.
- **VTOL** (vertical takeoff and landing) is the hybrid: lift rotors for vertical takeoff/landing/hover plus a wing and pusher motor for efficient forward cruise. You get a wing's endurance and a multirotor's launch flexibility, at the cost of mechanical and control complexity (the transition between hover and forward flight is the hard part, handled by PX4/ArduPilot's VTOL modes). This is where most serious long-range mapping and delivery work is heading in 2026.

## Regulatory note: Remote ID and weight categories

Hardware choices in 2026 are shaped by regulation as much as physics.

- **Remote ID (RID)** is effectively mandatory for most drones in the US (FAA) and EU.
The drone broadcasts its ID, position, and operator location over Wi-Fi/Bluetooth — either via a built-in standard RID module or a bolt-on broadcast module. Plan for a RID module in your weight and power budget unless you're flying a sub-class exempt aircraft. - **The sub-250 g threshold** is the most consequential number in consumer drone regulation. In many jurisdictions, aircraft **under 250 g** face lighter registration and (in some cases) RID requirements. That single line in the rules is why a whole class of drones is engineered to land at exactly **249 g** AUW — it's a regulatory cliff, not an engineering one. - **Weight/risk categories** (the EU's Open category A1/A2/A3, the FAA's operational rules) scale requirements with mass and proximity to people. Heavier and BVLOS (beyond visual line of sight) operations require more: certified hardware, redundancy, RID, sometimes type certification. > **Rule**: Check the current rules for *your* jurisdiction and weight class before you build, and budget the RID module's weight and power into AUW. The regulatory category often dictates the size class more than the mission does. This is the aviation-grade end of the [functional safety](/posts/robot-safety-functional-safety-ultimate-guide/) story — redundancy and fail-safe behavior aren't optional on a 10 kg machine flying over people. ## Selecting a UAV platform Put it together into a repeatable selection process: 1. **Define the mission and payload first.** FPV freestyle, cinematic, long-range cruise, mapping, delivery? What sensor/camera must it carry, and how heavy is it? This sets everything downstream. 2. **Pick the size class** from the payload and mission (the size table). Payload + endurance usually dictate prop diameter and motor count. 3. **Check the regulatory category** for that weight and operation, and budget RID. The sub-250 g cliff may push the whole design. 4. **Set the AUW budget and target TWR** (≥ 2:1 general, 4:1+ for FPV). 
Confirm hover lands ≤ 50% throttle. 5. **Pick the prop-motor-ESC trio together** against your pack voltage and per-motor thrust target, using published thrust/current tables. Verify ESC current headroom. 6. **Choose the battery** by chemistry (LiPo for power, Li-ion for endurance), S-count for voltage, and capacity for the energy/weight balance — then validate by measured voltage sag, not C-rating. 7. **Choose the FC and firmware by mission**: Betaflight for manual/FPV, PX4 or ArduPilot for autonomy, on appropriately-sized STM32 (H7 for autonomy or feature-heavy FPV). 8. **Spec the sensor suite for the control modes you need**: IMU always (and mount it well); add baro for altitude, mag + GPS (or RTK) for position/missions, optical flow/ToF for GPS-denied or precision landing. 9. **Run the flight-time math** and check it meets the mission. If not, lower disc loading or AUW before adding battery. 10. **Validate before you trust it**: bench-test thrust and current, check IMU/vibration after first hover, confirm fail-safes (low battery, RC loss, RTH) actually work. Do this in order and the aircraft flies as designed. Skip the TWR and prop-matching steps and you'll spend the maiden flight picking carbon out of the grass. ## Frequently asked questions **Why does a quadcopter need both clockwise and counter-clockwise propellers?** To cancel reaction torque. Each spinning prop pushes back on the airframe with a torque opposite to its own spin. If all four spun the same way, the airframe would spin the other way uncontrollably. Two CW and two CCW props cancel that torque in hover, and *yaw* is produced by deliberately unbalancing them. This is also why you must install props in the correct CW/CCW positions, or the quad flips on takeoff. **What thrust-to-weight ratio do I need?** At least 2:1 for stable, controllable flight with margin; 4:1 to 8:1 for FPV freestyle/racing; around 1.5:1 minimum for a heavy platform you fly gently. 
The practical test: you should hover at or below ~50% throttle. If you hover near full throttle, the control loop has no authority left to fight wind and you'll crash in any disturbance. **How do I choose motor Kv?** By the prop and the pack voltage, working from thrust tables. Kv × pack voltage is unloaded RPM; the prop pulls actual RPM down. Lower Kv with bigger props for efficiency and endurance (long-range, cinematic, heavy-lift); higher Kv with smaller props for response (racing/freestyle). On 6S, ~1700–1950 Kv is the 5" standard; ~850–1300 Kv suits 7" long-range. Never pick Kv without specifying the prop and the volts. **What is DShot and why is it better than PWM?** DShot is a digital, packetized ESC protocol that sends a 16-bit checksummed throttle frame at a fixed bitrate (DShot300/600 are common). Versus analog PWM it needs no endpoint calibration, rejects corrupted frames via CRC, and — crucially — bidirectional DShot sends each motor's eRPM back to the flight controller, enabling precise RPM-based notch filtering of motor vibration. That filtering transformed FPV tuning by killing noise without adding blanket-filter latency. **Do drone ESCs use FOC?** No. Drone propulsion ESCs run six-step (trapezoidal) sensorless commutation. FOC's advantages — smooth torque at zero and low speed, full stall torque, silence — apply to a regime a prop never operates in (a prop always spins fast). Six-step is simpler, cheaper, lower-latency, and fully adequate for props. FOC *is* used in drone *gimbals*, where the motors hold precise position at near-zero speed. **Betaflight, PX4, or ArduPilot — which should I use?** Match firmware to mission. Betaflight for manual line-of-sight and FPV flying where stick-to-prop latency and agility are everything (no mission planner). PX4 for autonomous and research work with a clean modular codebase, VTOL, and commercial use. 
ArduPilot for the most mature, feature-dense autonomy across the widest vehicle range (multi, plane, VTOL, rover, boat, sub). PX4 vs ArduPilot is mostly a tooling/culture preference; both run on Pixhawk-class H7 hardware. **What is the rate/attitude/position loop hierarchy?** A nested cascade. The inner **rate loop** (gyro → angular velocity PID) runs at 1–8 kHz and actually stabilizes the airframe; it's all that's active in acro mode. The **attitude loop** (fused IMU → angle PID) wraps it for self-level/angle mode. The **position loop** (fused GPS/baro/flow → attitude setpoint) at 10–100 Hz wraps that for GPS hold and missions. Each loop's output is the next inner loop's setpoint; fast/critical inside, slow/tolerant outside. **Why do I need an EKF — can't I just read the GPS?** No single sensor is reliable alone: the gyro is fast but drifts, the accelerometer knows "down" but is noisy under acceleration, GPS is absolute but slow and jumpy, the baro and mag drift and get disturbed. The Extended Kalman Filter fuses them all into one continuously-updated state estimate, weighting each by its trustworthiness. Position hold is a *fused estimate*, not a sensor reading — and it's only as good as its worst input (a bad compass causes the classic "toilet bowl" fly-away). **LiPo or Li-ion?** LiPo for high discharge and power density per gram — the default for anything that punches out or does acro (FPV, racing, freestyle). Li-ion (21700 cells) for energy density and endurance where you cruise at modest current — long-range, mapping, survey. Don't try to hard-acro a Li-ion pack (it can't deliver the peak current); don't expect LiPo to match Li-ion's Wh/kg for endurance. **What does the C-rating mean and can I trust it?** C-rating is the claimed continuous discharge as a multiple of capacity (a 1300 mAh 100C pack claims 130 A). Treat it as optimistic marketing. 
The honest spec is measured voltage sag under your actual worst-case current — if the pack sags toward your FC's brownout voltage, or comes back hot or puffed, it's under-rated for your build regardless of the number on the label. **How do I estimate flight time?** Pack energy (Ah × nominal voltage = Wh) times a usable fraction (~0.8), divided by hover power in watts, times 60 for minutes. Hover power you read off the motor/prop thrust table at hover thrust, or estimate from g/W efficiency. The biggest sustainable lever is lower disc loading (bigger, slower, more efficient props), then cruising at low throttle from a high TWR margin. Adding battery has diminishing returns once pack weight dominates AUW. **Why does sub-250 g matter so much?** It's a regulatory cliff. In many jurisdictions, drones under 250 g get lighter registration and (sometimes) Remote ID requirements. That single rule is why a whole class of consumer drones is engineered to land at exactly 249 g all-up weight — the limit is legal, not aerodynamic. Above it, plan for registration and an RID module in your weight and power budget. **Why is the IMU mount considered a control component?** Because motor and prop vibration (hundreds to thousands of Hz) couples through the frame into the gyro and corrupts the rate loop. A floppy frame or a hard-mounted FC pushes that vibration into the gyro's measurement band, forcing heavy filtering that adds latency and softens the tune, runs the motors hot, and wastes power. Soft-mounting the FC, keeping the frame stiff, and using RPM filtering (from DShot telemetry) is the difference between a clean tune and an oscillating mess. ## Changelog - **2026-06-07** — Initial publication. 
--- # Rotary Encoders for Robotics: Incremental, Absolute & Resolvers — The Ultimate Guide URL: https://blog.robo2u.com/posts/encoders-ultimate-guide/ Published: 2026-06-06 Updated: 2026-06-20 Tags: encoders, rotary-encoder, absolute-encoder, incremental-encoder, resolver, quadrature, position-feedback, robotics-hardware, guide Reading time: 37 min > An engineer-grade guide to rotary encoders for robotics: incremental quadrature, single/multi-turn absolute, resolvers, optical vs magnetic vs inductive sensing, BiSS-C/SSI/EnDat interfaces, accuracy vs resolution, and real-product specs. An encoder is the sensor that tells the controller where the shaft is. That is the whole job. Everything fancy you do with a motor — torque control, smooth velocity profiles, holding a position to a few arc-seconds, commutating a brushless motor — rides on knowing the angle of the rotor and the load. Take the encoder away and a servo collapses back into an open-loop motor that guesses. Encoders are also where a surprising amount of robot money and a surprising amount of robot misery live. They are the component most likely to be mis-specced (resolution confused with accuracy), most likely to fail in the field for boring reasons (EMI on a long cable, a cracked solder joint, condensation on an optical disc), and most likely to be the silent ceiling on your control performance. You can have the best FOC firmware in the world and still get a buzzing, limit-cycling joint because the feedback device quantizes velocity into garbage. **The take**: Resolution is not accuracy, and confusing the two is the single most common encoder mistake in robotics. A 14-bit magnetic encoder advertises 16,384 counts per turn — about 79 arc-seconds per count — but its real angular *accuracy* might be ±0.3° to ±0.5° (1,000–1,800 arc-seconds) once you include nonlinearity and mounting eccentricity. The counts are precise; the angle is not. 
Spec the encoder on the number that matches your control need: resolution for smooth velocity and low quantization noise, accuracy for absolute pointing and gear-train compensation, repeatability for return-to-home. Then pick the *interface* (quadrature, BiSS-C, EnDat, SSI) and the *sensing technology* (optical, magnetic, inductive, capacitive, resolver) that survive your environment. Companion reading: [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), [servo motors](/posts/servo-motors-ultimate-guide/), [brushless DC motors (BLDC)](/posts/brushless-dc-motors-bldc-ultimate-guide/), [gearboxes: harmonic & cycloidal](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), and [robot sensors](/posts/robot-sensors-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why position feedback is the foundation of motion control](#foundation) 3. [Incremental encoders: quadrature, PPR vs CPR, and the index pulse](#incremental) 4. [Absolute encoders: single-turn, multi-turn, and no homing](#absolute) 5. [Sensing technologies: optical, magnetic, capacitive, inductive](#sensing) 6. [Resolvers: the rugged analog veteran](#resolvers) 7. [The numbers that matter: resolution, accuracy, repeatability, latency](#numbers) 8. [Digital interfaces: quadrature, SSI, BiSS-C, EnDat, Tamagawa, Hall](#interfaces) 9. [Encoder placement: motor-side vs load-side](#placement) 10. [Commutation encoders for BLDC/PMSM](#commutation) 11. [Noise, EMI, shielding, and cable length](#noise) 12. [Selecting an encoder: a resolution budget and a comparison table](#selecting) 13. [Calibration, eccentricity, and real accuracy from a magnetic encoder](#calibration) 14. [Frequently asked questions](#faq) ## Key takeaways - **Resolution ≠ accuracy.** Resolution is how finely the device divides a turn; accuracy is how far the reported angle is from the true angle. A 14-bit magnetic on-axis encoder can report 16,384 positions while being off by ±0.3° (~1,080 arc-sec). 
Spec to the number that matches the job. - **Incremental encoders count edges; they don't know absolute position at power-up.** A quadrature A/B pair gives you 4× resolution (CPR = 4 × PPR) and direction; the Z/index pulse gives one absolute reference per turn that you reach by homing. - **Absolute encoders know their angle the instant they power on** — no homing move. Single-turn covers one revolution; multi-turn tracks how many turns (battery-backed counter or geared gear-train). This is what you want on a robot joint with hard limits or a heavy load you don't want to swing on boot. - **Optical is the accuracy champion, magnetic is the robustness/cost champion.** Optical disc encoders reach sub-arc-second accuracy (Heidenhain, Renishaw) but hate dust, oil, and condensation. Magnetic (AS5047, iC-Haus, AMT) shrugs off contamination and shock but caps out around ±0.1–0.5° without calibration. - **Inductive encoders (Renishaw, CUI AMT, Zettlex) are the practical middle ground** — magnetic-grade robustness with better accuracy and immunity to stray magnetic fields. They're eating into both optical and magnetic markets in 2026. - **Resolvers are analog, brushless, and nearly indestructible** — operating to 200°C+, surviving shock and radiation, which is why aerospace, defense, and traction motors still use them. They need a resolver-to-digital converter (RDC) chip and a sine excitation. - **BiSS-C and EnDat 2.2 are the modern digital serial standards.** BiSS-C is open and royalty-light; EnDat is Heidenhain's ecosystem. Both give absolute position, CRC error checking, and fast cyclic reads (BiSS-C clocks to 10 MHz). Tamagawa is the dominant servo-motor encoder protocol in Asia. - **Load-side feedback beats motor-side when there's backlash or compliance.** A motor encoder behind a harmonic drive measures the motor, not the output; the gearbox's lost motion and torsional windup are invisible to it. Dual-encoder (motor + load) is the gold standard for precision arms. 
- **Commutation needs absolute position within one electrical cycle.** Hall sensors give you 60° electrical resolution (enough to start trapezoidal BLDC), but FOC wants a continuous absolute angle — an absolute single-turn encoder or UVW commutation tracks aligned to the rotor. - **Long cables kill encoders.** Use differential (RS-422) signaling, twisted pairs, shielded cable, and keep encoder runs away from motor phase leads. Single-ended quadrature past ~1 m near a PWM inverter is asking for miscounts. - **You only get real accuracy from a magnetic encoder by calibrating out eccentricity.** Mounting offset between the magnet and the sense IC produces a once-per-turn sinusoidal error; a lookup-table correction (or a self-cal routine) can cut error 5–10×. ## Why position feedback is the foundation of motion control Start from the control loop. A servo joint runs nested loops — position outside, velocity in the middle, current/torque inside — and every one of them needs to know where the shaft is or how fast it's moving. The current loop on a brushless motor needs the *electrical* angle to commutate correctly (see [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/)). The velocity loop needs a clean derivative of position. The position loop needs the absolute or relative angle of the joint. No encoder, no servo — you're back to running the motor open-loop and hoping. The dirty secret is that **velocity usually comes from differentiating position**, and differentiation amplifies quantization noise brutally. If your encoder gives you N counts per revolution and you sample at frequency f, the smallest non-zero velocity you can resolve in one sample period is one count: ``` v_min = (1 count) / (N counts/rev) × f [rev/s] ``` For a 1,000-CPR encoder sampled at 1 kHz, the smallest detectable speed is 1/1000 × 1000 = 1 rev/s = 60 RPM. 
Below that, the velocity estimate is all zeros and ones — a staircase that the velocity loop tries to chase, producing audible buzz and limit cycles at low speed. This is why direct-drive and quasi-direct-drive joints (low gear ratio, see [robot actuators](/posts/robot-actuators-ultimate-guide/)) demand high-resolution encoders: there's no gearbox multiplying the motor's motion into countable increments at the joint. > **Rule of thumb:** For smooth low-speed velocity control, you want at least 12-bit (4,096 CPR) resolution at the point you're controlling, and 17-bit+ (131,072 CPR) for direct-drive joints that must crawl smoothly. Gearing helps: a 100:1 reducer turns a 4,096-CPR motor encoder into 409,600 effective counts per output revolution — but only if the gearbox has no backlash or compliance (it does; see [section 9](#placement)). The encoder also sets your **commutation quality**. A brushless motor commutated with a coarse or noisy angle wastes current as torque ripple and heat. The smoothness of a [BLDC](/posts/brushless-dc-motors-bldc-ultimate-guide/) running FOC is directly limited by how cleanly the encoder reports the electrical angle. ## Incremental encoders: quadrature, PPR vs CPR, and the index pulse An incremental encoder produces a stream of pulses as the shaft turns. It does *not* know where it is at power-up — it only knows how far and which way it has moved since. That's the defining limitation and the source of the homing requirement. ### Quadrature A/B and the 4× trick The standard incremental encoder outputs two square-wave channels, **A** and **B**, 90° out of phase. That 90° phase relationship is "quadrature," and it carries two pieces of information at once: - **Direction:** which channel leads tells you CW vs CCW. If A leads B, one direction; if B leads A, the other. - **Resolution multiplication:** because A and B each have a rising and falling edge per cycle and they're offset, you get four distinct edges per signal period. 
A decoder that counts all four edges resolves 4× the line count.

This is where **PPR vs CPR** trips people up:

- **PPR (pulses per revolution)** = the number of full cycles of channel A per turn = the number of physical lines/poles on the disc. Also called the line count.
- **CPR (counts per revolution)** = the number of distinct decoded states per turn. With full quadrature decoding, **CPR = 4 × PPR**.

So a "1,024 PPR" encoder gives 4,096 CPR after 4× decoding. Datasheets and marketing love to mix these: a vendor will quote "4,096" and you have to figure out whether that's the line count (giving 16,384 CPR) or the post-decode count. Always confirm which.

### The index (Z) pulse

A third channel, **Z** (or I, for index), fires once per revolution at a fixed mechanical reference. It's how an incremental system establishes an absolute reference: drive the axis until you see Z, latch the count, and now you have a known zero. This is the "homing" sequence every incremental-feedback machine runs at boot.

### Decoding quadrature in firmware or hardware

Most modern MCUs have a hardware quadrature decoder in their timer peripherals (STM32 "encoder mode," TI C2000 eQEP). Use it: software decoding wastes cycles and risks missed edges at high speed. But understanding the state machine matters for debugging. Here's the classic 4× decode by state transition:

```c
#include <stdint.h>

// Quadrature 4x decode via state transition table.
// State = (A << 1) | B. Table index = (prev_state << 2) | new_state,
// so the previous state sits in bits [3:2] and the new state in bits [1:0].
// Each entry is the +1 / -1 / 0 position step for that transition.
// read_pin(), ENC_A and ENC_B are platform-specific GPIO hooks;
// read_pin() must return 0 or 1.
static const int8_t qdec_table[16] = {
    // new state (AB):  00  01  10  11
     0, -1, +1,  0,  // from 00
    +1,  0,  0, -1,  // from 01
    -1,  0,  0, +1,  // from 10
     0, +1, -1,  0   // from 11
};

static uint8_t prev_state = 0;
static volatile int32_t position = 0;
static volatile uint32_t error_count = 0;  // illegal transitions seen

void encoder_isr(void) {
    uint8_t a = read_pin(ENC_A);
    uint8_t b = read_pin(ENC_B);
    uint8_t state = (uint8_t)((a << 1) | b);
    uint8_t idx = (uint8_t)((prev_state << 2) | state);
    int8_t step = qdec_table[idx];
    if (step == 0 && prev_state != state) {
        // Both bits changed in one sample -> illegal transition.
        // Either a missed edge (too slow sampling) or noise.
        error_count++;
    }
    position += step;
    prev_state = state;
}
```

The case where `step == 0` but the state changed is your friend: an *illegal transition* (both A and B appearing to change between samples) means you either undersampled a fast edge or you're picking up noise. Watching `error_count` climb is the fastest way to catch an EMI problem or a too-slow ISR (see [section 11](#noise)).

> **Opinion:** Don't software-decode quadrature on a hot loop. If your platform lacks a hardware QEP, use a dedicated decoder IC (LS7366R SPI counter, iC-Haus iC-MD) rather than burning interrupts. A 10,000-CPR encoder at 6,000 RPM emits 1,000,000 counts/s, and an ISR per edge will eat a Cortex-M alive.

The big advantages of incremental: cheap, simple, well-understood, and the quadrature/RS-422 interface is universal. The big disadvantage: it forgets everything on power loss and needs a homing move. For a [BLDC](/posts/brushless-dc-motors-bldc-ultimate-guide/) you also need commutation info before the index is found, which is why pure incremental motors add Hall sensors or UVW tracks ([section 10](#commutation)).

## Absolute encoders: single-turn, multi-turn, and no homing

An absolute encoder reports its actual angular position the moment it powers on, with no movement required.
The disc (or magnetic/inductive pattern) is coded so that every angular position has a unique digital word — historically a Gray code on an optical disc, today usually a serial digital word over BiSS-C/EnDat/SSI. This single property — **knowing where you are at boot** — is worth a lot in robotics: - **No homing move.** A robot arm with an absolute encoder on each joint knows its full pose at power-up. No slamming joints into limit switches; no dangerous "find home" dance with a loaded arm hanging in space. - **Safety.** If you lose power mid-task and come back, you still know the pose. Critical for collaborative robots and anything that could fall under gravity. - **Hard limits.** You can enforce joint limits immediately, before the first commanded move. ### Single-turn vs multi-turn - **Single-turn absolute** uniquely encodes position within *one* revolution (0–360°). Perfect for a direct-drive joint that never exceeds one turn, or for commutation (which only cares about angle within an electrical cycle). - **Multi-turn absolute** also tracks *how many full turns* the shaft has made. Essential when the encoder sits before a gearbox (motor side) and the motor spins many turns per joint move, or on a leadscrew/linear axis. There are two ways to build multi-turn, and the choice has real reliability consequences: **Battery-backed (electronic) multi-turn.** A low-power counter keeps running off a backup battery or supercapacitor while main power is off, counting revolutions. Pros: unlimited turn range, compact. Cons: a battery to maintain and replace; if it dies while powered off, you lose the multi-turn count and must re-home. Most Tamagawa and many servo-motor absolute encoders are battery-backed. **Geared (mechanical/true) multi-turn.** A miniature gear train drives secondary code discs (like a mechanical odometer), so the turn count is physically encoded with no power needed. Pros: no battery, retains count indefinitely, true power-down memory. 
Cons: bulkier, the gear train adds a small accuracy/backlash term, finite turn range (e.g., 4,096 turns). RLS/AksIM and many Heidenhain multi-turn units use this; some use a Wiegand/energy-harvesting pulse to count turns with no battery at all. > **Opinion:** For a battery-free system that must survive months in storage and come back knowing its pose, geared or energy-harvesting multi-turn (RLS AksIM-2, Heidenhain, or a Wiegand-wire counter) beats battery-backed every time. The battery is the thing that strands a robot in the field. If you must use battery-backed, log the battery voltage and warn early. ### Output formats Absolute encoders speak digital serial: **SSI** (simple, clocked), **BiSS-C** (open, fast, CRC-checked), **EnDat 2.2** (Heidenhain), **Tamagawa** (servo-motor standard in Asia), or parallel Gray-code (legacy, lots of wires). We cover the protocols in detail in [section 8](#interfaces). ## Sensing technologies: optical, magnetic, capacitive, inductive The interface (how the encoder talks) is independent of the *sensing technology* (how it physically measures angle). Get the technology right for your environment first — no protocol fixes an optical encoder that fogged up. ### Optical A light source (LED) shines through or reflects off a patterned disc (glass for high-end, mylar/metal for cheaper) onto a photodetector array. Fine line spacing plus interpolation gives extraordinary resolution and accuracy. - **Strengths:** Highest accuracy (Heidenhain and Renishaw reach ±1 to ±5 arc-seconds on precision units), highest resolution (28–32 bits with interpolation), low noise. - **Weaknesses:** Hates contamination — dust, oil, condensation, and fingerprints degrade or kill it. Sensitive to shock/vibration (glass disc). More expensive. Bulkier. - **Use when:** Metrology, machine tools, precision robotics in clean environments, semiconductor handling. 
### Magnetic A diametrically magnetized magnet on the shaft spins over a Hall-effect or magnetoresistive (AMR/TMR) sensor array. The IC computes angle from the field vector. The AS5047/AS5048 (ams), MA732 (Monolithic Power), and iC-Haus iC-MU/iC-PV families dominate here. - **Strengths:** Cheap, tiny, robust against dust/oil/condensation, tolerant of shock and vibration, works through non-magnetic barriers. On-axis versions integrate the whole thing in one IC. - **Weaknesses:** Lower accuracy (±0.1° to ±0.5° typical without calibration), sensitive to stray magnetic fields (a nearby motor or magnet), affected by air-gap and eccentricity, temperature drift. - **Use when:** Cost-sensitive, dirty, or harsh-vibration environments; commutation feedback; high-volume products. ### Capacitive A patterned rotor changes capacitance over a sensing array; an ASIC reads the angle. CUI's AMT series popularized this in robotics. - **Strengths:** Robust against dust and magnetic fields (immune to the stray-field problem magnetics have), low power, modular/mountable, often field-configurable resolution. Mid-range price. - **Weaknesses:** Mid-range accuracy (~±0.1–0.2°), sensitive to humidity/condensation and conductive contamination, less common at very high resolution. - **Use when:** You want a magnetic-free, configurable, easy-to-mount encoder near magnets — common on robot joints and benign industrial gear. The CUI AMT102/AMT212 are maker and integrator favorites. ### Inductive A PCB-based transmit coil induces eddy currents in a passive metal target (rotor); receive coils pick up the position-dependent coupling. Renishaw's encoders, CUI's AMT inductive line, and Zettlex (now Celera Motion) pioneered this. - **Strengths:** Magnetic-grade robustness (handles dust, oil, vibration, shock) *plus* immunity to stray DC magnetic fields and better accuracy than basic magnetic (down to arc-minutes). No precision glass, no fragile parts. Works over a larger air gap. 
Increasingly the default for harsh robotics.
- **Weaknesses:** Larger PCB footprint than an on-axis magnetic IC; sensitive to nearby conductive metal and to the target's concentricity; somewhat higher cost than a bare magnetic IC.
- **Use when:** Harsh robotics that still needs decent accuracy and can't tolerate magnetic encoders' stray-field sensitivity. This is the category quietly winning in 2026.

### Comparison table

| Technology | Typical accuracy | Max resolution | Contamination tolerance | Stray-field immunity | Relative cost | Representative parts |
|---|---|---|---|---|---|---|
| Optical | ±1 to ±20 arc-sec | 28–32 bit | Poor (sealed helps) | Excellent | High | Heidenhain ECN/RCN, Renishaw RESOLUTE, US Digital E5 |
| Magnetic (on-axis) | ±0.1° to ±0.5° | 12–17 bit | Excellent | Poor | Low | ams AS5047/AS5048, MPS MA732, iC-Haus iC-MU |
| Capacitive | ±0.1° to ±0.2° | 12–14 bit | Good (not humidity) | Excellent | Medium | CUI AMT102/AMT212/AMT232 |
| Inductive | ±arc-minutes to ±0.05° | 18–22 bit | Excellent | Excellent | Medium-High | Renishaw, CUI AMT inductive, Celera/Zettlex IncOder |
| Resolver | ±5 to ±20 arc-min | RDC-set (10–16 bit) | Excellent | Excellent | Medium | LTN, Tamagawa Smartsyn, Moog |

> **Opinion:** If you're building a robot arm and your knee-jerk is "magnetic, it's cheap and rugged," seriously look at inductive first. You keep the ruggedness, lose the stray-field headache (your motor *is* a stray field), and gain a half-decimal-place of accuracy. The price gap has shrunk a lot.

## Resolvers: the rugged analog veteran

A resolver is essentially a rotary transformer. A primary winding on the rotor is excited with an AC reference (typically 5–10 kHz sine), and two stator windings, mechanically 90° apart, output AC signals whose *amplitudes* are modulated by the rotor angle:

```
S1 ≈ E·sin(ωt)·sin(θ)   // SIN output winding
S2 ≈ E·sin(ωt)·cos(θ)   // COS output winding
```

Take the ratio of the two envelopes and `θ = atan2(SIN, COS)`.
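
A minimal numeric sketch of that ratio step, assuming the two envelopes have already been demodulated from the carrier. The function name `resolver_angle` and the gain values are illustrative and not taken from any RDC datasheet. It shows that scaling both channels by the same factor leaves the recovered angle unchanged.

```python
import math

def resolver_angle(sin_env: float, cos_env: float) -> float:
    """Recover the rotor angle in radians, range [0, 2*pi), from SIN/COS envelopes."""
    return math.atan2(sin_env, cos_env) % (2 * math.pi)

theta_true = math.radians(37.0)

# Hypothetical amplitude drift that affects both windings equally.
for gain in (1.0, 0.6, 1.8):
    sin_env = gain * math.sin(theta_true)
    cos_env = gain * math.cos(theta_true)
    recovered = math.degrees(resolver_angle(sin_env, cos_env))
    print(f"gain={gain:.1f} -> {recovered:.3f} deg")
```

A real RDC uses a tracking loop rather than a direct `atan2`, but the insensitivity to common gain comes from the same ratio.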
The angle lives in the *ratio* of two signals, so it's immune to amplitude drift, supply variation, and a lot of noise — a key reason resolvers are so robust. ### Why aerospace, defense, and traction love them - **No electronics in the sensor.** Just copper windings and iron. That means they operate from cryogenic to **200°C+**, survive radiation, shock to hundreds of g, vibration, and decades of service. Nothing to degrade. - **Brushless variants** use a rotary transformer to couple excitation to the rotor — no brushes, no wear. - **Inherently absolute within one turn** (or one electrical cycle for multi-speed resolvers). This is why you find resolvers on aircraft control surfaces, missiles, military servo systems, EV/hybrid traction motors, and steel-mill drives — anywhere the environment would destroy an optical disc or a silicon angle IC. ### The resolver-to-digital converter (RDC) A resolver isn't directly digital; you need an **RDC** chip to generate the excitation and demodulate SIN/COS into a digital angle and velocity. The classic part is the **Analog Devices AD2S1210** (10/12/14/16-bit selectable, tracking converter with velocity output). Newer integrated solutions and microcontroller-based RDC (sampling SIN/COS with ADCs and running a tracking observer in firmware) are common too. > **Watch the tradeoffs:** Resolvers give modest accuracy (±5 to ±20 arc-minutes for a standard single-speed unit; multi-speed resolvers do better) and the RDC resolution is selectable but usually 10–16 bit. They also cost board space, need a tuned excitation, and the cabling carries analog signals you must shield. You pick a resolver for *survival*, not for arc-second accuracy. For a traction or industrial [BLDC/PMSM](/posts/brushless-dc-motors-bldc-ultimate-guide/), the resolver doubles as the commutation sensor — absolute electrical angle straight out of the RDC, no homing, in an environment that would kill an optical encoder. 
That combination is why they persist despite the analog overhead.

## The numbers that matter: resolution, accuracy, repeatability, latency

This is the section to read twice. Most encoder mistakes are spec mistakes.

### Resolution

The number of distinguishable positions per revolution. Expressed as CPR/PPR (incremental) or bits (absolute):

```
positions_per_rev = 2^bits
angular_step = 360° / 2^bits = 1,296,000 arc-sec / 2^bits
```

| Bits | Counts/rev | Arc-sec/count | Degrees/count |
|---|---|---|---|
| 10 | 1,024 | 1,266 | 0.352° |
| 12 | 4,096 | 316 | 0.088° |
| 14 | 16,384 | 79 | 0.022° |
| 17 | 131,072 | 9.9 | 0.0027° |
| 20 | 1,048,576 | 1.24 | 0.00034° |
| 23 | 8,388,608 | 0.155 | 0.000043° |

Resolution determines velocity-estimate smoothness and the finest *commanded* increment. It is a quantization number, nothing more.

### Accuracy

How close the *reported* angle is to the *true* angle, including disc/pattern errors, interpolation error, eccentricity, and temperature. Always worse than resolution. Often the spec that actually limits your robot's pointing or end-effector position.

> **The central rule, again:** Resolution ≠ accuracy. A cheap magnetic IC reporting 14 bits (79 arc-sec/count) can carry ±0.3° (~1,080 arc-sec) of absolute error, so roughly 14 counts of its reading, about its bottom 4 bits, are pure fiction as far as true angle goes. Use those counts for *velocity* and *interpolation* smoothness, not for *absolute pointing*, unless you've calibrated ([section 13](#calibration)).

### Repeatability

How consistently the encoder reports the same value for the same physical position, run to run. Often much better than accuracy — a systematic nonlinearity error repeats, so it doesn't hurt return-to-home. This is why a robot can have mediocre absolute accuracy but excellent repeatability (and why we calibrate: turn good repeatability into good accuracy via a lookup table).
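
The gap between resolution and accuracy is easy to put numbers on. The sketch below works through the 14-bit, ±0.3° example from the accuracy discussion above. The helper names (`arcsec_per_count`, `effective_bits`) are made up for illustration, and "effective bits" here is just a rough log2 estimate, not a standardized metric.

```python
import math

ARCSEC_PER_REV = 360 * 3600  # 1,296,000 arc-sec in one revolution

def arcsec_per_count(bits: int) -> float:
    """Angular size of one count for an encoder with the given bit depth."""
    return ARCSEC_PER_REV / 2**bits

def effective_bits(accuracy_deg: float) -> float:
    """Rough bit depth at which one count equals the one-sided error magnitude."""
    return math.log2(ARCSEC_PER_REV / (accuracy_deg * 3600))

bits = 14
accuracy_deg = 0.3  # +/- absolute error, uncalibrated

step = arcsec_per_count(bits)
counts_inside_error = accuracy_deg * 3600 / step

print(f"{bits}-bit step: {step:.1f} arc-sec")
print(f"counts inside +/-{accuracy_deg} deg: {counts_inside_error:.1f}")
print(f"bits the accuracy supports: {effective_bits(accuracy_deg):.1f}")
```

Under these assumptions the error band spans between 13 and 14 counts, so only about 10 of the 14 reported bits say anything about true angle.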
### Hysteresis The difference in reported position approaching the same point from opposite directions. In magnetic/inductive systems it's an electrical/filtering artifact; in geared multi-turn it's mechanical backlash. Matters for bidirectional positioning and for velocity-loop stability near zero speed. ### Maximum speed Two limits stack up: - **Mechanical:** bearing/disc RPM rating (optical glass discs and ball-bearing housings limit this). - **Electrical:** the maximum count or output rate. An incremental encoder has a max output frequency; an absolute encoder has a max angular speed beyond which interpolation can't keep up and you get a tracking error or a velocity-warning flag. ``` f_out_max [Hz] = CPR × RPM_max / 60 ``` A 10,000-CPR encoder at 10,000 RPM = 1.67 MHz output — well within RS-422, but check the receiver and decoder rating. ### Latency For digital absolute encoders, the time from "I asked" to "I have the angle" — propagation delay plus protocol transaction time. It directly adds phase lag to your control loop. BiSS-C and EnDat 2.2 minimize this with fast clocks and cyclic reads; a slow SSI poll over a long cable adds microseconds that hurt a fast current loop. For a 10–20 kHz FOC loop, you want the position read to complete in a small fraction of the loop period. > **Rule of thumb:** Budget encoder read latency under ~10% of your fastest loop period. At a 20 kHz current loop (50 µs), keep the position read under ~5 µs. BiSS-C at 10 MHz reading 26 bits + CRC fits; a 1 MHz SSI poll might not. ## Digital interfaces: quadrature, SSI, BiSS-C, EnDat, Tamagawa, Hall The interface is how the encoder hands position to your controller. Pick it to match your controller's hardware support, your latency budget, and whether you need error checking. ### Incremental quadrature (A/B/Z, RS-422) Three differential pairs (A/A̅, B/B̅, Z/Z̅) over RS-422. Universal, simple, decoded by MCU timers. No absolute info, no CRC, no diagnostics. 
Best for cheap incremental feedback and legacy machine retrofits.

### SSI (Synchronous Serial Interface)

Clocked unidirectional serial. The controller drives a clock; the encoder shifts out its absolute position MSB-first on data. Simple, widely supported, but: no standard CRC (some add it), no register access, and the data is only valid as fast as you clock it. Common on older absolute encoders and many industrial sensors. Gray-code option avoids transition glitches.

### BiSS-C

Open, license-light bidirectional serial from iC-Haus, built on the SSI physical layer but adding **CRC error checking**, fast clocking (to 10 MHz), and a register-access channel for configuration and diagnostics. Point-to-point, low latency, no royalties — which is exactly why it's everywhere in modern robotics and servo drives. RLS/AksIM, iC-Haus parts, and a huge swath of motor encoders speak BiSS-C.

```
BiSS-C single-cycle read (simplified):
  Controller drives MA clock burst. Encoder responds on SLO line:
  [ Ack ][ Start=1 ][ CDS ][ position (n bits) ][ Error ][ Warn ][ CRC6 ]
  - position: absolute angle, MSB first (e.g. 26 bits = 18 single-turn + 8 multi-turn... varies)
  - Error/Warn: live status flags (LED degraded, speed exceeded, etc.)
  - CRC6: inverted, polynomial 0x43 — verify EVERY frame; a failed CRC means drop the sample.
```

> **Opinion:** For a new robotics design needing absolute feedback, BiSS-C is the default I reach for. It's open, fast, CRC-protected, and supported by RLS, iC-Haus, and most drive ICs. EnDat is excellent but ties you to the Heidenhain ecosystem and licensing; SSI is fine but you give up the CRC and diagnostics that catch field failures.

### EnDat 2.2

Heidenhain's bidirectional digital protocol. Absolute position, CRC, parameter memory, error/warning flags, and the ability to read temperature and diagnostics from the encoder. Excellent, tightly integrated with Heidenhain encoders and the drives that support them.
Choose it when you're buying Heidenhain glass and want the full diagnostic stack.

### Tamagawa (and the servo-motor family)

Tamagawa's smart-encoder serial protocol is the de facto standard on Asian servo motors (and many global ones); related protocols include Nikon, Sankyo, and Panasonic variants. Half-duplex, absolute, with battery-backed multi-turn support. If you're integrating Asian servo motors, your drive needs to speak Tamagawa.

### Hall and UVW commutation tracks

Not a position-reporting interface so much as a commutation aid. Three Hall signals give 6 states = 60° electrical resolution. UVW tracks on an encoder do the same in the encoder body. Enough to start a trapezoidal [BLDC](/posts/brushless-dc-motors-bldc-ultimate-guide/); not enough for smooth FOC, which wants the fine A/B or absolute angle ([section 10](#commutation), and [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/)).

### Interface comparison table

| Interface | Direction | Absolute? | Error check | Max speed/clock | Diagnostics | Best for |
|---|---|---|---|---|---|---|
| Quadrature A/B/Z | Output only | No (Z = ref) | None | RS-422, MHz edges | None | Cheap incremental, retrofits |
| SSI | Clocked read | Yes | Optional | ~1–2 MHz typical | Minimal | Legacy absolute, simple sensors |
| BiSS-C | Bidirectional | Yes | CRC | 10 MHz | Yes (register channel) | Modern robotics/servo, default |
| EnDat 2.2 | Bidirectional | Yes | CRC | Fast cyclic | Yes (temp, params) | Heidenhain ecosystem |
| Tamagawa | Half-duplex | Yes | CRC | ~2.5 Mbps | Yes | Asian/global servo motors |
| Hall/UVW | Output only | 60° elec | None | — | None | BLDC commutation start |

## Encoder placement: motor-side vs load-side

Where you mount the encoder changes what it actually measures — and this is one of the most consequential and under-appreciated decisions in a precision robot.

### Motor-side feedback

The encoder sits on the motor shaft, *before* the gearbox.
This is the cheap, default, integrated-servo arrangement. - **Pro:** High effective resolution at the joint (the gear ratio multiplies counts), small fast encoder, ideal for commutation (it reads the rotor directly), low cost. - **Con:** It measures the *motor*, not the *load*. Everything the gearbox does between motor and output — backlash, torsional windup, lost motion, hysteresis — is invisible. The controller thinks the joint is at angle X; the output is actually somewhere in a band around X/ratio. For a harmonic drive (see [gearboxes: harmonic & cycloidal](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/)), this matters a lot: harmonic drives have near-zero backlash but real *torsional compliance* — under load the output deflects relative to the motor by an angle the motor encoder can't see. A cycloidal drive has its own lost-motion and ripple signature. A motor-side encoder behind either is blind to all of it. ### Load-side feedback A second (usually absolute) encoder mounts on the *output* — the joint itself, after the gearbox. - **Pro:** Measures the thing you actually care about. Closes out backlash, compliance, and gear nonlinearity. Essential for accurate end-effector positioning and for high-stiffness force control. - **Con:** More cost, more wiring, and the gearbox compliance now sits *inside* your position loop — which can destabilize a naive controller (you can get a resonance between motor inertia and gearbox spring with the load encoder feedback). You need dual-loop control to do it right. ### Dual-encoder (the gold standard) The best precision arms run **both**: motor-side for fast inner-loop velocity/commutation and load-side for the outer absolute-position loop. The motor encoder gives you a clean, high-bandwidth velocity signal (no compliance in the path), and the load encoder gives you true output position. This dual-loop architecture is how robots like high-end collaborative arms and surgical robots hit their accuracy. 
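
The dual-loop idea can be sketched in a few lines. This is a deliberately stripped-down illustration with proportional-only gains and hypothetical names (`DualLoopController`, `kp_pos`, `kp_vel`, `vel_limit`). A real drive would add integral action, feedforward, and filtering, and the gain values shown are placeholders rather than tuned numbers.

```python
from dataclasses import dataclass

@dataclass
class DualLoopController:
    kp_pos: float      # outer gain: load-angle error [rad] -> load velocity command [rad/s]
    kp_vel: float      # inner gain: motor-velocity error [rad/s] -> torque command
    gear_ratio: float  # motor turns per output turn
    vel_limit: float   # clamp on the commanded motor velocity [rad/s]

    def update(self, load_target: float, load_angle: float, motor_velocity: float) -> float:
        """One control tick: outer loop on the load encoder, inner loop on the motor encoder."""
        # Outer loop: true output position comes from the load-side encoder.
        load_vel_cmd = self.kp_pos * (load_target - load_angle)
        # Refer the command to the motor shaft and clamp it.
        motor_vel_cmd = load_vel_cmd * self.gear_ratio
        motor_vel_cmd = max(-self.vel_limit, min(self.vel_limit, motor_vel_cmd))
        # Inner loop: velocity comes from the motor-side encoder, ahead of the gearbox compliance.
        return self.kp_vel * (motor_vel_cmd - motor_velocity)

ctrl = DualLoopController(kp_pos=20.0, kp_vel=0.05, gear_ratio=100.0, vel_limit=300.0)
torque_cmd = ctrl.update(load_target=0.5, load_angle=0.49, motor_velocity=0.0)
print(f"torque command: {torque_cmd:.3f}")
```

Keeping the velocity feedback on the motor side is the design choice that keeps the fast inner loop away from the gearbox spring; only the slower outer loop sees the compliant output.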
> **Rule:** If your gear train has backlash or meaningful compliance and you care about absolute output accuracy, a motor-side encoder alone will lie to you. Either go dual-encoder, or characterize and compensate the gearbox — and accept that compliance compensation is open-loop and load-dependent. There is no free lunch here. ## Commutation encoders for BLDC/PMSM A brushless motor needs to know the **rotor's electrical angle** to energize the right windings — that's commutation (see [BLDC](/posts/brushless-dc-motors-bldc-ultimate-guide/) and [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/)). The encoder's commutation job is different from its position-feedback job, and confusing them causes the classic "motor cogs or runs backward on first power-up" bug. ### Hall sensors: enough to start Three Hall-effect sensors spaced 120° (electrical) give 6 unique states per electrical revolution = 60° resolution. That's coarse, but it's *absolute at power-up*: you instantly know which 60° sector the rotor is in, enough to start six-step (trapezoidal) commutation without a homing move. Cheap, robust, the standard for low-cost BLDC and for getting moving before a finer sensor is referenced. ### Why Halls aren't enough for FOC Field-oriented control needs a *continuous* electrical angle to align the current vector 90° ahead of the rotor flux. 60° Hall steps produce torque ripple and inefficiency if used directly. So FOC needs either: - a fine **incremental** encoder *plus* a commutation reference (Halls or UVW tracks, or an index-find alignment routine), or - an **absolute single-turn** encoder, which hands FOC the electrical angle directly, with no alignment dance. ### UVW commutation tracks Some encoders provide three extra "UVW" outputs that mimic Hall signals, generated from the encoder's own pattern and aligned to the motor poles. They give the drive the startup sector without separate Hall sensors. 
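
As a rough illustration of what the drive does with three Hall or UVW lines, the sketch below maps a 3-bit state to one of six 60° electrical sectors. The state order in `HALL_SEQUENCE` is an assumption for illustration; the actual order depends on the motor and wiring and should be confirmed against the datasheet or a slow open-loop spin.

```python
# Assumed forward order of (U, V, W) states over one electrical revolution.
# Adjacent states differ by a single bit; 000 and 111 never occur on a healthy sensor set.
HALL_SEQUENCE = (0b001, 0b011, 0b010, 0b110, 0b100, 0b101)
SECTOR_OF_STATE = {state: sector for sector, state in enumerate(HALL_SEQUENCE)}

def hall_sector(u: int, v: int, w: int) -> int:
    """Return the 60-degree electrical sector (0..5) for the given Hall/UVW levels."""
    state = (u << 2) | (v << 1) | w
    if state not in SECTOR_OF_STATE:
        raise ValueError(f"invalid Hall state {state:03b}: check wiring or sensor")
    return SECTOR_OF_STATE[state]

def sector_center_deg(sector: int) -> float:
    """Best-guess electrical angle at power-up: the middle of the sector."""
    return sector * 60.0 + 30.0

sector = hall_sector(0, 1, 1)
print(f"sector {sector}, assume ~{sector_center_deg(sector):.0f} deg electrical")
```

The worst-case error of that startup guess is ±30° electrical, which is enough to produce torque in the right direction but not enough for smooth running.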
You still need the fine A/B or absolute channel for smooth running. ### The alignment problem For an incremental encoder to commutate a BLDC, the controller must learn the offset between the encoder's zero and the rotor's electrical zero. Two approaches: 1. **Forced alignment:** push current into a known phase, the rotor snaps to a known electrical angle, latch the encoder reading as the offset. Simple, but it twitches the motor (bad on a loaded joint) and is sensitive to load/friction. 2. **Use an absolute encoder:** the angle is known at boot, so the offset is a one-time calibration constant stored in the drive. No twitch. This is why **absolute single-turn encoders are the clean answer for BLDC commutation** in robotics — power on, you already know the electrical angle. > **Opinion:** On a robot joint that holds a load against gravity, never accept a commutation scheme that requires a power-on alignment twitch. Use an absolute encoder (or a resolver) so the rotor angle is known before you energize. The twitch is harmless on a benchtop and dangerous on an arm holding a payload. ## Noise, EMI, shielding, and cable length Encoders fail in the field for unglamorous reasons, and electrical noise tops the list. The encoder sits a few centimeters from a switching inverter pushing tens of amps with nanosecond edges. That's a hostile RF environment, and the encoder cable is your antenna. ### Differential signaling is non-negotiable on real machines Single-ended quadrature (one wire per channel referenced to ground) is fine on a 20 cm desk hookup. On a machine, use **differential RS-422**: each channel as a complementary pair (A and A̅) over a twisted pair. The receiver looks at the *difference*, so common-mode noise injected equally on both wires cancels. This single change is the difference between an encoder that miscounts under load and one that doesn't. 
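
Noise-induced miscounts can be made visible in firmware. Below is a minimal software quadrature decoder that counts 4× edges and tallies illegal transitions (both channels changing at once), which is one common early-warning sign of a noisy line. The class name, the direction convention in `_FORWARD`, and the sample sequence are illustrative assumptions; on real hardware a timer peripheral usually does the counting.

```python
# Assumed forward order of the 2-bit state (A << 1) | B. Swap A and B to flip direction.
_FORWARD = (0b00, 0b01, 0b11, 0b10)

class QuadratureDecoder:
    def __init__(self, a: int = 0, b: int = 0) -> None:
        self.state = (a << 1) | b
        self.count = 0    # signed position in 4x counts
        self.illegal = 0  # transitions where A and B changed together

    def update(self, a: int, b: int) -> None:
        """Feed one sample of the A and B levels (0 or 1)."""
        new_state = (a << 1) | b
        if new_state == self.state:
            return
        step = (_FORWARD.index(new_state) - _FORWARD.index(self.state)) % 4
        if step == 1:
            self.count += 1
        elif step == 3:
            self.count -= 1
        else:
            # Two-step jump: a missed edge or injected noise. Position is now suspect.
            self.illegal += 1
        self.state = new_state

decoder = QuadratureDecoder()
for a, b in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]:  # last sample is a simulated glitch
    decoder.update(a, b)
print(decoder.count, decoder.illegal)
```

Logging the illegal-transition counter while the motor runs under load is a cheap way to tell a wiring problem from a mechanical one.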
### Practical failure modes and fixes - **Long single-ended runs near motor leads → miscounts.** Symptom: position drifts only when the motor is moving/loaded. Fix: differential signaling, separate the encoder cable from phase leads, add shielding. - **Unterminated or wrong-impedance lines → ringing and double-counts.** Fix: terminate RS-422 pairs (typically 120 Ω) at the receiver. - **Shield grounded at both ends → ground loop.** Fix: ground the cable shield at *one* end only (typically the controller/drive end). Ground loops inject current through the shield and couple noise. - **Routing parallel to PWM phase wires → capacitive/inductive coupling.** Fix: cross motor wires at 90°, keep physical separation, use shielded conduit for the encoder run. - **Cable length exceeding driver capability → degraded edges, latency.** RS-422 can run tens of meters, but rise-time degradation eats your max count frequency. For absolute serial (BiSS-C/SSI), long cables limit max clock — propagation delay forces a slower clock or BiSS-C "processing time" compensation. - **Condensation/contamination on optical discs → dropouts.** Fix: sealed encoders, or switch sensing technology (magnetic/inductive) for wet/dirty environments. > **Rule:** Treat the encoder cable like the sensitive analog/digital line it is. Differential pairs, twisted, shielded, shield grounded one end, routed away from power, terminated correctly. Most "the encoder is flaky" tickets are really "the wiring is wrong" tickets. For absolute serial protocols over long cable, BiSS-C and EnDat both have line-delay compensation: the controller measures or is told the round-trip propagation delay and adjusts the sampling so the data lines up. Use it on runs over a couple of meters or you'll cap your clock rate and add latency to your loop ([section 6](#numbers)). ## Selecting an encoder: a resolution budget and a comparison table Don't start from the encoder catalog. Start from the *control requirement* and derive the spec. 
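
The arithmetic behind Steps 1 and 2 below is short enough to script, which makes it easy to rerun for each joint. This is a sketch using a small-angle approximation and the one-count-per-loop-period criterion from the velocity-quantization equation; the function names are invented for illustration.

```python
import math

ARCSEC_PER_RAD = 180 * 3600 / math.pi  # about 206,265

def joint_accuracy_arcsec(tip_error_mm: float, reach_mm: float) -> float:
    """Per-joint angular accuracy needed to hold a tip error at a given reach (small angle)."""
    return (tip_error_mm / reach_mm) * ARCSEC_PER_RAD

def required_cpr(loop_hz: float, min_smooth_rpm: float) -> float:
    """Counts/rev so that at least one count arrives per loop period at the minimum speed."""
    return loop_hz * 60.0 / min_smooth_rpm

def required_bits(cpr: float) -> int:
    """Smallest whole bit depth that meets the counts/rev requirement."""
    return math.ceil(math.log2(cpr))

accuracy = joint_accuracy_arcsec(tip_error_mm=0.1, reach_mm=500.0)
cpr = required_cpr(loop_hz=1000.0, min_smooth_rpm=1.0)

print(f"joint accuracy needed: +/-{accuracy:.0f} arc-sec")
print(f"resolution needed: {cpr:.0f} counts/rev (~{required_bits(cpr)} bit)")
```

The two results answer different questions: the first constrains the sensing technology and calibration, the second constrains bit depth, and neither substitutes for the other.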
### Step 1 — Set the accuracy requirement from the application What absolute positioning error can the *end effector* tolerate? Work that back through the kinematics to a per-joint angular accuracy. If the arm must hold ±0.1 mm at a 500 mm reach, that's roughly ±0.0002 rad ≈ ±41 arc-sec at the joint — before you even budget for gearbox and structural errors. That tells you whether you need optical/inductive accuracy or whether magnetic-plus-calibration suffices. ### Step 2 — Set the resolution from the velocity loop Pick the minimum smooth speed and the loop rate, and invert the velocity-quantization equation from [section 2](#foundation): ``` required_CPR ≥ f_loop / v_min_smooth [counts/rev], v in rev/s ``` For 1 RPM (0.0167 rev/s) smooth control at a 1 kHz loop: CPR ≥ 1000 / 0.0167 ≈ 60,000 counts/rev (~16 bit). Gearing relaxes the motor-encoder requirement by the gear ratio — but only the motor-side smoothness; load-side still needs its own resolution. ### Step 3 — Pick absolute vs incremental Need to know pose at power-up, can't tolerate a homing move, holding a gravity load? Absolute (single-turn for direct drive, multi-turn before a gearbox). Pure speed/velocity job, homing is fine, cost-critical? Incremental. ### Step 4 — Pick the interface Match your controller/drive. BiSS-C for a modern open design; EnDat if you're in the Heidenhain world; Tamagawa for Asian servo motors; quadrature for the cheapest incremental; SSI for legacy. ### Step 5 — Pick sensing technology from the environment Clean and precise → optical. Dirty/vibrating/cheap → magnetic. Dirty but needs accuracy and sits near magnets → inductive or capacitive. Extreme temp/shock/radiation → resolver. ### Step 6 — Form factor and mounting Through-bore vs shafted vs on-axis (magnet on shaft end). Through-bore is great for hollow-shaft robot joints (route cables through). 
Kit/modular encoders (separate read head + disc/ring) save space and weight on integrated actuators but demand careful mounting tolerance (air gap, concentricity) — which feeds straight into [calibration](#calibration).

### Real-product comparison table

| Product | Type | Tech | Resolution | Accuracy (typ) | Interface | Notable |
|---|---|---|---|---|---|---|
| US Digital E5 | Incremental, kit | Optical | to 5,000 PPR (20,000 CPR) | — (incremental) | Quadrature A/B/Z | Cheap, maker/industrial staple |
| CUI AMT212B-V | Absolute single-turn | Capacitive | 12–14 bit, configurable | ±0.2° | RS-485 | Modular, magnetic-free, configurable |
| ams AS5047P | Absolute single-turn | Magnetic on-axis | 14 bit | ±0.8° (uncal, max) | ABI/UVW/SPI/PWM | Tiny IC, built for FOC commutation |
| iC-Haus iC-MU | Absolute, kit | Magnetic (BiSS) | up to 18 bit | ~±0.5° (cal-dependent) | BiSS-C | High-res magnetic, robotics-friendly |
| RLS AksIM-2 | Absolute, off-axis ring | Magnetic | up to 20 bit | ±0.007° (calibrated grades) | BiSS-C / SSI / SPI | Large-bore, functional-safety options |
| Renishaw RESOLUTE | Absolute, linear/rotary | Optical | to ~32 bit (1 nm linear) | sub-arc-sec | BiSS-C / others | Metrology-grade, fast |
| Heidenhain ECN1325 | Absolute single-turn | Optical | 25 bit | ±20 arc-sec | EnDat 2.2 | Servo-motor integrated, diagnostics |
| Broadcom AEAT-9000 | Absolute single-turn | Optical | 17 bit | ±0.025° | SSI | High-res optical module |
| Tamagawa TS5700N8401 | Absolute multi-turn | Optical/magnetic | 17 bit ST + 16 bit MT | — | Tamagawa serial | Battery-backed, servo standard |
| Celera/Zettlex IncOder | Absolute | Inductive | up to ~22 bit | to ±arc-sec (grade) | SSI/BiSS/SPI | Large-bore, rugged, magnetic-free |

> **Opinion:** For a new robot joint in 2026 I'd default to an RLS AksIM-2 (or an inductive ring like the IncOder) on BiSS-C for the load side, and an integrated absolute magnetic (AS5047-class) on the motor for commutation.
You get true output accuracy where it matters, robust commutation where it's cheap, CRC-checked serial throughout, and no homing move. Reach for optical (Renishaw/Heidenhain) only when you genuinely need sub-arc-second and can keep the disc clean. ## Calibration, eccentricity, and real accuracy from a magnetic encoder Here's how you turn a cheap, high-repeatability magnetic encoder into a usefully accurate one — and why it works. ### The dominant error: eccentricity In an on-axis or ring magnetic encoder, the single biggest accuracy killer is **eccentricity** — the sense IC (or read head) not being perfectly centered on the magnet/ring's rotation axis. A small radial offset between the magnetic center and the mechanical rotation axis produces a **once-per-revolution sinusoidal error**: ``` θ_error(θ) ≈ (e / R) · sin(θ - φ) [radians] ``` where `e` is the eccentricity offset, `R` is the code-track radius, and `φ` is the phase of the offset direction. A 50 µm eccentricity on a 10 mm-radius ring gives ~5 mrad ≈ ±0.29° peak — which is exactly the order of the "±0.3°" you see in uncalibrated magnetic specs. Mounting tolerance, not silicon, dominates the error. Off-axis ring encoders add higher harmonics (2nd, 3rd) from ring distortion and read-head geometry, but the 1st harmonic (eccentricity) is usually the big one. ### Calibration: turn repeatability into accuracy Because these errors are *systematic and repeatable*, you can measure and subtract them: 1. **Get a reference.** Compare the encoder against a known-accurate reference (a calibrated optical encoder, a rotary index table, or — clever trick — a second encoder of the same type mounted 180° opposite, which cancels the 1st harmonic and lets you self-characterize). 2. **Sweep a full revolution** logging reported vs true angle. 3. **Fit the error.** Either store a dense lookup table (e.g., 1,024 points) or fit the dominant harmonics (`a₁·sin(θ+φ₁) + a₂·sin(2θ+φ₂) + ...`). 
Harmonic fitting is compact and generalizes; LUT is simplest. 4. **Correct in firmware:** `θ_corrected = θ_raw − error_estimate(θ_raw)`. A good eccentricity/harmonic calibration routinely cuts magnetic-encoder error **5–10×** — taking a ±0.3° raw encoder to ±0.03–0.05°, approaching inductive territory. Many modern ICs (iC-Haus iC-MU, ams, MPS) and modules (RLS AksIM "calibrated" grades) build self-calibration in; AksIM's calibrated grades hit ±0.007° precisely because they characterize each unit on the ring. ### Practical mounting that saves you calibration grief - **Air gap:** hold the read-head-to-target gap within the datasheet window (often 0.1–1.0 mm for magnetic, wider for inductive). Too far = weak signal/noise; too close = saturation/nonlinearity. - **Concentricity:** center the magnet/ring on the rotation axis as tightly as the budget allows — it's cheaper to mount well than to calibrate. - **Stray fields:** keep the motor magnets and current-carrying conductors away from a magnetic sense IC, or use inductive/capacitive to sidestep the problem entirely. - **Temperature:** magnetic field strength and IC offsets drift with temperature; characterize over your operating range if you need the last bit of accuracy. > **Opinion:** A calibrated magnetic encoder is one of the best price/performance plays in robotics — you get inductive-class accuracy from a sub-$10 IC by spending engineering time on mounting and a calibration sweep. The catch is you must own that calibration step; if you can't run a per-unit (or at least per-design) cal in production, buy the accuracy in silicon (inductive or factory-calibrated module) instead. ## Frequently asked questions **What's the practical difference between PPR and CPR?** PPR (pulses per revolution) is the encoder's physical line count — one full cycle of channel A per line. 
CPR (counts per revolution) is what you get after quadrature decoding: CPR = 4 × PPR, because A and B each contribute a rising and falling edge offset by 90°. A 1,000-PPR encoder yields 4,000 CPR with full 4× decode. Vendors mix the terms, so always confirm which number you're being quoted. **Do I need an absolute encoder, or is incremental plus homing good enough?** If you can tolerate a homing move at every power-up and there's no danger in moving the axis to find its index, incremental is cheaper and fine. If the axis holds a load against gravity, has hard limits you must respect immediately, or can't safely move to home (a loaded arm), use absolute. Most modern robot joints choose absolute for the safety and convenience of knowing pose at boot. **Why is my 16-bit encoder not giving me 16 bits of accuracy?** Because resolution and accuracy are different things. Sixteen bits means 65,536 distinguishable positions (~20 arc-sec each), but the *true angle* may be off by far more due to disc/pattern nonlinearity, interpolation error, eccentricity, and temperature. On a magnetic encoder, mounting eccentricity alone often dominates. The low-order bits are real for relative motion and velocity, but not for absolute pointing unless you've calibrated. **Single-turn vs multi-turn — which do I need?** Single-turn uniquely encodes position within one revolution; use it for direct-drive joints under one turn and for commutation. Multi-turn also counts how many full revolutions, which you need when the encoder sits before a gearbox (the motor spins many turns per joint move) or on a leadscrew. Choose battery-backed multi-turn for unlimited range with a maintenance cost, or geared/energy-harvesting multi-turn for battery-free power-down memory. **BiSS-C or EnDat — which should I pick for a new design?** BiSS-C if you want an open, royalty-light, CRC-protected, fast (to 10 MHz) protocol supported across RLS, iC-Haus, and most drive ICs — it's my default for new robotics. 
EnDat 2.2 if you're committing to Heidenhain encoders and want their integrated diagnostics (temperature, parameters) and ecosystem. Both are excellent; the choice is mostly about which encoder vendor and drive you're standardizing on. **Can I commutate a BLDC with just Hall sensors?** Yes, for six-step (trapezoidal) drive — three Halls give 60° electrical resolution, enough to start and run a BLDC, absolute at power-up. But for smooth FOC you need a continuous electrical angle, so you add a fine incremental encoder (plus a commutation reference) or use an absolute single-turn encoder. Halls alone produce torque ripple under FOC. See [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/). **Why do aerospace and EV traction systems still use resolvers in 2026?** Survival. A resolver is just windings and iron — no semiconductors in the sensor — so it runs from cryogenic to 200°C+, shrugs off shock, vibration, and radiation, and lasts decades. The cost is modest accuracy (±5–20 arc-min) and the need for an RDC chip and tuned excitation. When the environment would destroy an optical or silicon encoder, the resolver wins. **My encoder counts fine on the bench but drifts when the motor runs hard — what's wrong?** Almost always EMI on the encoder cable. The inverter's switching couples into a single-ended or poorly shielded encoder line and injects false edges. Fix: use differential RS-422 signaling, twisted shielded pairs, ground the shield at one end only, route the encoder cable away from and crossing perpendicular to the motor phase leads, and terminate the lines. Watch for illegal quadrature transitions in firmware as your early-warning flag. **Motor-side or load-side encoder for a geared joint?** Motor-side gives high effective resolution and easy commutation but is blind to gearbox backlash and compliance — it measures the motor, not the output. Load-side measures the true joint angle but puts gearbox compliance inside your loop. 
For precision arms, run both (dual-loop): motor-side for fast velocity/commutation, load-side absolute for true position. See [gearboxes: harmonic & cycloidal](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/). **How do I get better accuracy out of a cheap magnetic encoder?** Calibrate out the eccentricity. The dominant error is a once-per-turn sinusoid from the magnet not being centered on the rotation axis. Mount it concentrically and within the air-gap window, then sweep a full revolution against a reference and store a lookup table or harmonic-fit correction. This routinely cuts error 5–10×, taking ±0.3° down to ±0.03–0.05°. Many ICs and modules offer built-in self-calibration. **What sensing technology is most robust for a dirty, vibrating robot?** Inductive is my top pick: it tolerates dust, oil, vibration, and shock like a magnetic encoder, but it's immune to stray DC magnetic fields (your motor) and reaches better accuracy. Magnetic is the cheapest robust option if stray fields are managed. Capacitive is a good magnetic-free middle ground but dislikes humidity. Avoid optical in contaminated environments unless it's sealed. **Does encoder latency really matter for my control loop?** Yes, for fast inner loops. Position-read latency adds phase lag to the loop, which erodes phase margin and limits achievable bandwidth. Budget the read under ~10% of your fastest loop period — under ~5 µs for a 20 kHz current loop. BiSS-C at 10 MHz fits comfortably; a slow SSI poll over a long cable may not. On long cables, use the protocol's line-delay compensation. ## Changelog - **2026-06-06** — Initial publication. 
--- # Soft Robotics: The Ultimate Guide URL: https://blog.robo2u.com/posts/soft-robotics-ultimate-guide/ Published: 2026-06-05 Updated: 2026-06-20 Tags: soft-robotics, pneumatic-actuators, compliant-mechanisms, fluidic-elastomer, mckibben, fin-ray, soft-grippers, robotics-hardware, guide Reading time: 38 min > A 2026 working engineer's guide to soft robotics — fluidic elastomer and McKibben actuators, silicone fabrication, the fluidic control bottleneck, soft sensing, and where compliance actually beats rigid machines. Rigid robots are built from the same assumption as a machine tool: stiffness is good. You want links that don't bend, joints that don't backlash, and a controller that knows exactly where every link is at every instant. That assumption has built the entire industrial robot industry, and it works beautifully when the world is structured and the robot can be kept away from people and fragile things. Soft robotics throws that assumption out. A soft robot gets its motion not from rigid links pivoting about discrete joints, but from continuous deformation of compliant material — silicone that inflates, an elastomer muscle that contracts, a flexure that buckles in a useful direction. Stiffness is no longer the goal; it's a tunable parameter, and often you want very little of it. The payoff is everything rigid robots are bad at: touching a human safely, conforming to an unknown object, surviving an impact that would dent an aluminum link, squeezing a ripe tomato without bruising it. **The take**: Soft robotics is not a replacement for rigid robotics and never will be — it's a complement that wins decisively in a narrow but real set of jobs defined by contact, conformance, and fragility. 
The field's headline demos (octopus arms, growing vine robots, fully soft autonomous machines) oversell where it stands; the *commercially deployed* reality is much narrower and much more useful: soft and compliant grippers for food and fragile picks, and compliant actuators that make rigid robots safer. The hard, unglamorous bottleneck is not the soft body — silicone is cheap and molding is easy — it's the fluidic control hardware (valves, pumps, regulators) that keeps these machines tethered, slow, and bandwidth-limited. Whoever solves untethered, high-bandwidth fluidic control at low cost unlocks the field; until then, plan for a tether. Companion reading: [robot actuators](/posts/robot-actuators-ultimate-guide/), [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/), [robot sensors](/posts/robot-sensors-ultimate-guide/), [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/), and [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What soft robotics actually is](#what-it-is) 3. [Why compliance matters](#why-compliance) 4. [Actuation methods](#actuation) 5. [Materials & fabrication](#materials) 6. [The fluidic control hardware — the real bottleneck](#fluidic-control) 7. [Sensing in soft bodies](#sensing) 8. [Modeling & control](#modeling) 9. [Soft & compliant grippers](#grippers) 10. [Continuum, growing & vine robots](#continuum) 11. [Applications that actually pay](#applications) 12. [Honest limitations](#limitations) 13. [The hybrid rigid-soft future](#hybrid) 14. [Frequently asked questions](#faq) ## Key takeaways - Soft robotics gets motion from **continuous deformation of compliant material**, not rigid links and discrete joints. Compliance comes from the material (low-modulus elastomer), the structure (flexures, fin-ray), or both. A continuum body has, in principle, infinite degrees of freedom; in practice you control a handful. 
- The three reasons compliance wins: **safe contact** (a soft body can't deliver a high-force impact), **conformance** (it wraps an object's shape instead of needing to know it), and **robustness** (it survives crashes, overloads, and unstructured environments that wreck rigid machines). - **Pneumatic/fluidic actuation dominates** real soft robots. Fluidic elastomer actuators (PneuNets) bend by inflating asymmetric chambers; McKibben muscles (PAM) contract by ~20–35% when pressurized. Both are cheap, force-dense, and inherently compliant. - Chamber force is just `F = P·A` — pressure times projected area. That makes soft actuators easy to size for force and miserable to control for position, because the same pressure gives different displacement depending on load. - **Silicone is the workhorse material.** Ecoflex (Shore 00-10 to 00-50) for high-strain bending actuators, Dragon Skin (Shore 10A–30A) for tougher skins and grippers. Molding is the default fabrication; 3D printing and lost-wax casting handle complex internal channels. - The **fluidic control hardware is the bottleneck**, not the soft body. Solenoid valves, pumps, and regulators are bulky, power-hungry, and slow — which is why most soft robots are tethered to a benchtop pneumatic rig. Untethered soft robots remain mostly lab demos. - **Sensing inside a soft body is genuinely hard.** Stretchable resistive/capacitive sensors, liquid-metal strain gauges, and optical waveguides all exist, but proprioception — knowing the shape of a continuously deforming body — is nowhere near as solved as reading an encoder on a rigid joint. - **Control is hard for the same reason it's safe.** Infinite DoF, hysteresis, viscoelastic creep, and slow fluidic dynamics make accurate closed-loop control difficult. Constant-curvature models and FEM help; most deployed soft systems run open-loop or with simple pressure control and lean on mechanical compliance instead of precision. 
- **Soft and compliant grippers are the field's commercial success.** Fin-ray fingers (Festo FinGripper, many clones), Soft Robotics Inc mGrip silicone fingers, and granular jamming grippers handle food, produce, and fragile mixed objects where rigid jaws fail. - **Festo is the reference brand** for industrial-grade soft/compliant hardware: the fluidic muscle DMSP (a productized McKibben), the BionicSoftArm, the MultiChoiceGripper, and the original FinGripper based on the fin-ray effect. - **Honest limits**: low force vs. rigid actuators of equal size, low speed and bandwidth (fluidic dynamics), poor positional accuracy, fatigue/durability of elastomers, and the tether. Don't promise a soft robot will be fast, strong, and precise — pick one. - **The future is hybrid**, not all-soft. The winning architecture is a rigid robot (precise, strong, controllable) with soft end effectors and soft contact surfaces (safe, conformant) — exactly what's already shipping in food and logistics cells. ## What soft robotics actually is Start with a definition that's actually useful on the bench, not the textbook one. A soft robot is a machine whose **primary functional components are made of materials with a modulus comparable to soft biological tissue** — roughly 10⁴ to 10⁹ Pa, spanning silicone rubber up to soft plastics — as opposed to the 10⁹–10¹² Pa of metals and rigid engineering plastics. That's several orders of magnitude softer than a conventional robot link. The consequence is that the body itself deforms to produce or accommodate motion, instead of staying rigid while pin joints do all the moving. Compliance — the inverse of stiffness, measured in m/N or rad/N·m — can come from two places, and it's worth keeping them distinct because they fail and behave differently: - **Material compliance**: the bulk material is soft. Pressurize a silicone chamber and it balloons; the deformation *is* the motion. Fluidic elastomer actuators live here. 
- **Structural compliance**: the material can be stiff, but the *geometry* is arranged to flex in a useful direction. A fin-ray finger is made of fairly rigid polymer ribs, yet the structure as a whole conforms to a grasped object. Flexure hinges, compliant mechanisms, and notched continuum spines are structural. Most real designs use both. A fin-ray gripper (structural) with a silicone overmold (material) is a common combination. ### Continuum bodies and "infinite" DoF A rigid arm has a finite, countable number of degrees of freedom — six joints, six DoF. A continuum body has a backbone that curves continuously, so in principle its shape needs an infinite number of parameters to describe: every point along the body can be somewhere slightly different. > **Rule of thumb:** A continuum or soft body has theoretically infinite DoF but is *actuated* by only a few inputs (pressures, tendon tensions). The gap between configuration-space dimension and actuation-space dimension is exactly why these robots are underactuated, compliant, and hard to control precisely. In practice you discretize. A constant-curvature model treats a soft segment as a circular arc described by three numbers (curvature, bending plane angle, and length). Stack a few segments and you have a tractable model with maybe 6–12 parameters for the whole arm — close enough to control, far from the true infinite-dimensional reality. The error you accept in that approximation is the error you'll see at the tip. ### What it is *not* Soft robotics is not "robots with rubber covers." Bolting a foam bumper onto a rigid cobot makes it safer but doesn't make it soft in any functional sense — the motion still comes from rigid joints. It's also not the same as a series-elastic actuator, where a spring is placed in series with a stiff motor to add controlled compliance. 
SEA is a rigid-robot technique built on the same intuition (see [robot actuators](/posts/robot-actuators-ultimate-guide/)); a genuinely soft robot's body deforms as part of its primary function.

## Why compliance matters

Three properties fall out of softness more or less for free, and they map directly onto the jobs rigid robots are worst at.

### 1. Safe contact

Peak force in a collision is set by contact stiffness: the stiffer the contact, the shorter the distance and time over which the impact energy is absorbed. A rigid link hitting a hand transfers energy over a tiny deformation, so force spikes hard and fast. A soft body deforms over many millimeters, spreading the same momentum change over a longer time and larger area — peak force and pressure drop by orders of magnitude.

This is the same physics that the [collaborative robots](/posts/collaborative-robots-cobots-ultimate-guide/) world spends enormous effort engineering into rigid arms with torque sensing and speed limits. A soft body gets a lot of it for free, in the mechanics, with no sensor and no control loop in the path. That's not a small thing: passive safety that doesn't depend on software is the kind safety engineers actually trust.

### 2. Conformance

A rigid two-finger gripper has to *know* the object — its size, pose, and where to put the fingers — or it crushes, slips, or misses. A soft finger wraps. Pressurize a PneuNets finger against a bell pepper and it follows the pepper's contour, distributing contact over a large area at low pressure. You don't need a precise model of the object; the mechanics do the fitting.

> **Rule:** Compliance trades positional knowledge for mechanical adaptation. The less you know about the object, the more a soft, conformant gripper outperforms a precise rigid one — and vice versa.

This is why soft grippers dominate food and produce, where every object is a slightly different shape and the cost of a perception-and-planning pipeline to handle that variation is absurd compared to a finger that just conforms.

### 3. Robustness

Drop a rigid manipulator and you bend a link or strip a gearbox. Drop a silicone arm and it bounces. Soft bodies tolerate overload, impact, and unstructured environments — squeezing through a gap, getting stepped on, hitting a wall at speed — because the material absorbs and redistributes the energy instead of concentrating it at a joint. For search-and-rescue, exploration, and any environment you can't structure in advance, that robustness is the whole point.

The cost of all three benefits is the same thing: you gave up stiffness, and with it force capacity, speed, and positional accuracy. Hold that thought — it's the through-line of the entire field.

## Actuation methods

Actuation is where soft robotics gets real, because the body and the actuator are usually the same object. Here are the methods that matter, roughly in order of how much they're actually used. For the rigid-actuator counterparts, see [robot actuators](/posts/robot-actuators-ultimate-guide/).

### Pneumatic / fluidic elastomer actuators (PneuNets)

The workhorse of academic and demonstrator soft robotics. A **PneuNet** (pneumatic network) is a slab of silicone with a series of internal air chambers on one side and an inextensible (often paper- or fiber-reinforced) layer on the other. Inflate the chambers and they expand, but the strain-limiting layer can't stretch, so the whole structure curls toward the stiff side. Chain the chambers and you get a finger that wraps into a tight curl at modest pressure.

The Harvard group (George Whitesides, Rob Wood, and collaborators) turned this style into the canonical soft-robotics demos — the multigait quadruped, the soft tentacle gripper — and PneuNets remain the first thing most labs build. They run at low pressure (typically 10–50 kPa, i.e. 0.1–0.5 bar), bend a lot, and cost almost nothing in material.

The actuation physics is brutally simple.
The force a pressurized chamber exerts on its end wall is:

```
F = P · A

where
  F = force on the chamber wall   [N]
  P = gauge pressure              [Pa = N/m²]
  A = projected area of the wall  [m²]

Example: a PneuNet chamber wall 20 mm × 15 mm = 300 mm² = 3.0e-4 m²
  at P = 40 kPa = 40,000 Pa:
  F = 40,000 × 3.0e-4 = 12 N
```

That `F = P·A` is the entire reason soft actuators are easy to size for *force* and hard to control for *position*. Force depends only on pressure and area; displacement depends on pressure, geometry, material modulus, *and the load* — all coupled and nonlinear.

### McKibben muscles / pneumatic artificial muscles (PAM)

A McKibben muscle is an elastomer bladder inside a braided, helically-wound inextensible sleeve. Pressurize the bladder and it tries to expand radially; the braid converts that radial expansion into axial *contraction*. The muscle shortens and pulls — exactly like a biological muscle, which only pulls, never pushes.

This is the most mature soft-actuation technology by a wide margin, because **Festo productized it as the Fluidic Muscle DMSP**, available in nominal inner diameters of 10, 20, and 40 mm and lengths from ~40 mm up to several meters. Real numbers worth carrying around:

- Contraction: **up to about 25% of nominal length** (Festo DMSP rates ~25% max contraction).
- Force: a DMSP-20 (20 mm bore) delivers about **1,500 N** initial pull at 6 bar; a DMSP-40 reaches about **6,000 N**. Force is highest at full length and falls to zero near full contraction.
- Pressure range: typically **0–6 bar**, with **8 bar** as the absolute-maximum rating (gauge pressure, not absolute pressure).
- Power-to-weight: excellent — a DMSP-10 weighs tens of grams and pulls hundreds of newtons.

McKibben muscles are antagonistic by nature: like biceps/triceps, you pair them across a joint to get bidirectional motion and to set joint stiffness by co-contraction. They're the backbone of compliant exosuits, Festo BionicSoftArm-style pneumatic manipulators, and a lot of biomimetic legged-robot research.
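
The DMSP force figures above quadruple when the bore doubles, which is what the idealized static McKibben model predicts. A minimal Python sketch of that model follows, assuming the commonly cited ideal form (frictionless braid, thin bladder, no end-fitting effects). The 25° rest braid angle and the function names are illustrative assumptions, not Festo datasheet values.

```python
import math


def mckibben_force(p_gauge_pa, d0_m, contraction, theta0_deg=25.0):
    """Ideal static pull [N] of a McKibben muscle.

    F = (pi * D0^2 * P / 4) * (a * (1 - eps)^2 - b)
    with a = 3 / tan^2(theta0) and b = 1 / sin^2(theta0).
    theta0 is the rest braid angle measured from the muscle axis
    (25 degrees is an assumed, illustrative value).
    """
    theta0 = math.radians(theta0_deg)
    a = 3.0 / math.tan(theta0) ** 2
    b = 1.0 / math.sin(theta0) ** 2
    area = math.pi * d0_m ** 2 / 4.0
    return area * p_gauge_pa * (a * (1.0 - contraction) ** 2 - b)


def max_contraction(theta0_deg=25.0):
    """Contraction ratio at which the ideal model's force reaches zero."""
    theta0 = math.radians(theta0_deg)
    return 1.0 - 1.0 / (math.sqrt(3.0) * math.cos(theta0))


pressure = 6e5  # 6 bar gauge, in Pa
for bore_mm in (10, 20, 40):
    f0 = mckibben_force(pressure, bore_mm / 1000.0, 0.0)
    print(f"bore {bore_mm} mm: ideal pull at full length = {f0:.0f} N")

print(f"ideal zero-force contraction = {max_contraction():.2f}")
```

Treat the result as an upper bound. The ideal model ignores braid friction and end fittings, so its zero-force contraction comes out above the ~25% a real DMSP is rated for.
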
The contraction-vs-force relationship is the key design curve:

```
Approximate Gaylord model for an ideal McKibben muscle:

  F(P, ε) = (π · P · D0²) / (4·tan²θ0) · ( 3·(1 - ε)² - ... )   [simplified form]

where
  P  = gauge pressure                                 [Pa]
  D0 = muscle diameter at rest                        [m]
  θ0 = braid angle at rest, measured from the axis    [rad]
  ε  = contraction ratio (0 = full length)            [-]

Practical takeaway (what you actually use):
  F_max ∝ P · D0²        force scales with pressure and bore squared
  F(ε) decreases monotonically as contraction ε rises toward ε_max (~0.25)
  At ε = 0 (full length):  force is maximum
  At ε = ε_max:            force ≈ 0
```

You size the bore for peak force, the length for stroke (stroke ≈ 0.25 × length), and you accept that the force you actually get drops as the muscle shortens through its stroke.

### Tendon-driven (cable) soft actuators

Run a cable down a flexible backbone and pull it; the backbone bends toward the cable. This is how most continuum manipulators and a lot of robotic surgery tools work. Tendon drive keeps the heavy, dirty parts — motors — at the base, away from the soft tip, which is exactly what you want for a sterile surgical instrument or a long thin continuum arm.

Tendons give you cleaner force transmission than pneumatics (a cable tension is a cable tension), but routing friction, cable stretch, and backlash creep in as the body curves, and you need one motor per controlled DoF plus antagonists. They're rigid-actuator-driven soft *structures* — a useful hybrid.

### Shape memory alloy (SMA) — nitinol

Nitinol (nickel-titanium) contracts by a few percent when heated above its transition temperature, recovering a "remembered" shape; cool it and it relaxes. As an actuator it's silent, compact, and produces clean linear pull with no valves or compressors — attractive for small, untethered soft robots.

The catch is everything else. SMA is:

- **Slow to reset** — actuation is fast (resistive heating) but the return stroke waits for the wire to *cool*, so bandwidth is typically well under 1 Hz unless you actively cool it.
- **Energy-inefficient** — you're heating metal; efficiency is a few percent.
- **Low-strain** — usable recoverable strain is ~3–5%, so you need long wires or mechanical amplification for useful stroke. - **Fatigue-limited** — cycle life drops sharply at high strain. SMA earns its place in millimeter-scale robots, biomedical devices, and morphing structures where its silence and compactness outweigh its terrible bandwidth. ### Dielectric elastomer actuators (DEA) / electroactive polymers (EAP) A DEA is a thin elastomer film (often acrylic or silicone) coated on both faces with compliant electrodes — a soft capacitor. Apply a high voltage (several kV) and Maxwell stress squeezes the film thinner, so it expands in area. They're fast (hundreds of Hz possible), efficient, and produce large area strain (tens of percent), and they're nearly silent — the closest thing soft robotics has to an "artificial muscle" that's electric rather than fluidic. The blockers are equally real: **kilovolt drive electronics** are bulky and a safety headache, dielectric breakdown limits reliability, and forces are low compared to pneumatics for a given footprint. EAP is the perennial "five years away" technology — genuinely promising for haptics, soft pumps, and small actuators, still mostly out of production hardware in 2026. ### Hydraulic and electro-hydraulic Swap air for an incompressible liquid and you get much stiffer, more controllable actuation at the cost of weight, leaks, and a more complex fluid circuit. Hydraulic soft actuators (e.g. HASEL actuators — hydraulically amplified self-healing electrostatic actuators) combine an electrostatic drive with a liquid dielectric to get muscle-like performance with electric control. Promising in the lab; rare in the field. 
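
Stroke is worth checking before comparing methods, because contraction-type actuators differ by almost an order of magnitude in how much active length they need. Here is a rough Python sketch using the strain figures quoted in this section (about 25% for a McKibben muscle, 3–5% for SMA wire). The function name and the 10 mm target stroke are illustrative assumptions.

```python
def required_active_length(stroke_m, usable_strain):
    """Active actuator length [m] needed to produce a given stroke [m].

    usable_strain is the fractional contraction the actuator can deliver
    (e.g. 0.25 for ~25% contraction). Mechanical amplification is ignored.
    """
    if not 0.0 < usable_strain < 1.0:
        raise ValueError("usable_strain must be between 0 and 1")
    return stroke_m / usable_strain


# Strain figures taken from the text above; labels are for printing only.
USABLE_STRAIN = {
    "McKibben / PAM (~25%)": 0.25,
    "SMA wire (5%)": 0.05,
    "SMA wire (3%)": 0.03,
}

target_stroke_m = 0.010  # assumed 10 mm stroke for illustration

for name, strain in USABLE_STRAIN.items():
    length_mm = required_active_length(target_stroke_m, strain) * 1000.0
    print(f"{name}: {length_mm:.0f} mm of active length")
```

For the same 10 mm of travel, that works out to roughly 40 mm of muscle versus 200–330 mm of wire. This is why SMA designs lean on long wire runs or mechanical amplification.
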

### Actuation method comparison

| Method | Typical strain / stroke | Force density | Speed / bandwidth | Tether / drive | Controllability | Where it's used |
|---|---|---|---|---|---|---|
| Fluidic elastomer (PneuNets) | Large (high curvature) | Low–medium | Low (fluid dynamics) | Air line + valves | Poor (open-loop pressure) | Soft grippers, demos, fingers |
| McKibben / PAM (Festo DMSP) | ~25% contraction | **High** | Medium | Air line + valves | Medium (antagonistic) | Exosuits, soft arms, legged research |
| Tendon-driven | Set by routing | Medium–high | Medium–high | Motors at base | Good (motor-controlled) | Continuum/surgical, vine robots |
| SMA (nitinol) | 3–5% | Medium | **Very low** (cooling) | Electric (heat) | Poor (hysteresis) | Micro/biomedical, morphing |
| DEA / EAP | Tens of % area | Low | **High** | kV electronics | Medium | Haptics, soft pumps, research |
| Hydraulic / HASEL | Medium | High | Medium–high | Pump or kV | Medium | Lab; emerging |

> **Engineering reality:** If you're building a soft robot today and you don't have a specific reason not to, you're building pneumatic. Everything else is either a research bet (EAP, HASEL), a niche (SMA), or a rigid-actuator hybrid (tendon). Pneumatics are cheap, force-dense, inherently compliant, and well understood — at the cost of the tether.

## Materials & fabrication

The soft body is, honestly, the easy part. Silicone is cheap, forgiving, and you can cast usable actuators on a kitchen table. The art is in choosing the right durometer and getting clean internal channels.

### Silicone elastomers and durometer

Silicone is specified by **Shore hardness (durometer)** — Shore 00 for the softest gels, Shore A for firmer rubbers, Shore D for hard plastics. The two brand families that own soft robotics are Smooth-On's **Ecoflex** (very soft, high-elongation) and **Dragon Skin** (tougher, higher tear strength).

| Material | Shore hardness | Elongation at break (approx.) | Typical use in soft robotics |
|---|---|---|---|
| Ecoflex 00-10 | 00-10 (very soft) | ~800% | High-strain bending actuators, stretchable skins |
| Ecoflex 00-30 | 00-30 | ~900% | The default PneuNets actuator body |
| Ecoflex 00-50 | 00-50 | ~980% | Slightly firmer actuators, better shape hold |
| Dragon Skin 10 | 10A | ~1000% | Tougher actuators, gripper fingers |
| Dragon Skin 20/30 | 20A–30A | ~360–600% | Wear surfaces, structural skins, durable grippers |
| Sorta-Clear / Solaris | ~12A–40A | varies | Optically clear (for optical-waveguide sensing) |
| TPU (printed) | 60A–95A | 300–700% | 3D-printed bellows, fin-ray, semi-structural parts |

> **Durometer rule of thumb:** Softer = more strain, more conformance, lower force, worse fatigue and tear strength. Firmer = more force and durability, less compliance. Most bending actuators land at Ecoflex 00-30/00-50 for the active body; grippers that touch the world get a Dragon Skin or TPU skin where wear happens.

A key trick is **strain limiting**: cast a stiff, inextensible layer (paper, fabric, fiber, or just thicker silicone) on one face so inflation produces bending rather than uniform ballooning. The asymmetry between the stretchy face and the strain-limited face *is* the actuator.

### Molding vs. 3D printing

**Molding** is the default. You 3D-print or machine a multi-part mold, mix and degas the two-part silicone, pour, cure, and bond layers. It's cheap, reliable, and gives good material properties. The downsides: it's labor-intensive, multi-step, and complex internal channels mean complex multi-part molds and a lot of manual bonding (where leaks are born).

> **Where soft robots leak:** almost always at a bond line between molded layers. Minimize bonded interfaces, design generous bond flanges, and pressure-test every chamber before integration.

**Direct 3D printing** of soft parts is maturing fast.
You can print TPU bellows and fin-ray fingers on a standard FDM machine; you can print soft silicone-like resins on certain SLA/DLP and material-jetting machines. Printing wins when internal channel geometry is too complex to mold — you get the channels "for free" — but printed elastomers generally have worse fatigue, lower elongation, and layer-adhesion weaknesses compared to cast silicone. **Lost-wax (investment) casting** bridges the two: print or mold a wax core in the shape of the internal cavity, cast silicone around it, then melt the wax out. You get arbitrary single-piece internal channels with cast-silicone material properties and no bond lines. It's the go-to for complex monolithic actuators. ## The fluidic control hardware — the real bottleneck Here's the part the demo videos never show. The graceful silicone tentacle is connected, off-screen, to a workbench covered in solenoid valves, a regulator bank, a compressor or pump, pressure sensors, and a bundle of tubes. **The soft robot is the small, cheap, elegant part; the fluidic control stack is the big, expensive, ugly part — and it's why these machines are tethered.** To control a single pneumatic DoF you need, at minimum: - A **pressure source**: a compressor, a CO₂ cartridge, or a miniature pump. Compressors are heavy and noisy; cartridges run out; micro-pumps are weak. - A **regulator** to set or limit pressure. - **Valves** to route air: a solenoid valve to pressurize, another to exhaust (or a proportional valve to do both). Each chamber typically needs its own. - A **pressure sensor** if you want any feedback at all. - **Tubing and fittings**, which add dead volume and lag. Multiply by the number of independently controlled chambers — a five-fingered soft hand might have 5–15 — and the valve manifold dwarfs the hand it controls. 
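
To see how quickly the control stack outgrows the hand, the per-DoF list above can be turned into a rough part count. This is a minimal Python sketch, assuming one shared pressure source and regulator, one pressure sensor per chamber, and either two on/off solenoids or one proportional valve per chamber. The names and the sharing assumptions are illustrative, not a reference design.

```python
from dataclasses import dataclass


@dataclass
class FluidicStack:
    sources: int           # compressors, cartridges, or pumps
    regulators: int
    valves: int
    pressure_sensors: int
    tube_runs: int         # one line per independently driven chamber


def size_fluidic_stack(chambers, proportional=False, sense_pressure=True):
    """Rough component count for a pneumatic soft robot.

    Assumes (hypothetically) a single shared source and regulator.
    Binary control uses one fill and one exhaust solenoid per chamber;
    proportional control uses one valve per chamber.
    """
    if chambers < 1:
        raise ValueError("need at least one chamber")
    valves_per_chamber = 1 if proportional else 2
    return FluidicStack(
        sources=1,
        regulators=1,
        valves=chambers * valves_per_chamber,
        pressure_sensors=chambers if sense_pressure else 0,
        tube_runs=chambers,
    )


for n in (5, 15):
    print(n, "chambers ->", size_fluidic_stack(n))
```

Fifteen chambers on binary valves means thirty solenoids before any tubing, wiring, or driver electronics are counted.
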
### Why this caps performance Fluidic systems are slow because **air is compressible and channels have impedance.** Pressurizing a chamber means filling a volume through a finite-diameter tube; the time constant is set by tube resistance, chamber compliance, and dead volume. You can't snap a soft pneumatic actuator the way you can step a servo. Realistic bandwidths are **single-digit hertz** for most molded actuators — fine for a gripper that opens and closes a few times a second, hopeless for dynamic, high-frequency motion. > **Rule:** Pneumatic soft actuators are pressure sources, not position sources. You command pressure; displacement is whatever the load lets you have. Want position control? You're adding a sensor and fighting compressibility, hysteresis, and lag. Proportional valves and pressure-control loops improve things but cost money and add electronics. Binary (on/off) solenoid control is cheap and is what most production soft grippers use — pressurize to grip, exhaust to release, done. ### The untethered problem Cutting the tether means carrying the *entire* fluidic stack on board: pump, valves, power, and control. That's heavy and power-hungry, which fights the lightness that made the soft robot attractive. The field's untethered demos (combustion-powered jumpers, onboard-pump crawlers, the soft "Octobot" with a chemical fuel and microfluidic logic) are genuine achievements precisely because untethering is so hard — and none of them are practical machines yet. In 2026, **if you're deploying a soft robot, plan for a tether** or accept a tiny duty cycle from a cartridge. This is the single biggest reason soft robotics hasn't escaped the lab faster. The body scales beautifully; the plumbing doesn't. ## Sensing in soft bodies A rigid joint has an encoder and you know its angle to arc-seconds (see [robot sensors](/posts/robot-sensors-ultimate-guide/) and our [encoders guide](/posts/encoders-ultimate-guide/)). 
A soft body has a continuously deforming shape and no obvious place to mount a rigid sensor. **Proprioception — the robot knowing its own shape — is the hardest open problem in soft robotics**, and it's why so many soft systems run blind. The constraint is that any sensor embedded in a soft body must stretch with it without stiffening it or fatiguing. That rules out most conventional sensors and forces you into stretchable electronics. ### Stretchable sensor technologies - **Resistive (piezoresistive)**: conductive composites (carbon-filled elastomer) or liquid-metal channels (eutectic gallium-indium, eGaIn) whose resistance changes as they stretch. Liquid-metal microchannels are the most-cited soft strain gauge — they stretch with the body and don't fatigue like a solid trace. Drift and hysteresis are the recurring headaches. - **Capacitive**: a stretchable dielectric between compliant electrodes; capacitance changes with strain or with applied pressure. Capacitive sensors are more linear and less drifty than resistive, and dominate soft *tactile* sensing. They're sensitive to the electronics and to electromagnetic noise. - **Optical waveguides**: route light through a clear, stretchable waveguide; bending or stretching the waveguide changes the transmitted intensity. Immune to electrical noise, good for distributed sensing, but needs an optical source/detector and clear material. - **Pneumatic (self-sensing)**: measure the pressure and volume of the actuating air itself and infer shape. Cheap (the valve manifold already has pressure sensors) but only loosely coupled to actual shape, especially under external load. - **Magnetic**: embed small magnets and sense field changes with Hall sensors. Good for discrete deflection sensing, harder for distributed shape. 
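
For the liquid-metal strain gauges in the list above, an idealized model gives a feel for the signal. If the channel stretches uniformly and the liquid metal keeps constant volume, length grows by the stretch ratio while cross-section shrinks by the same factor, so resistance grows with the square of the stretch ratio. This minimal Python sketch works under those assumptions. It ignores the drift and hysteresis mentioned above, and the function names are hypothetical.

```python
import math


def resistance_ratio(strain):
    """R/R0 for a constant-volume conductor under uniform uniaxial strain.

    Length scales by (1 + strain) and cross-section by 1 / (1 + strain),
    so R = rho * L / A scales by (1 + strain)^2.
    """
    if strain <= -1.0:
        raise ValueError("strain must be greater than -1")
    return (1.0 + strain) ** 2


def strain_from_ratio(ratio):
    """Invert the idealized model: recover strain from a measured R/R0."""
    if ratio <= 0.0:
        raise ValueError("resistance ratio must be positive")
    return math.sqrt(ratio) - 1.0


for eps in (0.01, 0.10, 0.50, 1.00):
    print(f"strain {eps:.2f} -> R/R0 = {resistance_ratio(eps):.3f}")
```

At small strain the slope is about 2, a gauge factor similar to a metal foil gauge. The quadratic term only matters at the large strains soft bodies actually see.
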

### Why proprioception stays hard

Even with good local strain sensors, reconstructing the continuous 3D shape of a soft body from a few discrete measurements is an ill-posed inverse problem, made worse by hysteresis (the sensor reads differently loading vs. unloading), creep (the elastomer keeps deforming under constant load), and the simple fact that external contact changes the shape independently of the actuation.

The honest state of the art: you can sense *that* a soft gripper has gripped something, and roughly how hard, far more easily than you can know the exact pose of a soft arm's tip. That asymmetry shapes what soft robots are good for.

## Modeling & control

Everything that makes a soft robot safe makes it hard to model. Infinite DoF, nonlinear hyperelastic material, hysteresis, viscoelastic creep, and slow fluidic actuation all stack up. There's no soft-robot equivalent of the clean rigid-body kinematics in our [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/) — you trade exactness for tractable approximations.

### Constant-curvature (PCC) models

The dominant tractable model is **piecewise constant curvature (PCC)**: assume each soft segment bends into a circular arc of uniform curvature. Each segment is then described by curvature κ, bending-plane angle φ, and length L. This makes forward kinematics analytic and fast.

A useful first-order relation for a single bending fluidic actuator ties pressure to curvature:

```
Approximate constant-curvature bending model:

  κ ≈ k · P                 (bending curvature roughly proportional to pressure)
  θ = κ · L = k · P · L     (tip bend angle for an unloaded segment)

where
  κ = curvature              [1/m]
  P = gauge pressure         [Pa]
  L = segment length         [m]
  θ = total bend angle       [rad]
  k = a calibration constant lumping material modulus,
      wall geometry, and strain-limiting layer [1/(m·Pa)]

Reality check: k is only constant for small strain and zero external load.
Add a tip load or large deflection and the relationship goes nonlinear,
which is why you calibrate per actuator and re-check under load.
```

PCC works well when the body is slender and lightly loaded, and breaks down under heavy tip loads, gravity on a horizontal arm, or external contact — exactly the conditions soft robots operate in. It's a starting point, not a final answer.

### FEM and reduced-order models

For accuracy you go to **finite element modeling** of the hyperelastic material (Yeoh, Ogden, or Mooney-Rivlin constitutive models). FEM captures the real deformation but is far too slow for real-time control. The active research direction is **reduced-order models** — distilling an offline FEM into something that runs in a control loop (the SOFA framework and its soft-robotics plugin are the reference tools here). Learning-based models (train a neural net on the robot's own data) are increasingly common precisely because the physics is so hard to write down cleanly.

### Why closed-loop control is hard

Closed-loop control needs (a) a model and (b) state feedback. Soft robots are weak on both: the model is approximate and nonlinear, and the state (shape) is hard to measure. Add fluidic lag and hysteresis and you have a plant that's slow, uncertain, and underactuated.

> **Rule:** Most deployed soft systems don't do precise closed-loop shape control — they exploit mechanical compliance so they don't *have* to. The control problem you avoid by being soft is the same one you can't solve because you're soft. Lean into open-loop pressure control plus conformance, and reserve closed-loop ambitions for the lab.

## Soft & compliant grippers

This is where soft robotics actually makes money. Grasping is the field's commercial beachhead because the value proposition is concrete: handle variable, delicate, or food-grade objects that defeat rigid jaws and vacuum cups.
For the full gripper landscape, see [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/); here's the soft slice. ### Fin-ray fingers The **fin-ray effect** is a structural-compliance trick borrowed from fish-fin anatomy. A fin-ray finger is a triangular structure with two outer ribs joined by angled crossribs; push on the outer face and, counterintuitively, the finger bends *toward* the load and wraps around it. No actuation in the finger itself — it just deforms passively to conform. **Festo's FinGripper** was the productized original; the geometry is now everywhere (Festo, many third parties, and printed clones). Fin-ray fingers are cheap, printable in TPU, passively conformant, and food-compatible in the right materials. They're driven by an ordinary parallel gripper — the compliance is purely in the fingertips. For mixed produce and irregular parts they're often the single best price/performance choice in all of soft robotics. ### Soft fingers — silicone bellows (Soft Robotics Inc mGrip) **Soft Robotics Inc's mGrip** is the commercial face of fluidic-elastomer grippers. The fingers are molded silicone bellows actuators (PneuNets-style): pressurize and they curl inward to envelop an object, exhaust and they open. The system ships with a food-grade material set, a control box (the fluidic stack, sold as a unit), and modular finger arrangements. The pitch is exactly the conformance argument: pick a croissant, a chicken breast, a soft fruit, a bag of salad — variable, delicate, hard-to-model objects — at high cycle rates without bruising, and switch SKUs without retooling. This is the clearest example of soft robotics paying its way in production, primarily in food primary and secondary handling. ### Granular jamming grippers A different and clever mechanism: a flexible membrane filled with granular material (ground coffee is the textbook filler). 
Press the soft bag onto an object so it conforms, then **pull a vacuum** on the bag — the grains lock together (jamming transition) and the whole thing turns rigid, gripping by a mix of interlocking, friction, and suction. Release the vacuum and it goes soft again.

Granular jamming is brilliant for picking a wide range of object shapes with one universal gripper and no fingers. The limits: it needs a face to press against, it's slower (press-jam-lift-unjam cycle), grip force is modest, and dust/wear of the granular medium is a maintenance item.

### Soft gripper comparison

| Gripper type | Compliance source | Actuation | Best for | Weakness |
|---|---|---|---|---|
| Fin-ray (Festo FinGripper) | Structural | External parallel gripper | Irregular/produce, cheap conformance | Limited grip force, single bend plane |
| Silicone bellows (mGrip, PneuNets) | Material | Pneumatic per finger | Delicate food, variable SKUs | Tether/valve box, fatigue, speed |
| Granular jamming | Material + vacuum | Vacuum | Universal shape, single gripper | Needs press surface, slow, modest force |
| Festo MultiChoiceGripper | Structural (reconfigurable) | Pneumatic | Switching grasp modes (parallel/centric) | Complexity, industrial-research niche |
| Tendon soft fingers | Hybrid | Tendon/motor | Dexterity, anthropomorphic hands | Routing friction, cost, control |

Note the **Festo MultiChoiceGripper**: a bionic design (inspired by the human hand) whose fingers can be reconfigured between parallel and centric grasping modes — a nice illustration of structural compliance plus mode-switching, and a reminder that Festo treats these bionic projects as technology showcases that feed into industrial products like the DMSP muscle and FinGripper.

## Continuum, growing & vine robots

Beyond grippers, the soft-body idea scales into whole manipulators and locomotors.
### Continuum manipulators A continuum arm has a slender, continuously bending backbone — think elephant trunk or octopus arm — actuated by tendons, pneumatics, or both along its length. **Festo's BionicSoftArm** is the flagship industrial-grade example: a modular pneumatic continuum manipulator built from bellows segments, lightweight and inherently compliant, pitched for safe human-robot collaboration and for reaching into cluttered or constrained spaces a rigid arm can't navigate. It's a technology demonstrator, but it's the cleanest picture of where a soft manipulator could sit alongside the rigid arms in our [industrial robot arms guide](/posts/industrial-robot-arms-ultimate-guide/). Continuum arms shine at **reach into clutter** — inspecting inside a jet engine, navigating around obstacles, working close to people — and struggle at everything requiring stiffness or precision at the tip. They're the geometric opposite of the rigid arm's strength. ### Growing / vine robots The most genuinely novel soft-robot architecture is the **growing (vine) robot**: a thin-walled inverted tube that extends by everting — turning itself inside out — from the tip as internal pressure pushes new material out the front. Because growth happens only at the tip, the body doesn't drag against the environment as it advances, so a vine robot can snake through rubble, around corners, and into pipes with almost no friction along its length. Vine robots are a real and active area (the Stanford/Okamura line of work is the reference) with concrete uses in **search-and-rescue** (threading into collapsed structures), **medical** (steerable catheters/endoscopes), and inspection. They're still mostly research, but the everting mechanism is one of the few soft-robot ideas with no rigid-robot analog at all — which is exactly why it's interesting. ## Applications that actually pay Separate the hype from the deployed. 
Here's where soft robotics earns money or is close to it, roughly in order of maturity. ### Food and produce handling — deployed The clear winner. Variable, delicate, hard-to-model objects (bakery, meat, produce, confectionery) at high cycle rates, with food-grade material requirements. Soft silicone fingers (mGrip) and fin-ray grippers conform to each item without bruising and switch products without retooling. This is the soft-robotics business case that already works at scale. ### Fragile and mixed-SKU pick — deployed / scaling E-commerce and logistics handle vast catalogs of objects with unknown, varied shapes. Soft and adaptive grippers (often hybrid with vacuum) tolerate the variability better than rigid jaws. Granular jammers and soft fingers show up in bin-picking and order fulfillment where one gripper must handle many shapes. ### Medical and surgical — scaling Compliance is intrinsically valuable inside a body: a soft or continuum instrument is gentler on tissue and can navigate anatomy a rigid tool can't. Tendon-driven continuum tools dominate minimally-invasive surgery; soft and steerable catheters, endoscopes, and capsule-style devices are an active and well-funded area. Sterility favors tendon drive (motors stay outside the patient). ### Wearables and exosuits — scaling Soft exosuits use textile-and-cable or pneumatic (McKibben/DMSP) actuation to assist human motion without a rigid exoskeleton's bulk and joint-alignment problems. The Harvard/Wyss soft exosuit line is the reference; assistance for walking, load carriage, and rehabilitation is the target. Compliance here is doubly valuable — safe against the body and adaptable to the wearer. ### Search, rescue, and inspection — emerging Vine/growing robots and soft crawlers for unstructured, fragile, or confined environments. Robustness and conformance are the selling points; the tether and control immaturity keep most of this in the field-trial stage. 
### Reality filter > **Rule:** If the job is defined by *contact, conformance, or fragility*, soft is a serious candidate. If it's defined by *force, speed, or precision*, soft is the wrong tool — use a rigid robot, possibly with a soft end effector. Most "soft robotics will replace X" claims fail this test. ## Honest limitations Every benefit of softness has a matching cost. Sell the costs as hard as the benefits or you'll over-promise. ### Force For a given size, a soft actuator delivers less force than a rigid one, and the force is load-dependent and falls through the stroke (recall `F = P·A` and the McKibben force-vs-contraction curve). McKibben muscles are the exception — they're genuinely force-dense — but most molded fluidic actuators are weak. If you need high, repeatable force, soft is fighting uphill. ### Speed and bandwidth Fluidic dynamics cap pneumatic soft actuators at single-digit hertz for most designs. SMA is worse (cooling-limited). Only DEA/EAP is intrinsically fast, and it's not in production. Don't design a dynamic, high-frequency task around a fluidic soft actuator. ### Positional accuracy Hysteresis, creep, compressibility, and infinite-DoF underactuation mean soft robots are imprecise. You can get a soft arm roughly where you want it; you can't get it there to a tenth of a millimeter repeatably without heroic sensing and control. Accuracy is the price of compliance. ### Durability and fatigue Elastomers fatigue, tear, abrade, and creep. Bond lines leak. UV, ozone, oils, and cleaning chemicals degrade silicone over time. Cycle life is improving but a soft actuator under high strain has a finite, often modest, fatigue life — and replacement is a recurring cost. Specify the chemical and wear environment up front; it kills more soft grippers than overload does. ### Control and the tether The control problem is hard (above), and the fluidic-control bottleneck keeps most soft robots tethered to a benchtop valve-and-pump rig. 
Until onboard fluidic control gets small, cheap, and powerful, "untethered soft robot" mostly means "research paper."

### Soft vs. rigid tradeoffs

| Dimension | Rigid robot | Soft robot |
|---|---|---|
| Positional accuracy | Excellent (encoder + stiff link) | Poor (hysteresis, creep, infinite DoF) |
| Force / payload (per size) | High | Low–medium (PAM excepted) |
| Speed / bandwidth | High | Low (fluidic), very low (SMA) |
| Safety in contact | Engineered (sensors + control) | Intrinsic (passive, mechanical) |
| Conformance to objects | Poor (needs a model) | Excellent (mechanical fitting) |
| Robustness to impact/overload | Low (dents, strips gears) | High (absorbs, bounces) |
| Modeling & control | Mature, exact | Immature, approximate |
| Tether / autonomy | Cabled but standard | Usually tethered (fluidic stack) |
| Cost of body | Medium–high | Low (silicone, molding) |
| Cost of control hardware | Medium | High (valves, pumps, sensors) |

The table is the whole argument in one place: soft and rigid are complementary, with almost no dimension where one is strictly better. You choose by what the job rewards.

## The hybrid rigid-soft future

The all-soft autonomous robot is a beautiful research goal and a poor product strategy. The architecture that actually ships, and will keep shipping, is **hybrid**: a rigid robot for the parts that need precision, force, and controllability, with soft components where contact, conformance, and safety matter.

You can already see it everywhere:

- A rigid six-axis arm (precise positioning, payload, mature control) with a **soft gripper** (mGrip fingers, fin-ray) at the flange — precise transport, conformant grasp.
- A rigid cobot with **soft skins** and compliant covers for passive safety on top of its torque-sensing — see [collaborative robots](/posts/collaborative-robots-cobots-ultimate-guide/).
- A [humanoid](/posts/humanoid-robot-hardware-ultimate-guide/) with rigid limbs but compliant, soft-skinned fingertips and tactile pads where it touches the world. - Festo's own product logic: bionic soft demonstrators (BionicSoftArm, MultiChoiceGripper) feeding compliant components into otherwise rigid pneumatic automation. The reason hybrid wins is structural, not fashionable. The dimensions soft is good at (safety, conformance, robustness) and the ones rigid is good at (accuracy, force, control) barely overlap — so combining them is nearly free of tradeoff at the system level. You put softness exactly where contact happens and stiffness everywhere else. > **Final rule:** Don't ask "soft or rigid?" Ask "where in this machine does compliance pay, and where does it cost?" The answer is almost always *soft at the contact surface, rigid in the structure* — which is exactly what a human arm with a soft hand already is. What would change this calculus is a breakthrough in the bottleneck: small, cheap, high-bandwidth, untethered fluidic control, or a production-grade electric soft actuator (EAP/HASEL maturing out of the lab). If either lands, the soft fraction of the hybrid grows. Until then — and in 2026 we are firmly "until then" — bet on hybrid, deploy soft where it conforms and protects, and keep the tether budget in your plan. ## Frequently asked questions **What exactly makes a robot "soft"?** Its primary functional components are made of low-modulus material (roughly 10⁴–10⁹ Pa, silicone to soft plastic) so the body deforms to produce or accommodate motion, instead of rigid links pivoting at discrete joints. Compliance can come from the material, the structure (e.g. fin-ray), or both. A rigid robot with a foam cover is not a soft robot. **Why are most soft robots pneumatic?** Because pneumatics are cheap, force-dense, and inherently compliant, and air is easy to source. Fluidic elastomer actuators (PneuNets) and McKibben muscles both run on air. 
The downside — the bulky valve-and-pump control stack — is the price, and it's the reason soft robots are usually tethered. **What's the difference between a PneuNet and a McKibben muscle?** A PneuNet is a molded elastomer with internal chambers and a strain-limiting layer; inflating it makes it *bend*. A McKibben muscle (Festo Fluidic Muscle DMSP) is a bladder in a braided sleeve; pressurizing it makes it *contract* axially, like a biological muscle. PneuNets bend a lot at low force; McKibbens contract ~25% at high force. **How much force can a soft actuator produce?** Hugely variable. A small PneuNet finger exerts a few newtons (`F = P·A` at tens of kPa). A Festo DMSP-20 McKibben muscle pulls on the order of ~1,500 N at 6 bar, and a DMSP-40 reaches roughly ~6,000 N. Force in soft actuators is load-dependent and usually drops through the stroke. **Why can't soft robots move fast?** Fluidic actuation is bandwidth-limited: filling and emptying compliant chambers through finite-diameter tubing is slow, so most pneumatic soft actuators top out at single-digit hertz. SMA is even slower (cooling-limited). Only dielectric-elastomer actuators are intrinsically fast, and they're not yet production hardware. **What silicone should I use?** For high-strain bending actuators, Ecoflex 00-30 or 00-50 is the default. For tougher gripper fingers and wear surfaces, Dragon Skin 10A–30A. For optical-waveguide sensing you want an optically clear silicone. Pick durometer by the strain/force/durability tradeoff: softer bends more and lasts less. **How do you sense the shape of a soft robot?** With stretchable sensors — resistive (carbon composite, liquid-metal eGaIn channels), capacitive, optical waveguides, magnetic, or by self-sensing the actuating air pressure. None of them gives clean, drift-free shape data the way an encoder gives a joint angle, which is why proprioception is the field's hardest sensing problem. **Is closed-loop control of soft robots solved?** No. 
The model is approximate and nonlinear (constant-curvature is a starting point; FEM is accurate but too slow for real time), the state is hard to measure, and fluidic dynamics add lag and hysteresis. Most deployed soft systems run open-loop pressure control and rely on mechanical compliance instead of precise feedback. **What is the fin-ray effect?** A structural-compliance trick from fish-fin anatomy: a triangular finger with angled crossribs bends *toward* an applied load and wraps around it, with no actuation in the finger itself. Festo's FinGripper productized it; it's now a cheap, printable, food-friendly gripper finger driven by an ordinary parallel gripper. **Where is soft robotics actually deployed today?** Mostly in grasping: food and produce handling (Soft Robotics Inc mGrip, fin-ray grippers) and fragile/mixed-SKU pick in logistics. Medical/surgical continuum tools and soft exosuits are scaling. Whole-body soft robots, growing/vine robots, and untethered soft machines are still largely research. **Why are soft robots tethered?** Because the fluidic control hardware — pump/compressor, valves, regulators, sensors — is bulky and power-hungry, so it stays on a benchtop and air is piped to the robot. Putting the whole stack on board sacrifices the lightness that made the robot soft in the first place. Onboard fluidic control is the field's key open hardware problem. **Will soft robots replace rigid industrial robots?** No. They're complementary. Soft wins on contact, conformance, and fragility; rigid wins on force, speed, and precision — and those barely overlap. The durable architecture is hybrid: a rigid robot with soft end effectors and soft contact surfaces, which is exactly what's already shipping in food and logistics cells. ## Changelog - **2026-06-05** — Initial publication. 
--- # Robot Sensors: IMUs, Force/Torque & Proprioception — The Ultimate Guide URL: https://blog.robo2u.com/posts/robot-sensors-ultimate-guide/ Published: 2026-06-04 Updated: 2026-06-20 Tags: robot-sensors, imu, force-torque-sensor, tactile-sensors, proprioception, load-cell, sensor-fusion, robotics-hardware, guide Reading time: 35 min > A deep, practical guide to a robot's self-sensing stack: MEMS IMUs (bias, ARW, Allan variance), 6-axis force/torque sensors, current-based torque estimation, tactile/contact skin, load cells, ToF, and the sensor fusion that ties them together. A robot that cannot sense itself is a puppet on an open-loop string. Before a machine can navigate a room or grasp an object, it has to answer a more basic set of questions: which way is down, how fast am I rotating, where are my joints, and is something pushing back on me right now? Those questions are answered by the proprioceptive and contact sensing stack — the IMUs, encoders, force/torque sensors, current sensors, and tactile skins that let a robot model its own body and its physical contact with the world. This guide is about that inward- and contact-facing layer of sensing. It is deliberately *not* about cameras and LiDAR — those exteroceptive sensors get their own treatment in the [LiDAR & depth cameras guide](/posts/lidar-depth-cameras-ultimate-guide/). Here we go deep on inertial measurement, force and torque, tactile contact, and the short-range proximity sensing a robot uses to feel its immediate surroundings. We will derive the noise terms that actually matter on an IMU datasheet, explain why current-based torque estimation is the quiet workhorse of every cobot, and get concrete about real parts: Bosch BMI and BNO IMUs, ATI and Robotiq and Bota Systems force/torque sensors, TE and Honeywell load cells, ST VL53 ToF rangers, and tactile systems from GelSight and SynTouch. 
**The take**: exteroception gets the headlines, but proprioception and contact sensing are what make a robot *controllable*. A $4 MEMS IMU and a clean motor-current estimate do more for stability and safe contact than a $4,000 LiDAR does — and the hardest problems here are not the transducers but the noise, drift, calibration, timing, and fusion that turn raw counts into a trustworthy state estimate. Get the sensing stack right and your control loops feel telepathic; get it wrong and no amount of clever planning rescues a robot that does not know where its own hand is. Companion reading: [rotary encoders](/posts/encoders-ultimate-guide/), [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/), [end-effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/), and [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [The sensing stack: proprioception vs exteroception](#stack) 3. [IMUs deep-dive: accelerometers, gyros, magnetometers](#imu) 4. [IMU sensor fusion: filters, drift, and the yaw problem](#imu-fusion) 5. [Encoders & joint position as proprioception](#encoders) 6. [Force/torque sensing: 6-axis wrist sensors](#ft) 7. [Joint torque and current-based torque estimation](#joint-torque) 8. [Tactile & contact sensors](#tactile) 9. [Load cells, pressure, current, temperature, and the limit switch](#other) 10. [Range & proximity for self and near-field](#proximity) 11. [Sensor specs that matter and reading a datasheet](#specs) 12. [Sensor fusion & state estimation overview](#fusion) 13. [Selecting & integrating sensors](#selecting) 14. [Frequently asked questions](#faq) ## Key takeaways - **Proprioception** (the robot sensing its own body — joint angles, body attitude, joint torques) and **contact sensing** (force/torque, tactile) are distinct from **exteroception** (vision, LiDAR, depth). 
This guide covers the first two; vision lives in the [LiDAR & depth guide](/posts/lidar-depth-cameras-ultimate-guide/). - A **MEMS IMU** combines a 3-axis accelerometer and 3-axis gyroscope (6-axis); add a magnetometer for 9-axis. The accelerometer gives you a long-term gravity reference; the gyro gives clean short-term rotation rate; the mag gives an absolute heading — and each has a failure mode the others cover. - The IMU specs that decide your result are **noise density** (µg/√Hz, °/s/√Hz), **angle random walk** (ARW, °/√h), **bias instability** (°/h), and **bias repeatability**. Allan variance is how you read all of them off one log. - **Gyro integration drifts** because bias is integrated into a growing angle error; the accelerometer corrects roll and pitch against gravity, but **yaw has no gravity reference** — without a magnetometer or vision, heading drifts unbounded. - **Complementary filters** (Mahony, Madgwick) are cheap and excellent for attitude on small robots; **Kalman/EKF** estimators win when you must fuse heterogeneous, time-stamped sensors and want a covariance you can trust. Use the simplest one that meets spec. - **Joint position** is proprioception too — usually an encoder per joint. For depth on encoders see the [encoders guide](/posts/encoders-ultimate-guide/); here we treat them as one input to the state estimate. - **6-axis force/torque sensors** (ATI, Robotiq FT 300, Bota Systems) measure Fx/Fy/Fz and Tx/Ty/Tz at the wrist via strain-gauge or capacitive bridges. The numbers that bite you are **crosstalk**, **overload rating**, and **thermal/zero drift**, not the headline full-scale range. - **Current-based torque estimation** — inferring joint torque from motor phase current via `τ ≈ Kt · I` — is the trick that makes most cobots force-aware without a torque sensor per joint. 
It is cheap and fast but corrupted by friction, gear losses, and Kt variation; true joint-torque sensors (strain gauges on the output) are more accurate and far more expensive. - **Tactile sensors** for grippers come in capacitive, resistive (FSR), barometric (MEMS pressure under elastomer), and optical (GelSight) flavors. Optical tactile gives the richest data (sub-millimeter geometry, slip, shear) at the cost of a camera, latency, and bulk. - **ToF rangers** (ST VL53L series) give absolute distance from ~1 cm to ~4 m at low cost; ultrasonic handles acoustically reflective targets vision misses; inductive/capacitive proximity switches are the rugged binary workhorses of industrial cells. - **Timing and synchronization** matter as much as the transducer. Fusing a 1 kHz IMU with a 30 Hz camera or a CAN-bus torque reading demands timestamps and an understanding of latency; a 5 ms timing error on a 1 kHz balance loop is a fall. - Pick sensors by **range, resolution, bandwidth, noise/drift, latency, and interface (SPI/I²C/CAN/EtherCAT)** — and budget calibration and mounting as first-class engineering, not afterthoughts. ## The sensing stack: proprioception vs exteroception Every robot's sensing splits cleanly into two families, and confusing them is the source of a lot of bad architecture decisions. **Proprioception** is the robot sensing *itself*: the angles of its joints, the attitude and angular rate of its body, the torques in its drivetrain, the temperature of its motors. The word is borrowed from biology — your proprioceptive sense is how you know where your hand is with your eyes closed. A robot's proprioception comes from encoders, IMUs, joint torque sensors, and motor current. **Exteroception** is the robot sensing the *world*: cameras, LiDAR, depth sensors, microphones. This is how a robot perceives objects, free space, and other agents. It is covered in the [LiDAR & depth cameras guide](/posts/lidar-depth-cameras-ultimate-guide/) and is out of scope here. 
Sitting at the boundary is **contact sensing** — force/torque sensors and tactile skin. Contact sensing is technically exteroceptive (it measures the world pushing on the robot) but it is so tightly coupled to manipulation control and so similar in character to proprioception (high rate, on-body, fused into the control loop) that it belongs in this guide alongside IMUs and torque sensing. > **Rule of thumb**: proprioception keeps the robot *stable and safe*; exteroception lets it be *useful*. You can build a robot that balances and complies with zero cameras. You cannot build one that does anything intelligent with the world without exteroception. Both layers matter; this guide is the first. ### What a robot must measure about itself Strip a mobile manipulator or a legged robot down to its control needs and the proprioceptive shopping list is short and non-negotiable: - **Body attitude** (roll, pitch, yaw) and **angular rate** — from an IMU. Required for any balancing or flying machine; useful for everything. - **Joint positions** — one encoder per actuated joint. Required for any articulated arm or leg. - **Joint velocities** — usually differentiated from position, sometimes measured directly. - **Joint torques or contact forces** — from current estimation, joint torque sensors, or a wrist F/T sensor. Required for compliant control, force tasks, and collision detection. - **Motor and electronics temperatures** — for thermal protection and I²t modeling (see the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/)). The rest of this guide walks each of these, plus the contact and proximity sensing that rounds out the picture, then ties them together with state estimation. ## IMUs deep-dive: accelerometers, gyros, magnetometers The Inertial Measurement Unit is the single most important proprioceptive sensor on any robot that moves its whole body — a drone, a legged robot, a humanoid, a balancing platform. 
It answers "which way is down" and "how fast am I rotating," and it does so at hundreds to thousands of hertz with no dependence on the environment. ### The three transducers A modern IMU packs up to three sensor types into one MEMS die: - **3-axis accelerometer** — measures specific force (acceleration minus gravity) along three orthogonal axes, in g or m/s². At rest it reads the 1 g gravity vector, which makes it a tilt sensor: knowing where "down" points fixes roll and pitch. It is noisy and picks up every vibration, but it does *not* drift — its long-term average is anchored to gravity. - **3-axis gyroscope** — measures angular rate (°/s or rad/s) about three axes. Integrate rate over time and you get angle. Gyros are clean and fast in the short term but their bias integrates into unbounded angle drift. - **3-axis magnetometer** — measures the local magnetic field (in µT or gauss), giving an absolute heading reference like a compass. Indispensable for yaw, but easily corrupted by motors, currents, and ferrous structure. A **6-axis IMU** is accel + gyro. A **9-axis IMU** (sometimes called an AHRS-grade or MARG sensor) adds the magnetometer. The accel and gyro are complementary by design: the gyro is trustworthy short-term, the accel trustworthy long-term, and fusing them (next section) gives a drift-free attitude in roll and pitch. ### MEMS, and how these things actually work Nearly every robot IMU is **MEMS** (micro-electro-mechanical systems): tiny silicon structures etched on a chip. A MEMS accelerometer is a proof mass on silicon springs whose deflection changes a capacitance; a MEMS gyro is a vibrating mass whose Coriolis-induced lateral motion is sensed capacitively when the chip rotates. The whole thing is the size of a grain of rice and costs a few dollars. The trade is precision. 
MEMS IMUs are cheap, small, low-power, and rugged, but orders of magnitude less stable than the **fiber-optic gyros (FOG)** and **ring-laser gyros (RLG)** used in aircraft and missiles. A navigation-grade FOG can hold bias to better than 0.01 °/h, and tactical-grade units sit roughly in the 0.1–10 °/h band; a commodity MEMS gyro might be 10–100 °/h. For robotics, MEMS is almost always the right answer — you fuse it with encoders and vision rather than paying for a $30,000 navigation IMU.

### The Bosch lineup, concretely

Two product families dominate robotics:

- **Bosch BMI series** (e.g. **BMI088**, **BMI270**, **BMI323**) — raw 6-axis accel+gyro parts. The BMI088 is a favorite on drones and robot flight controllers: it is specified for high vibration, with a gyro noise density around **0.014 °/s/√Hz** and an accel noise density around **175 µg/√Hz**. You run your own fusion on the host.
- **Bosch BNO055 / BNO085 (BNO08x)** — "smart" 9-axis sensors with an on-chip processor running the fusion (Bosch's BSX / Hillcrest's SH-2 algorithms). They output a fused quaternion directly. Convenient when you do not want to write a filter, at the cost of being a black box you cannot fully tune.

Other common parts: the **InvenSense/TDK ICM-20948** and **ICM-42688** (the 42688 is a low-noise 6-axis part popular on newer flight controllers), and the **Analog Devices ADIS16xxx** industrial IMUs (e.g. ADIS16505) when you need calibrated, tactical-grade performance in a module.

> **Rule of thumb**: if you are writing the control loop, buy a raw 6-axis part (BMI088, ICM-42688) and run your own fusion — you keep timing control and tuning. Reach for a BNO08x only when you want attitude with zero filter code and can live with a fixed output rate.

### The error terms that actually matter

Here is where datasheets earn their keep. The headline "±2000 °/s range, 16-bit" tells you almost nothing about whether the IMU will drift your robot into a wall.
These seven terms do:

| Spec | Units | What it means | Why you care |
|---|---|---|---|
| **Noise density** | °/s/√Hz (gyro), µg/√Hz (accel) | White noise per √bandwidth | Sets the noise floor; multiply by √bandwidth for RMS noise at your rate |
| **Angle Random Walk (ARW)** | °/√h | How fast white-noise-driven angle error grows | The unavoidable short-term integration error of the gyro |
| **Velocity Random Walk (VRW)** | (m/s)/√h | Accel equivalent of ARW | Velocity error growth from accel integration, which integrates again into position error |
| **Bias instability** | °/h (gyro), µg (accel) | The floor of slow bias drift (flicker noise) | The best stability you can get even after calibration — the bottom of the Allan curve |
| **Bias repeatability / turn-on bias** | °/s, mg | How much bias changes run-to-run | Forces a re-zero at each startup; affects how long you must hold still |
| **Scale factor error** | ppm or % | Gain error of the transducer | Multiplies with the true rate; matters at high rates/accelerations |
| **Cross-axis sensitivity** | % | Leakage between axes from imperfect alignment | Couples motion on one axis into another; calibratable |

**Noise density to RMS noise**: if a gyro is rated 0.01 °/s/√Hz and you sample at a bandwidth of 100 Hz, the RMS angular-rate noise is roughly `0.01 × √100 = 0.1 °/s`. Cut your bandwidth and you cut noise — at the cost of latency.

**ARW** is the term that tells you how badly the gyro drifts in the short term. A gyro with ARW of 0.3 °/√h accumulates about 0.3° of angle uncertainty after one hour from white noise alone. The more useful reading for robotics is the per-second growth: that same gyro drifts on the order of 0.005 °/√s (0.3 / √3600), or roughly 0.04° (1σ) over a minute-long unaided integration.

### Allan variance: reading all of this off one log

The **Allan variance** (or its square root, Allan deviation) is the standard tool for separating an IMU's noise terms.
You log the gyro at rest for a long time (hours), then plot the Allan deviation against averaging time τ on a log-log scale. The curve has characteristic slopes:

- A **−1/2 slope** at short τ → **angle random walk** (white noise). Read ARW where this line crosses τ = 1 s (or 1 h, by convention).
- A **flat minimum** → **bias instability**. The lowest point of the curve is the best bias stability you can hope for.
- A **+1/2 slope** at long τ → **rate random walk** (the bias itself drifts).

```text
Allan deviation σ(τ), log-log:

  σ │ \                               /
    │  \   slope -1/2                /   slope +1/2
    │   \  (random walk)            /    (rate random walk)
    │    \____                     /
    │         \___            ____/
    │             \__________/
    │                  ^ bias instability (flat minimum)
    └────────────────────────────────── τ (averaging time)
```

The practical workflow: log your specific IMU on your specific board (vibration and temperature change everything), compute the Allan deviation, and pull ARW and bias instability from the curve. Those numbers feed directly into your filter's process-noise tuning. This is one of the few places where a datasheet number is no substitute for measuring your own hardware.

## IMU sensor fusion: filters, drift, and the yaw problem

A raw IMU is useless until you fuse its channels into an attitude estimate. The fusion problem is specific and well understood: the gyro is trustworthy over short intervals but drifts; the accelerometer is trustworthy over long intervals (gravity) but is noisy and corrupted by linear acceleration. Combine them so each covers the other's weakness.

### The complementary filter

The cheapest good fusion is the **complementary filter**. It high-pass-filters the integrated gyro angle (keeping its clean short-term behavior, rejecting its slow drift) and low-pass-filters the accelerometer-derived angle (keeping its drift-free long-term behavior, rejecting its noise), then sums them.
In one line:

```text
# Complementary filter for a single tilt axis (per timestep dt):
#   theta_gyro  = previous angle + gyro_rate * dt   (integrate gyro)
#   theta_accel = atan2(accel_y, accel_z)           (gravity-derived tilt)
alpha = tau / (tau + dt)        # tau = filter time constant, ~0.5-2 s
theta = alpha * (theta + gyro_rate * dt) + (1 - alpha) * theta_accel
# alpha ~ 0.98 means "trust the gyro for fast motion,
# slowly pull toward the accelerometer for the DC truth."
```

A complementary filter is a handful of lines, runs at any rate, and is genuinely excellent for roll/pitch attitude on drones and small robots. Its limits: it assumes the accelerometer reads pure gravity, so high linear acceleration (a hard maneuver) temporarily corrupts the correction. Tune `alpha` higher to trust the gyro more during dynamics.

### Mahony and Madgwick

**Mahony** and **Madgwick** filters are the production-grade complementary filters used across the drone and robotics world. Both fuse 6- or 9-axis data into a quaternion. **Mahony** uses a PI controller on the gravity/magnetic error to drive the gyro-bias estimate; **Madgwick** uses a gradient-descent step to align the predicted gravity (and magnetic) vector with the measurement. Both are cheap enough for an 8-bit MCU, both expose a single `beta`/`Kp` gain trading responsiveness against smoothness, and both are battle-tested. For an embedded attitude estimate without the machinery of a Kalman filter, Madgwick is the default.

### Kalman and the EKF

When you must fuse heterogeneous, asynchronous, time-stamped sensors — IMU plus encoders plus a wheel odometer plus an occasional vision fix — and you want a principled estimate *with a covariance*, you graduate to a **Kalman filter**. Because attitude/orientation is nonlinear (quaternions, trig), you use the **Extended Kalman Filter (EKF)**, which linearizes around the current estimate, or the **Unscented Kalman Filter (UKF)** / an **error-state EKF (ESEKF)** for better handling of the nonlinearity.
The EKF's advantages over a complementary filter: it tracks gyro bias as a state (so it learns and removes drift), it produces a covariance (so downstream consumers know how much to trust the estimate), and it cleanly incorporates new measurements at their own rates and latencies. Its costs: more compute, more states to tune, and a process/measurement-noise model you have to get right (this is where your Allan-variance numbers go). The PX4 and ArduPilot autopilots, and essentially every serious legged robot, run an EKF or ESEKF for state estimation. > **Rule of thumb**: use a Madgwick/Mahony complementary filter when you need attitude and your sensors are just an IMU (± mag). Move to an EKF when you must fuse encoders, odometry, GPS, or vision, or when you need a covariance for a downstream estimator. Do not reach for an EKF to do a job a 20-line complementary filter does fine. ### Why yaw is the hard one Here is the asymmetry that trips up newcomers: **roll and pitch are observable; yaw is not — at least not from an accelerometer.** The accelerometer measures the gravity vector, which points down. Rotating the robot in roll or pitch tilts that vector relative to the body, so the accel sees it and can correct gyro drift. But rotating in **yaw** (heading) spins the robot *around* the gravity vector — gravity looks identical before and after. The accelerometer is blind to yaw. The consequence: with a 6-axis IMU (no magnetometer), **yaw drifts without bound.** There is nothing to correct the integrated gyro heading. Over minutes, a 6-axis estimate can wander tens of degrees in yaw while roll and pitch stay rock solid. To bound yaw you need an absolute heading reference: - A **magnetometer** (the 9-axis solution) — gives a compass heading, but is fragile near motors, high currents, and ferrous structure. Needs hard-iron/soft-iron calibration. 
- **Vision/SLAM or LiDAR odometry** — corrects yaw from environmental features (see the [LiDAR & depth guide](/posts/lidar-depth-cameras-ultimate-guide/)). - **Wheel odometry** on a ground robot, or a GPS course-over-ground outdoors. This is *the* reason indoor robots without a clean magnetometer or vision fix slowly rotate their world model, and why "my robot thinks it is facing the wrong way after a few minutes" is almost always a yaw-observability problem, not a bug. ## Encoders & joint position as proprioception Joint position is proprioception, and for an articulated robot it is the proprioceptive signal — without it forward kinematics is impossible and you cannot know where the end-effector is. The transducer is almost always a **rotary encoder** on each joint. Encoders have their own full treatment, so this section is deliberately brief — see the [rotary encoders guide](/posts/encoders-ultimate-guide/) for incremental vs absolute, optical vs magnetic vs capacitive, single- vs multi-turn, resolution/accuracy, and the on-axis magnetic chips (AS5047, AS5048, MA732) that dominate robot joints. For *this* guide, the points to carry forward: - **Absolute encoders** report position without a homing move — essential for joints that must know where they are at power-on. **Incremental encoders** count from an index and need homing. - A joint typically wants **both** a high-resolution encoder on the motor (for commutation and velocity) and an absolute encoder on the gearbox output (for true joint angle, immune to backlash) — standard on harmonic-drive cobot joints (see [gearboxes](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/)). - **Velocity** is usually differentiated from position, amplifying quantization noise — encoder resolution directly limits velocity-loop quality. - In the state estimate, joint encoders are the most trusted proprioceptive input: low noise, no drift, high rate. The IMU and torque sensors play supporting roles around them. 
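To make the velocity-quantization point concrete, here is a minimal sketch, assuming velocity comes from a plain one-sample finite difference of encoder counts. The function name and the example numbers (a 14-bit encoder in a 1 kHz loop) are illustrative assumptions, not taken from any specific part.

```python
import math


def velocity_step_rad_s(counts_per_rev: int, loop_rate_hz: float) -> float:
    """Smallest nonzero velocity a one-sample finite difference can report.

    One encoder count of motion within one loop period is the quantization
    step of the differentiated velocity signal.
    """
    angle_step_rad = 2.0 * math.pi / counts_per_rev  # one count, in radians
    return angle_step_rad * loop_rate_hz             # same as dividing by dt


# Hypothetical example: 14-bit encoder (16384 counts/rev), 1 kHz loop.
step = velocity_step_rad_s(16384, 1000.0)
print(f"velocity step: {step:.3f} rad/s ({math.degrees(step):.1f} deg/s)")
```

With these assumed numbers the step is about 0.38 rad/s, which is why velocity loops typically average over several samples, run an observer, or use a higher-resolution encoder.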
## Force/torque sensing: 6-axis wrist sensors When a robot must control *contact* — insert a peg, polish a surface, deburr an edge, assemble a connector — it needs to measure the forces and torques at its end-effector. The instrument is a **6-axis force/torque (F/T) sensor**, mounted at the wrist between the robot flange and the tool. ### What it measures and how A 6-axis F/T sensor reports a full wrench: three forces (**Fx, Fy, Fz**) and three torques (**Tx, Ty, Tz**) in the sensor frame. Internally, most do it with **strain gauges** bonded to a precisely machined elastic element (often a spoked "Maltese cross" hub). When force or torque deforms the element, the gauges change resistance; arranged in **Wheatstone bridges**, those tiny resistance changes become voltages. Six (or more) bridges, a calibration matrix, and you decode the full 6-DOF wrench. Two implementation families: - **Strain-gauge (resistive)** — the classic. ATI Industrial Automation's sensors (the **Nano**, **Mini**, **Gamma**, **Delta** families) are the reference. High accuracy and stiffness, mature, but the bridge signals need careful amplification and temperature compensation. - **Capacitive / MEMS** — newer designs (some Bota Systems and OnRobot/Robotiq units) measure the elastic element's deflection capacitively. They can integrate signal conditioning and even an IMU on the same board, and tend to have excellent noise performance and built-in compensation. ### The specs that actually bite The headline number is full-scale range (e.g. ±200 N, ±10 N·m). It is rarely what limits you. The specs that cause real grief: | Spec | What it means | Why it bites | |---|---|---| | **Crosstalk (cross-axis coupling)** | A pure Fz reads as a spurious Fx/Tx | Limits how cleanly you can resolve one axis during multi-axis loading; typically 1–5% of full scale | | **Overload rating** | Force beyond which the element yields or breaks | A collision can exceed full scale by 5–10×; the sensor must survive it. 
Overload is often quoted per-axis (e.g. 5× Fz) | | **Zero / thermal drift** | Output drift with temperature and time | A warming sensor or motor heat shifts the zero by newtons over minutes; you re-bias before force tasks | | **Resolution** | Smallest resolvable force | A Nano17 resolves down to ~1/160 N; a Delta resolves ~1/8 N — pick the range that gives resolution where you need it | | **Stiffness / bandwidth** | How stiff the element is, mechanical resonance | A stiff sensor preserves position accuracy and raises bandwidth (hundreds of Hz to kHz); a compliant one acts as an unwanted spring | | **Noise** | Output noise at rest | Sets the smallest contact force you can reliably detect | > **Rule of thumb**: size an F/T sensor for *resolution at your task force*, then check that its overload rating survives your worst-case collision. Picking a ±500 N sensor for 5 N assembly forces wastes your resolution; picking a ±10 N sensor that breaks on a 60 N crash wastes the sensor. ### Real products | Sensor | Type | Typical range (Fz / Tz) | Notes | |---|---|---|---| | **ATI Nano17** | Strain gauge | ±50 N / ±0.5 N·m | Tiny (17 mm), fingertip-scale, very high resolution | | **ATI Gamma** | Strain gauge | ±400 N / ±20 N·m | Industrial workhorse for arm wrists | | **Robotiq FT 300-S** | Strain gauge | ±300 N / ±30 N·m | Plug-and-play for UR cobots, integrated comms | | **Bota Systems Rokubi / MiniONE** | Capacitive (some w/ IMU) | ±200–500 N / ±5–20 N·m | EtherCAT/USB/CAN, on-board IMU option, low drift | | **OnRobot HEX-E / HEX-H** | Strain gauge | ±200 / ±400 N | Cobot-targeted, 6-axis | | **Schunk FT** | Strain gauge | wide | Robust industrial line | A wrist F/T sensor is the right tool when you need *accurate, full 6-DOF contact wrench at the tool* — assembly, polishing, force-controlled testing. 
It is overkill (and a single point of fragility) when current-based torque estimation at the joints already gives you enough contact awareness for collision detection — which is the next section.

## Joint torque and current-based torque estimation

There are two ways to know the torque in a robot joint, and the choice between them defines a robot's cost and capability.

### Option A: true joint torque sensors

Put a strain-gauge transducer in the joint's torque path — typically on the output side, after the gearbox. This directly measures the torque the joint delivers (or absorbs), immune to friction and gear losses upstream. This is what high-end torque-controlled robots do: the **Franka Emika / Franka Research 3** has a torque sensor in *every one* of its 7 joints, which is what gives it its exquisite compliance and sensitivity. The Kuka LBR iiwa does the same. The cost is real: a torque sensor per joint adds expense, complexity, and a wiring/calibration burden at every axis.

### Option B: current-based torque estimation (the cobot trick)

Most cobots and many quadrupeds skip per-joint torque sensors and instead *infer* torque from the **motor current**. In a PM motor, torque is proportional to the torque-producing current:

```text
# Motor torque from phase current (q-axis current in FOC):
tau_motor = Kt * Iq        # Kt = torque constant [N·m/A], Iq = q-axis current [A]

# Joint output torque, accounting for gearing and losses:
tau_joint = (Kt * Iq * N * eta) - tau_friction
#   N            = gear ratio
#   eta          = gearbox efficiency (~0.6-0.9 for harmonic/cycloidal)
#   tau_friction = Coulomb + viscous friction (speed-dependent, modeled or learned)
```

The motor controller already measures `Iq` precisely to run field-oriented control (see the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/)), so the torque estimate is *free* — no extra sensor, no extra wiring, full motor bandwidth.
This is why a Universal Robots arm, or a quadruped like Unitree's, can be force-aware and collision-sensitive without a single dedicated torque sensor. The catch is accuracy. The estimate `Kt · Iq · N · η` is corrupted by: - **Friction** — Coulomb (constant) and viscous (speed-dependent) friction in the bearings and gearbox. This dominates the error and must be modeled or learned per joint. - **Gear efficiency** — harmonic and cycloidal drives lose 10–40% of torque, and the loss varies with load, speed, and temperature. - **Kt variation** — the torque constant drifts with temperature (magnet strength) and is not perfectly known. - **Backlash and elasticity** — the gearbox is not a rigid link; under dynamic loads the relationship smears. The result: current-based torque is excellent for **collision detection** and **gross compliance** (the cobot stopping when you bump it, gravity compensation, hand-guiding) but mediocre for **precise force control** at low forces. The friction floor means you typically cannot resolve joint torques below several percent of the joint's rating from current alone. > **Rule of thumb**: current-based torque estimation is "good enough to be safe and compliant, not good enough to thread a needle." If you need fine force control at the tool, add a wrist F/T sensor. If you need fine torque control at every joint, pay for joint torque sensors. For collision detection and hand-guiding, current estimation is the right, cheap answer — and it is why cobots are affordable (see the [cobots guide](/posts/collaborative-robots-cobots-ultimate-guide/)). ### Series elastic actuators: torque from deflection A third path deserves mention: the **series elastic actuator (SEA)** deliberately inserts a calibrated spring between the gearbox and the load, then measures the spring's deflection (with an encoder) to compute torque via Hooke's law, `τ = k · Δθ`. 
This turns torque sensing into position sensing — cheap and robust — and the spring adds shock tolerance and intrinsic compliance. The downside is the spring reduces control bandwidth and stiffness. SEAs show up on legged robots and some collaborative designs; see the [legged/quadruped guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/). ## Tactile & contact sensors A wrist F/T sensor tells the robot the *net* wrench at the tool. A **tactile sensor** tells it what is happening at the *contact surface itself* — where the contact is, its shape, whether it is slipping, the pressure distribution. Tactile sensing is to the gripper what skin is to a fingertip, and it is the enabling technology for dexterous manipulation (see the [end-effectors & grippers guide](/posts/end-effectors-grippers-ultimate-guide/)). ### The technology families | Type | Principle | Strengths | Weaknesses | |---|---|---|---| | **Resistive (FSR)** | Force-sensitive resistor changes resistance under pressure | Cheap, thin, simple | Poor accuracy, hysteresis, drift; mostly binary/coarse pressure | | **Capacitive** | Pressure changes plate spacing/area → capacitance | Sensitive, good for arrays, low power | Susceptible to EMI; needs guarding | | **Barometric (MEMS pressure)** | Tiny MEMS pressure sensor under an elastomer dome | Cheap, robust, calibratable, good range | One sensor = one taxel; coarse spatial resolution | | **Optical / vision-based** | Camera images a deformable gel membrane | Extremely rich data: geometry, slip, shear, texture | Bulky, camera latency, compute-heavy | | **Piezoresistive / MEMS arrays** | Micromachined pressure-sensitive array | High spatial resolution | Fragile, expensive | ### Optical tactile: GelSight and friends The standout of the last decade is **optical (vision-based) tactile** sensing. A **GelSight** sensor is, in essence, a small camera looking up at the underside of a soft, coated elastomer pad through internal illumination. 
When the pad presses against an object, it deforms to the object's shape; the camera images that deformation. Photometric-stereo reconstruction turns the image into a height map with **sub-10-micron** depth resolution — you can read the embossing on a coin, detect the onset of slip from shear deformation of printed markers, and estimate contact force from the bulk deformation. The trade-offs are real: a GelSight-style fingertip is bulkier than a flat pad, adds camera latency (tens of milliseconds), needs compute to process the image, and the gel wears and must be replaced. But for research-grade dexterity the data richness is unmatched. The MIT-originated GelSight, the open **GelSight Mini**, and Meta's open-source **DIGIT** sensor are the reference designs. ### Multimodal tactile: SynTouch **SynTouch's BioTac** takes a biomimetic route: a fingertip-shaped sensor with a fluid-filled elastomer skin over an electrode-studded core. It senses three modalities at once — **pressure** (impedance changes as the fluid thins under load), **vibration** (a hydro-acoustic sensor catches the micro-vibrations of slip and texture), and **temperature/heat-flux** (which encodes thermal properties — metal feels different from wood). It is the closest thing to a synthetic human fingertip and is used heavily in dexterity and material-recognition research. > **Rule of thumb**: use barometric or capacitive taxel arrays for affordable, robust grip-force and contact-presence sensing on production grippers. Reach for optical (GelSight/DIGIT) or BioTac when the research goal is *dexterity* — slip detection, in-hand pose, fine geometry — and you can afford the bulk, latency, and compute. ### What tactile gives you that F/T does not Slip detection is the headline. 
A wrist F/T sensor sees that grip force dropped but cannot localize *where* the object is slipping; a tactile array or optical sensor sees the incipient shear at the contact patch and can trigger a grip-force increase *before* the object falls. Tactile also gives contact localization, shape, and texture — all of which a single 6-axis wrench cannot. ## Load cells, pressure, current, temperature, and the limit switch Beyond the marquee sensors, a working robot is studded with humbler transducers that are easy to overlook and costly to omit. ### Load cells A **load cell** is a single- (or few-) axis force sensor — the strain-gauge element behind every digital scale. Robots use them for payload weighing, force-controlled pressing along one axis, and as the force element inside grippers and SEAs. Common forms: **S-beam**, **bending-beam**, **pancake/donut**, and **button** cells. Suppliers like **TE Connectivity**, **Honeywell** (Model 31, FSS/FMA series), **HBM**, and **Futek** span from sub-gram to multi-ton cells. The figure of merit is accuracy class (often a fraction of full scale, e.g. 0.1% FS), plus the same enemies as any strain device — temperature drift, creep (output slowly changing under sustained load), and nonlinearity. A load cell needs a stable, low-noise amplifier; the **HX711** 24-bit ADC is the ubiquitous cheap front end, while industrial setups use proper bridge amplifiers. ### Pressure sensors Two distinct uses. **Pneumatic/hydraulic pressure** sensors monitor the air or fluid driving soft actuators, suction grippers, and pneumatic systems (vacuum gripper feedback is a common case — see the [grippers guide](/posts/end-effectors-grippers-ultimate-guide/)). **Barometric** pressure sensors (e.g. **Bosch BMP388/BMP390**) double as altimeters on drones, giving a relative-altitude estimate that fuses with the IMU and GPS to stabilize vertical position to a meter or so. 
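As a rough illustration of how a barometer becomes an altimeter, here is a minimal sketch, assuming the standard-atmosphere barometric formula that pressure-sensor datasheets commonly quote. The function name, the constants' use as defaults, and the example pressures are assumptions for illustration; a real flight stack would also handle temperature and filtering.

```python
def pressure_to_altitude_m(pressure_pa: float, sea_level_pa: float = 101325.0) -> float:
    """Approximate altitude from static pressure (standard-atmosphere model).

    sea_level_pa is an assumed reference; absolute altitude is only as good
    as this value, which is why drones mostly use the *relative* change.
    """
    return 44330.0 * (1.0 - (pressure_pa / sea_level_pa) ** (1.0 / 5.255))


# Hypothetical readings: pressure at takeoff and a moment later in flight.
takeoff_pa = 100000.0
in_flight_pa = 99880.0
climb_m = pressure_to_altitude_m(in_flight_pa) - pressure_to_altitude_m(takeoff_pa)
print(f"relative altitude: {climb_m:.1f} m")
```

Taking the difference of two readings cancels most of the error in the sea-level reference, which is why the relative-altitude estimate is the useful one for fusion.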
### Current sensing Motor current is doing double duty: it is the inner loop of FOC *and*, as we saw, the basis of torque estimation. Current is measured with **shunt resistors** (cheap, accurate, but reference low-side or need isolation) or **Hall-effect sensors** (e.g. Allegro ACS series — galvanically isolated, no insertion loss). Bus and battery current sensing also feeds power budgeting and fault detection. See the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/) for how the current loop uses it. ### Temperature Motor windings, power transistors, batteries, and gearboxes all need thermal monitoring. Transducers range from cheap **NTC thermistors** (winding and ambient), to **RTDs (PT100/PT1000)** for accuracy, to **thermocouples** for high temperature, to the on-die temperature sensors in every modern MCU and gate driver. Thermal data feeds **I²t models** that protect motors from overheating during sustained high-current operation. ### The limit switch and bump sensor Do not over-engineer. A **mechanical limit switch** is the most reliable position-reference and end-of-travel detector ever built: a binary, latching, zero-software signal that an axis has reached a hard stop or home position. Robots still use them for homing, end-stops, and safety interlocks. A **bump sensor** (a switch behind a compliant bumper, as on every robot vacuum) is the cheapest possible collision detector. **Hall-effect** and **reed switches** give the same binary information without contact wear. > **Rule of thumb**: reach for the simplest transducer that answers the question. If "did the axis reach home?" is a yes/no, a $1 microswitch beats a $200 absolute encoder for that *specific* job. Save the expensive sensors for the questions that are genuinely analog. 
## Range & proximity for self and near-field

Between "the robot's own body" and "the full 3D map of the room" sits a band of **short-range and proximity sensing** — knowing how far a surface is, or simply whether something is *there*, within a few centimeters to a few meters. This is distinct from the long-range mapping handled by LiDAR and depth cameras (see the [LiDAR & depth cameras guide](/posts/lidar-depth-cameras-ultimate-guide/)); here we cover the cheap, on-body rangers and switches.

### Time-of-Flight (ToF) rangers

A **ToF** sensor emits a pulse of (usually infrared) light and times how long it takes to return — distance is `d = c · t / 2`. The dominant family is ST Microelectronics' **VL53** line (**VL53L0X**, **VL53L1X**, **VL53L4CX**, **VL53L8** multizone): single-chip laser rangers the size of a grain of rice, costing a few dollars, with ranges from about **1 cm to 4 m** and millimeter-class resolution at close range. They talk **I²C**, draw little power, and are everywhere — cliff detection on robot vacuums, object presence in grippers, short-range obstacle sensing, and gesture detection. The newer multizone parts (VL53L8 with an 8×8 grid) blur the line into a tiny depth sensor.

ToF limits: range falls on dark, non-reflective, or angled surfaces, and ambient sunlight can swamp the return outdoors. They are a near-field tool, not a mapping sensor.

### Ultrasonic

Ultrasonic rangers (the classic **HC-SR04**, or rugged industrial units from Pepperl+Fuchs and Banner) time an acoustic pulse instead of light. Their virtue is that they see what optical sensors miss — **glass, clear plastic, shiny or transparent surfaces** reflect sound fine even when they fool a laser. Their vices are a wide beam cone (poor angular resolution), slow update (sound travels at only ~340 m/s, so the round trip to a target 1 m away takes ~6 ms), and trouble with sound-absorbing or angled targets. Good for coarse presence and liquid-level sensing; weak for precise localization.
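The `d = c · t / 2` relation covers both optical and ultrasonic rangers; only the wave speed changes. A minimal sketch, where the function names are hypothetical and the speed of sound assumes dry air near 20 °C:

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air near 20 °C


def range_from_round_trip(round_trip_s: float, wave_speed_m_s: float) -> float:
    """Target distance from a measured round-trip time: d = v * t / 2."""
    return wave_speed_m_s * round_trip_s / 2.0


def round_trip_time(range_m: float, wave_speed_m_s: float) -> float:
    """Time for the pulse to reach a target at range_m and return."""
    return 2.0 * range_m / wave_speed_m_s


# Same 1 m target, two very different timing problems.
print(f"ultrasonic: {round_trip_time(1.0, SPEED_OF_SOUND_M_S) * 1e3:.2f} ms")
print(f"optical:    {round_trip_time(1.0, SPEED_OF_LIGHT_M_S) * 1e9:.2f} ns")
```

The millisecond-scale acoustic round trip is what caps ultrasonic update rates, while the nanosecond-scale optical round trip is why ToF chips need dedicated on-die timing hardware.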
### IR proximity and reflective sensors Cheap **IR reflective** sensors (an IR LED and a photodiode) give an analog "something is close and reflective" signal at a few centimeters; **Sharp GP2Y** rangers triangulate distance optically out to tens of centimeters. Coarse and surface-dependent but trivially cheap — line-following, edge detection, crude obstacle sensing on small robots. ### Inductive and capacitive proximity switches In industrial cells, the rugged binary workhorse is the **inductive proximity switch**: a sealed barrel sensor that detects a *metal* target within a few millimeters by the eddy currents it induces, with no contact, no wear, and an IP67/IP69K rating that shrugs off coolant and dust. **Capacitive proximity** switches detect any material (including liquids and non-metals) by a change in capacitance. Both are the unglamorous, indestructible presence detectors that confirm a part is in a fixture or a gripper is at a station — far more reliable in a dirty cell than any optical sensor. | Sensor | Range | Resolution | Best at | Weak at | |---|---|---|---|---| | **ToF (VL53)** | 1 cm–4 m | mm-class | Cheap precise short range | Dark/angled/transparent, sunlight | | **Ultrasonic** | 2 cm–4 m | cm-class | Glass, shiny, transparent | Angular resolution, speed | | **IR reflective** | 1–80 cm | coarse | Ultra-cheap presence | Surface color/reflectivity | | **Inductive prox** | 1–15 mm | binary | Rugged metal detection | Only metals, very short range | | **Capacitive prox** | 1–25 mm | binary | Any material, rugged | Short range, env sensitivity | ## Sensor specs that matter and reading a datasheet Across every sensor in this guide, the same handful of specifications decide whether it works in your loop. Learn to read these and you can size any sensor. - **Range (full scale)** — the span of values the sensor measures (±2000 °/s, ±300 N, 0–4 m). Pick a range that covers your worst case with headroom but is not so large it wastes resolution. 
- **Resolution** — the smallest change the sensor can report. For digital sensors this is partly the ADC: an N-bit ADC over a range R gives a quantization step of `R / 2^N`. A 16-bit gyro over ±2000 °/s resolves about 0.06 °/s per count. - **Accuracy** — how close the reading is to truth, after calibration. Distinct from resolution: a sensor can be high-resolution and inaccurate (precise but biased). - **Bandwidth** — the frequency range the sensor tracks faithfully, set by its internal filtering. A 100 Hz bandwidth sensor cannot report a 500 Hz vibration. Higher bandwidth means faster response but more noise (you integrate noise over more frequencies). - **Noise** — random variation at constant input, quoted as RMS, noise density, or peak-to-peak. Noise trades against bandwidth; you reduce it by filtering, which costs latency. - **Drift** — slow change in output over time and temperature at constant input. Bias drift is the silent killer of integrated quantities (gyro angle, accel position). Always check the *thermal* drift spec, not just the room-temperature number. - **Latency** — the delay from a physical event to the sensor reporting it. Internal filtering, sampling, and digital-bus transport all add latency. In a fast control loop, latency is phase lag, and phase lag is instability. - **Repeatability** — does the sensor give the same reading for the same input, run to run? More important than absolute accuracy for many control tasks, where you can calibrate out a fixed offset but not a wandering one. > **Rule of thumb**: there is no free lunch between **bandwidth, noise, and latency.** You can have low noise (heavy filtering, high latency), high bandwidth (light filtering, more noise), or low latency (light filtering, more noise) — pick two, and pick them to match your control loop, not the datasheet's hero number. 
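The resolution and noise arithmetic above fits in a few lines. This is a minimal sketch with hypothetical helper names; the inputs are the 16-bit, ±2000 °/s gyro and the 0.01 °/s/√Hz, 100 Hz example figures used in this guide.

```python
import math


def quantization_step(full_span: float, bits: int) -> float:
    """Size of one ADC count: R / 2^N, where R is the full span (max - min)."""
    return full_span / (2 ** bits)


def rms_noise(noise_density: float, bandwidth_hz: float) -> float:
    """RMS noise from a white noise density: density * sqrt(bandwidth)."""
    return noise_density * math.sqrt(bandwidth_hz)


# A ±2000 °/s range is a 4000 °/s span.
gyro_lsb = quantization_step(4000.0, 16)
gyro_rms = rms_noise(0.01, 100.0)
print(f"one count: {gyro_lsb:.4f} deg/s, RMS noise: {gyro_rms:.3f} deg/s")
```

In this example the RMS noise is already larger than one count, so a wider ADC alone would not buy more usable resolution; narrowing the bandwidth would.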
### Reading a datasheet without getting fooled A few traps to watch for: - **"Resolution" vs "accuracy."** A 16-bit output does not mean 16 bits of *accurate* data — the lower bits are often pure noise. Look for the noise spec and ENOB, not just the ADC width. - **Typical vs guaranteed.** Most sexy numbers are "typical" at 25 °C. The min/max-over-temperature numbers are what you design to. - **Conditions matter.** Noise density is quoted at a stated bandwidth; F/T resolution is quoted single-axis. Read the footnotes. - **Full-scale percentages hide absolute errors.** "0.5% FS" on a ±500 N sensor is ±2.5 N — possibly larger than the force you are trying to control. ## Sensor fusion & state estimation overview No single sensor gives a robot a complete, trustworthy picture of its state. Each has a blind spot: the gyro drifts, the accelerometer is noisy and confused by motion, the encoder is blind to base motion, the camera is slow and occasionally wrong, the current-based torque estimate is corrupted by friction. **Sensor fusion** is the art of combining them so the fused estimate is better than any input — each sensor covering another's weakness. ### Why you fuse The classic example is the IMU complementary filter from earlier: gyro (fast, drifty) plus accelerometer (slow, drift-free) yields attitude that is both fast *and* drift-free. Scale that idea up to a full robot and you get a **state estimator** that fuses: - IMU (body attitude, angular rate, linear acceleration) — fast, drifty - Joint encoders (configuration) — accurate, no drift, but only relative to the base - Joint torques / contact forces — for contact events and ground-reaction estimation - Wheel/leg odometry — position, drifty - Vision/LiDAR fixes — absolute corrections, slow and occasional into one coherent estimate of the robot's pose, velocity, and (for legged robots) contact state. The EKF or its error-state cousin is the standard machinery. 
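As a runnable illustration of why fusion beats either sensor alone, here is a minimal sketch of the single-axis complementary filter from the IMU section, run on simulated data. The simulation setup (a stationary body at a fixed tilt, a constant gyro bias, noiseless accelerometer) is an assumption chosen to make the effect easy to see.

```python
import math


def complementary_step(theta, gyro_rate, accel_y, accel_z, dt, tau=1.0):
    """One update of a single-axis complementary filter (angles in radians)."""
    alpha = tau / (tau + dt)                    # weight on the gyro path
    theta_accel = math.atan2(accel_y, accel_z)  # gravity-derived tilt
    return alpha * (theta + gyro_rate * dt) + (1.0 - alpha) * theta_accel


# Simulated case: body held still at 0.2 rad; gyro reports only its bias.
true_tilt = 0.2      # rad
gyro_bias = 0.01     # rad/s, constant (assumed)
dt = 0.01            # s, 100 Hz update

theta_fused = 0.0
theta_gyro_only = true_tilt
for _ in range(6000):                           # 60 s of data
    ay, az = math.sin(true_tilt), math.cos(true_tilt)
    theta_fused = complementary_step(theta_fused, gyro_bias, ay, az, dt)
    theta_gyro_only += gyro_bias * dt           # pure integration, no correction

print(f"fused: {theta_fused:.3f} rad, gyro-only: {theta_gyro_only:.3f} rad")
```

In this setup the gyro-only estimate keeps growing with time, while the fused estimate settles at a bounded offset of bias × tau from the true tilt. Tracking the bias as a state, as an EKF does, is what removes that residual.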
### Timing and synchronization: the silent killer

Here is the part that bites every team building their first multi-sensor robot. **Fusion is only as good as your timing.** When you combine a 1 kHz IMU with a 30 Hz camera and a torque reading arriving over CAN with jittery latency, you must know *when* each measurement was actually taken — not when it arrived at your code. A measurement applied at the wrong time is worse than no measurement. On a humanoid balancing at 1 kHz, a 5 ms timing error on the IMU spans five control periods and shows up as a several-percent error in the integrated velocity used for the next step — enough to walk the robot off the edge of stability.

The fixes are unglamorous but essential: **hardware timestamping** (latch the time the instant the sensor triggers), **time synchronization** across buses (PTP/IEEE-1588 on Ethernet, distributed clocks on EtherCAT, sync pulses for cameras), and **latency compensation** (apply a delayed measurement as a correction to a *past* state, not the present one).

> **Rule of thumb**: budget your fusion as a *timing* problem first and a *math* problem second. Most "the EKF won't converge" failures are timestamp/latency bugs, not tuning bugs.

### The role in legged and humanoid balance

For a wheeled robot, a wrong state estimate means a navigation error. For a **legged or humanoid** robot, it means a *fall*. Balancing inverted-pendulum dynamics demands a high-rate, low-latency estimate of body attitude, body velocity, and which feet are in contact — fused from IMU, joint encoders, and contact/force sensing. The contact-state estimate (which foot is on the ground, with what force) is itself a fusion problem, often using joint torque or foot force sensors. This is why legged robots run their state estimator at 500 Hz–1 kHz with carefully synchronized sensors; the [legged/quadruped guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/) and [humanoid hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/) go deeper on the dynamics this estimate feeds.
## Selecting & integrating sensors Pulling it together: choosing and wiring the self/contact sensing stack is a systems problem, and the failures are usually integration failures, not transducer failures. ### Choose by the numbers, in order 1. **Range** — does it cover your worst case with headroom (and survive overload)? 2. **Resolution at your operating point** — is the smallest resolvable step fine enough where you actually work, not just at full scale? 3. **Bandwidth and latency** — fast enough for your control loop without adding destabilizing phase lag? 4. **Noise and drift** — quiet and stable enough that you are not fusing garbage? 5. **Interface** — does it fit your bus and timing model (below)? 6. **Mechanical and environmental** — size, mass, mounting, IP rating, temperature, vibration. ### Sampling rate and bandwidth Sample at least **2× the highest frequency you care about** (Nyquist), and in practice **5–10×** for clean control. A 1 kHz balance loop wants an IMU sampled at several kHz; a slow temperature monitor is happy at 1 Hz. Over-sampling and then filtering buys noise reduction; under-sampling aliases high-frequency noise into your band irreversibly. Anti-alias filtering before the ADC is not optional for analog sensors. 
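To make the aliasing warning concrete, the sketch below computes where an out-of-band tone lands after sampling without an anti-alias filter. The helper names `aliased_frequency` and `recommended_sample_rate` are invented for this example, and the 900 Hz vibration is an assumed number, not one from a real robot.

```python
def aliased_frequency(f_signal: float, f_sample: float) -> float:
    """Apparent frequency of a tone at f_signal after sampling at f_sample."""
    if f_sample <= 0:
        raise ValueError("f_sample must be positive")
    folded = f_signal % f_sample             # wrap into [0, f_sample)
    return min(folded, f_sample - folded)    # reflect into [0, f_sample / 2]


def recommended_sample_rate(f_max: float, factor: float = 10.0) -> float:
    """Sample rate for a highest frequency of interest, using the 5-10x rule of thumb."""
    return factor * f_max


print(aliased_frequency(900.0, 1000.0))   # 900 Hz vibration sampled at 1 kHz
print(aliased_frequency(900.0, 5000.0))   # same vibration sampled at 5 kHz
print(recommended_sample_rate(100.0))     # 100 Hz of interest, 10x margin
```

Sampled at 1 kHz, the 900 Hz vibration shows up as a 100 Hz signal, squarely inside a typical control band and impossible to separate from real motion afterwards. Sampled at 5 kHz it stays at 900 Hz, where a digital low-pass can remove it.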
### Interfaces: SPI vs I²C vs CAN vs EtherCAT

| Interface | Typical use | Speed | Notes |
|---|---|---|---|
| **I²C** | IMUs, ToF, simple sensors | ~100 kHz–1 MHz | Cheap, multi-drop, but slow and not great for high-rate IMUs; addressing conflicts |
| **SPI** | High-rate IMUs, fast ADCs | up to tens of MHz | Fast, low-latency, point-to-point — the right choice for a 1 kHz+ IMU |
| **Analog + ADC** | Load cells, strain, NTC | — | Needs a clean amplifier and anti-alias filter; you own the noise |
| **CAN / CAN-FD** | Joint drives, F/T sensors, distributed nodes | 1–8 Mbit/s | Rugged, multi-drop, deterministic-ish; standard on robot joints |
| **EtherCAT** | Industrial F/T, full robot buses | 100 Mbit/s | Deterministic, hardware-synchronized (DC), the gold standard for synced multi-sensor robots |
| **USB** | Bench/research F/T, GelSight | — | Convenient, not real-time; fine for non-loop sensing |

> **Rule of thumb**: put your fast, loop-critical sensors (IMU, motor current, joint encoders) on SPI or a synchronized fieldbus (EtherCAT/CAN). Reserve I²C and USB for sensors that are not in a tight control loop. Mixing a high-rate IMU onto a shared I²C bus with five other devices is a classic self-inflicted latency wound.

### Mounting matters more than you think

- **IMU placement**: mount rigidly, near the center of mass, away from vibration sources (motors, fans). Vibration aliases into the gyro and accel and no filter fully removes it; soft-mount the board if needed. A few degrees of mounting misalignment is a calibratable but real error.
- **F/T sensors**: mount stiffly between flange and tool, and account for **tool weight and inertia** — gravity and acceleration of the tool show up as forces you must compensate (the "payload calibration" step).
- **Strain/load cells**: protect against off-axis loads and overload; a load cell loaded sideways reads wrong and can be damaged.
- **Magnetometers**: keep them as far from motors, current-carrying wires, and ferrous structure as possible, and calibrate hard-iron/soft-iron *in situ* with the actual robot. ### Calibration is not optional Every sensor here needs calibration, and skipping it is the most common reason a "working" sensor gives bad data: - **IMU**: gyro bias (re-zeroed at each startup while still), accel scale/bias (six-position tumble), magnetometer hard/soft-iron (figure-8 motion), and temperature compensation over a range. - **F/T**: zero/bias before each force task (it drifts with temperature), plus the maker's calibration matrix and tool-payload compensation. - **Load cells/strain**: tare and span with known weights. - **Tactile**: per-taxel offset/gain; gel sensors need illumination and geometry calibration. > **Rule of thumb**: budget calibration as a recurring runtime procedure, not a one-time factory step. The sensors that drift (IMUs, F/T, strain) need to be re-zeroed in the field, and your software should make that a first-class operation, not a hack. ## Frequently asked questions **What is the difference between proprioception and exteroception, and which sensors are which?** Proprioception is the robot sensing its own body — joint angles (encoders), body attitude and rate (IMU), joint torque (current estimation or torque sensors). Exteroception is sensing the external world — cameras, LiDAR, depth, microphones. Force/torque and tactile sensors straddle the line (they measure the world's contact) but behave like proprioception in the control loop. This guide covers proprioception and contact; exteroception is in the [LiDAR & depth cameras guide](/posts/lidar-depth-cameras-ultimate-guide/). **Do I need a 6-axis or a 9-axis IMU?** Use a 6-axis (accel + gyro) if you have another absolute-heading source — vision/SLAM, wheel odometry, GPS — because those bound the yaw drift that a 6-axis IMU cannot fix on its own. 
Add the magnetometer (9-axis) only if you have no other heading reference *and* your environment is magnetically clean (away from big motors and ferrous structure). On most indoor robots near big motors, the magnetometer is more trouble than it is worth, and people bound yaw with vision instead. **Why does my robot's heading drift even though the IMU is "good"?** Because yaw is unobservable from the accelerometer. Roll and pitch are corrected against gravity, but yaw rotates the robot around the gravity vector, which the accelerometer cannot see. With no magnetometer or vision fix, the integrated gyro heading drifts without bound. The fix is an absolute heading source, not a better gyro. **What is angle random walk and why should I care more than about range?** ARW (°/√h) quantifies how fast the gyro's angle estimate drifts due to white noise during integration. It directly sets how long you can dead-reckon attitude before the error matters. Range (±250 to ±2000 °/s) only matters if you spin fast; ARW and bias instability determine accuracy for nearly every robot. Read them off an Allan-variance plot of your own hardware. **Can I get joint torque without a torque sensor?** Yes — estimate it from motor current: `τ ≈ Kt · Iq · N · η − τ_friction`. The FOC controller already measures the q-axis current, so the estimate is essentially free and runs at full motor bandwidth. It is good enough for collision detection, gravity compensation, and hand-guiding (this is how most cobots work). It is *not* accurate for fine force control because friction, gear losses, and Kt variation corrupt it — for that, add a wrist F/T sensor or per-joint torque sensors. **What is crosstalk on an F/T sensor and how bad is it?** Crosstalk (cross-axis coupling) is when a load on one axis produces a spurious reading on another — e.g. a pure Fz showing up as a small Fx or Tx. It is typically 1–5% of full scale on a good sensor. 
The manufacturer's calibration matrix corrects most of it, but residual crosstalk limits how cleanly you can resolve one axis while others are loaded. It matters most in multi-axis contact tasks like insertion. **GelSight vs BioTac vs a simple FSR — when do I use each?** Use an FSR or barometric/capacitive array for cheap, robust grip-force and contact-presence sensing on a production gripper. Use a GelSight/DIGIT optical sensor when you need rich contact geometry, slip detection, and in-hand pose for dexterous manipulation research — accepting the bulk, camera latency, and compute. Use a SynTouch BioTac when you specifically want multimodal (pressure + vibration + thermal) biomimetic sensing, e.g. material recognition. Most production robots use the cheap array; the optical/biomimetic sensors are research and high-end dexterity tools. **How do I choose a sampling rate?** At least 2× your highest frequency of interest (Nyquist), and 5–10× in practice for clean control. A 1 kHz control loop wants an IMU sampled at several kHz; a temperature monitor is fine at 1 Hz. Always anti-alias filter analog sensors before the ADC — under-sampling folds high-frequency noise into your band permanently. **SPI or I²C for my IMU?** SPI, for anything in a tight control loop. I²C tops out around 1 MHz, is shared (adding latency and contention with other devices), and is awkward at high rates. SPI is point-to-point, runs at tens of MHz, and gives low, deterministic latency — exactly what a 1 kHz+ IMU needs. Save I²C for slow, non-loop sensors like a ToF ranger or a temperature chip. **Why does my F/T sensor reading drift during a task?** Thermal zero drift. The strain bridges shift their zero as the sensor warms (from ambient, from nearby motors, from its own electronics) — often by several newtons over minutes. Re-bias (tare) the sensor right before a force-sensitive operation, and prefer sensors with built-in temperature compensation if you cannot control the thermal environment. 
**Do I really need an EKF, or is a complementary filter enough?** If your sensors are just an IMU (± magnetometer) and you want attitude, a Madgwick/Mahony complementary filter is enough and far simpler. Move to an EKF when you must fuse heterogeneous, time-stamped sensors (encoders, odometry, vision, GPS), when you need a covariance for downstream consumers, or when you want the filter to estimate and remove gyro bias as a state. Do not deploy an EKF to do a job a 20-line complementary filter does fine — but do not try to bolt vision and odometry onto a complementary filter either. **What is the single most common sensor-integration mistake?** Timing. Multi-sensor fusion lives or dies on knowing *when* each measurement was actually taken, not when it arrived at your code. Most "the filter won't converge" or "the robot is unstable" problems trace to un-timestamped or wrongly-latency-compensated measurements, not to the transducers or the math. Budget hardware timestamping and time synchronization (EtherCAT DC, PTP, sync pulses) from the start. ## Changelog - **2026-06-04** — Initial publication. --- # Robot Calibration & Hand-Eye Calibration: The Ultimate Guide URL: https://blog.robo2u.com/posts/robot-calibration-ultimate-guide/ Published: 2026-06-03 Updated: 2026-06-20 Tags: robot-calibration, kinematic-calibration, hand-eye-calibration, accuracy, tcp-calibration, absolute-accuracy, dh-parameters, guide Reading time: 36 min > A 2026 field guide to robot calibration: accuracy vs repeatability, geometric and non-geometric error sources, DH/modified-DH kinematic identification, TCP and hand-eye calibration, thermal drift, and ISO 9283 validation. A six-axis industrial arm will return to the same taught point thousands of times and land within ±0.03 mm of where it was last time. Show the same arm a *new* point — one it has never been taught, computed purely from its kinematic model — and it may miss by 1 mm. Sometimes 2 mm. 
The number on the datasheet that says "repeatability ±0.02 mm" is true and the gap to that 1 mm miss is the single most expensive misunderstanding in factory automation. People buy a robot for its repeatability and then write programs that depend on its accuracy, which is a different and much worse number, and then they spend three weeks touching up points by hand wondering why offline programming "doesn't work." Calibration is how you close that gap. Not one thing — a family of related procedures, each attacking a different error source, each with its own measurement instrument, math, and failure modes. This guide walks the whole family: why accuracy and repeatability diverge, where the errors actually come from (and which ones calibration can fix versus which it can only compensate), kinematic identification with a laser tracker, tool-frame and base-frame calibration, mastering and encoder zeroing, the AX=XB hand-eye problem, payload identification, thermal drift, and how you prove the result with ISO 9283. Numbers with units, math you can read, and opinions with the reasons attached. **The take**: Repeatability is a property of the hardware; accuracy is a property of the *model*, and the model is the cheap thing to fix. A €60k arm calibrated to ±0.15 mm absolute will out-perform a €120k arm running its factory-default kinematics for any task that involves CAD-driven points, vision guidance, or moving a program between two "identical" robots. Kinematic calibration is the highest-leverage half-day of measurement in the building — but only if you measure with something an order of magnitude better than your target, identify the *observable* parameters and no more, and then validate on poses you did not use to fit. Skip the validation and you have not calibrated, you have curve-fitted noise. 
Companion reading: [robot kinematics & motion planning](/posts/motion-planning-kinematics-ultimate-guide/), [encoders](/posts/encoders-ultimate-guide/), [machine vision](/posts/machine-vision-ultimate-guide/), and [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Accuracy vs repeatability: the gap that surprises people](#accuracy-repeatability) 3. [Where the errors come from](#error-sources) 4. [Kinematic calibration: identifying the model](#kinematic-calibration) 5. [The measurement step: trackers, CMMs, photogrammetry](#measurement) 6. [TCP and tool-frame calibration](#tcp-calibration) 7. [Base and work-object frame calibration](#base-frame) 8. [Mastering, homing & encoder zeroing](#mastering) 9. [Hand-eye calibration: the AX=XB problem](#hand-eye) 10. [Payload & load identification](#payload) 11. [Thermal compensation & drift](#thermal) 12. [When calibration pays off](#when-it-pays) 13. [Validation per ISO 9283](#iso-9283) 14. [Tools & practical workflow](#workflow) 15. [Frequently asked questions](#faq) ## Key takeaways - **Repeatability and accuracy are different numbers, and the gap is large.** A modern 6-axis arm is repeatable to ±0.02–0.05 mm but accurate (out of the box) only to ±0.5–2 mm. Repeatability is set by encoders, backlash, and structural stiffness; accuracy is set by how well the controller's *model* of the arm matches the steel. Calibration fixes the model, not the steel. - **~90% of absolute-position error is geometric** — wrong link lengths, twists, and joint offsets baked into the controller's nominal Denavit–Hartenberg table. These are constant, observable, and fully correctable by kinematic identification. This is why kinematic calibration delivers the biggest single improvement, typically taking a robot from ~1 mm to ~0.15 mm. 
- **The remaining error is non-geometric and harder.** Joint compliance (gravity sag and payload deflection), gearbox backlash and transmission error, thermal growth, and encoder eccentricity. Compliance and thermal effects can be *modeled and compensated*; backlash you mostly design around with consistent approach directions. - **Measure with an instrument ~10× better than your target.** Laser trackers (Leica AT960, API Radian, FARO Vantage) give ~15 µm + 6 µm/m volumetric accuracy and are the default for arm calibration. Photogrammetry/Creaform for larger volumes; a CMM only for small workcells or end-effectors. - **TCP calibration is geometry, not kinematics.** The 4-point method finds tool *position* by jogging one physical point to a fixed tip from several orientations; you need 5–6 points and orientation references to get the full tool *frame*. Garbage TCP makes good kinematics look broken. - **Mastering/homing must be right first.** Encoder zero offsets are part of the kinematic model. If a joint's zero is off by 0.1°, no amount of link-length fitting saves you — the error couples into every pose. Re-master after any motor/encoder/gearbox service. - **Hand-eye calibration solves AX=XB** — finding the rigid transform between the camera and either the flange (eye-in-hand) or the world (eye-to-hand). Tsai–Lenz and Park–Martin are the classic closed-form solvers; modern pipelines (OpenCV `calibrateHandEye`, MoveIt hand-eye, ROS) refine with nonlinear least squares. Rotation accuracy depends on having *large, varied* rotations between poses. - **Payload identification matters for accuracy and safety.** Wrong mass/CoG/inertia degrades path accuracy, trips collision detection, and on cobots breaks the force estimate. Most controllers (KUKA LoadDataDetermination, ABB LoadIdentify, FANUC) auto-identify it by running a characterization move. - **Thermal drift is real and sneaky.** A robot can move 0.1–0.3 mm over the first 1–2 hours from cold as joints warm. 
For sub-0.1 mm work, warm up the robot, or add temperature sensors and a thermal model. - **Calibration pays off when you depend on accuracy, not repeatability**: offline programming from CAD, vision-guided picking, multi-robot cells where programs must port between arms, metrology/inspection, and any drill/route/dispense task driven by a CAD path. - **Validate per ISO 9283** on poses you did *not* use to fit the model. Report pose accuracy (AP) and pose repeatability (RP) over the standard test cube at 10%/50%/100% rated load and speed. A calibration that isn't validated on a hold-out set is not trustworthy. - **Parameter observability is the trap.** A naive DH model has redundant parameters that are not observable from the measurement geometry; fitting them blindly amplifies noise. Use a model that drops the unobservable parameters (modified-DH plus the Hayati correction for near-parallel axes) and check the identification Jacobian's condition number. ## Accuracy vs repeatability: the gap that surprises people The two words get used interchangeably in casual speech and they are not interchangeable at all. The dartboard analogy is overused but correct: **repeatability** is how tightly your throws cluster; **accuracy** is how close that cluster sits to the bullseye. A robot can be exquisitely repeatable and badly inaccurate — a tight cluster two inches left of center. **Repeatability (RP)** is the robot's ability to return to a previously *taught* pose. You jog the arm to a point, save the joint angles, and command it back. The encoders read the same counts, the joints servo to the same angles, the tool lands in the same place — within the spread caused by encoder resolution, servo settling, backlash on the approach, and structural micro-vibration. This is what the datasheet's headline number describes, and for a quality 6-axis arm it is genuinely ±0.02–0.05 mm. 
**Accuracy (AP, "absolute accuracy")** is the robot's ability to reach a pose specified in *Cartesian coordinates* it has never been taught — for example, a point read from a CAD file or computed by a vision system. To do this the controller runs inverse kinematics on its internal model of the arm, computes joint angles, and servos there. If the model says link 2 is 700.0 mm long but the steel is actually 700.4 mm, every IK solve inherits that error. Out of the box, absolute accuracy is typically ±0.5–2 mm, sometimes worse near the edge of the workspace. > **Rule:** Teach-and-repeat programs lean on repeatability and don't care about accuracy. Anything driven by external coordinates — CAD, vision, another robot's frame — leans on accuracy. Know which kind of program you are writing before you trust a number. Here is the crux: **calibration cannot improve repeatability.** Repeatability is a hardware property — you change it by buying better encoders, stiffer gearboxes, less backlash, a heavier casting. Calibration improves *accuracy* by correcting the model, and it can only ever get you as good as your repeatability. If the arm scatters ±0.05 mm on a repeated point, no model on earth makes it accurate to ±0.01 mm. Repeatability is the floor; accuracy after calibration approaches but never beats it. 
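The dartboard picture translates directly into two numbers you can compute from repeated visits to one commanded point. The sketch below is a minimal illustration with made-up measurements. The function name is hypothetical, and the formulas (AP as the distance from the cluster barycenter to the commanded point, RP as mean spread plus three standard deviations) follow the usual positional definitions associated with ISO 9283 as an assumption, so check the standard itself before reporting results.

```python
import math
import statistics


def pose_accuracy_and_repeatability(commanded, attained):
    """Position AP and RP (mm) from repeated visits to one commanded point."""
    n = len(attained)
    barycenter = [sum(p[i] for p in attained) / n for i in range(3)]
    ap = math.dist(barycenter, commanded)                   # cluster center vs target
    spreads = [math.dist(p, barycenter) for p in attained]  # each visit vs cluster center
    rp = statistics.mean(spreads) + 3.0 * statistics.stdev(spreads)
    return ap, rp


# Made-up data: a tight cluster sitting 1 mm away from the commanded point.
commanded = (500.0, 0.0, 300.0)
attained = [
    (501.01, 0.0, 300.0),
    (500.99, 0.0, 300.0),
    (501.0, 0.01, 300.0),
    (501.0, -0.01, 300.0),
]
ap, rp = pose_accuracy_and_repeatability(commanded, attained)
print(f"AP = {ap:.3f} mm, RP = {rp:.3f} mm")
```

With this data the arm is repeatable to about 0.01 mm and inaccurate by about 1 mm: the tight cluster off to one side of the bullseye.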
| Property | What it measures | Set by | Typical 6-axis arm | Improved by |
|---|---|---|---|---|
| Repeatability (RP) | Return to a *taught* pose | Encoders, backlash, stiffness, servo | ±0.02–0.05 mm | Better hardware (not calibration) |
| Accuracy (AP) | Reach a *commanded* Cartesian pose | Kinematic model fidelity | ±0.5–2 mm (uncalibrated) | Calibration (model fitting) |
| Accuracy after kinematic cal | same | Model + measurement quality | ±0.10–0.20 mm | More measurements, better instrument |
| Accuracy after full cal (+compliance/thermal) | same | Model + compensation | ±0.05–0.10 mm | Compliance & thermal modeling |

The gap between the uncalibrated accuracy row and the after-kinematic-cal row in the "Typical 6-axis arm" column, roughly 1 mm down to 0.15 mm (a factor of ~6–10), is what kinematic calibration buys you in a half-day. That is the leverage.

## Where the errors come from

To know what calibration can and cannot fix, you have to know the error budget. Errors split cleanly into **geometric** (constant, in the kinematic geometry) and **non-geometric** (load- or temperature- or direction-dependent). Roughly 80–90% of absolute error in a well-built arm is geometric, which is the good news: geometric error is constant and fully correctable.

### Geometric errors

These are mismatches between the controller's nominal kinematic parameters and the as-built machine. Every revolute joint contributes four DH parameters; manufacturing tolerances and assembly put each one slightly off:

- **Link length (`a`) error** — the perpendicular distance between consecutive joint axes is off by tenths of a millimeter. Castings and machined surfaces have tolerances.
- **Link twist (`α`) error** — consecutive joint axes aren't perfectly perpendicular/parallel as the nominal model assumes; they're off by hundredths of a degree. Small angles, long lever arms.
- **Joint offset (`d`) error** — translation along a joint axis is slightly wrong.
- **Joint angle offset (`θ` offset, the encoder zero)** — the angle the controller calls "zero" doesn't coincide with the geometric zero. This is the *mastering* error and it's the biggest single geometric contributor because it sits at the base of the chain and multiplies down it.

The leverage of an angular error is what makes this brutal. A small joint-angle error becomes a Cartesian error proportional to the distance from that joint to the tool:

```text
Tip error from a single joint-angle error:

  e ≈ θ_err · L

  where θ_err = joint angle error (radians)
        L     = distance from that joint axis to the TCP (mm)

Example: θ_err = 0.05° on joint 1, TCP at L = 1500 mm reach
  θ_err = 0.05° × (π/180) = 8.73e-4 rad
  e ≈ 8.73e-4 × 1500 mm ≈ 1.31 mm

A twentieth of a degree at the base = 1.3 mm at the tool.
This is why mastering and base-joint zeros dominate the budget.
```

That single line — `e ≈ θ_err · L` — explains most of the surprise. Angular errors are tiny and the lever arm is long. It also explains why the *base* joints (1, 2, 3) matter far more than the wrist joints (4, 5, 6) for position accuracy: they have the whole arm hanging off them as a lever.

### Non-geometric errors

These don't live in the link geometry and a pure DH fit can't capture them:

- **Joint compliance / structural deflection** — gearboxes (especially harmonic drives) and links are not rigid. Under gravity and payload, the arm sags. A 10 kg payload at 1.5 m reach can deflect the tool 0.2–0.5 mm. This is *configuration- and load-dependent*, so it shows up as a residual that varies across the workspace. Compliance can be modeled (joint stiffness coefficients, often called elasto-geometric or stiffness calibration) and compensated.
- **Backlash** — lost motion in the gear train when a joint reverses direction. Causes the tool to land in a slightly different place depending on approach direction.
Hard to model cleanly; the practical fix is to always approach points from the same direction (unidirectional approach), which is also good practice for repeatability.
- **Gear transmission error** — the output angle isn't a perfectly linear function of motor angle. Harmonic drives have a characteristic 2-cycle-per-revolution ripple of tens of arc-seconds. Periodic, position-dependent. Some high-end calibration captures it; most don't bother.
- **Thermal growth** — links and gearboxes expand as they warm from cold start and from gearbox self-heating. Steel expands ~12 µm/m/°C, aluminum ~23 µm/m/°C. A 10 °C rise over a 1.5 m arm is ~0.18 mm (steel) to ~0.35 mm (aluminum). Slow drift over the first hour or two.
- **Encoder eccentricity / runout** — if the encoder disc isn't perfectly centered on its axis, you get a once-per-revolution sinusoidal angle error. See [encoders](/posts/encoders-ultimate-guide/) for why mounting and bearing quality dominate here.
- **Dynamic errors** — tracking error during motion, vibration, controller lag. These are speed-dependent and are not what static calibration addresses (path accuracy at speed is its own ISO 9283 test).

| Error source | Type | Typical magnitude | Behavior | Calibration fixes it? |
|---|---|---|---|---|
| Link length / twist / offset | Geometric | 0.1–0.5 mm equiv. | Constant | Yes — kinematic identification |
| Encoder zero (mastering) | Geometric | 0.5–2 mm if off | Constant | Yes — re-master + identify |
| Joint compliance (gravity/payload) | Non-geometric | 0.1–0.5 mm | Config/load-dependent | Partly — stiffness model |
| Backlash | Non-geometric | 0.02–0.1 mm | Direction-dependent | No — design around it |
| Gear transmission error | Non-geometric | tens of arc-sec | Periodic in joint angle | Rarely — advanced only |
| Thermal growth | Non-geometric | 0.1–0.35 mm | Slow drift, time/temp | Partly — warm-up or thermal model |
| Encoder eccentricity | Non-geometric | arc-sec to arc-min | Periodic, 1/rev | Partly — per-joint correction |
| Dynamic / tracking | Dynamic | speed-dependent | Transient | No — controller tuning |

> **Rule:** Kinematic calibration corrects the constant geometric ~85% of the budget. To go below ~0.15 mm you have to start fighting the non-geometric residue — compliance and thermal first, because they're the largest and the most modelable.

## Kinematic calibration: identifying the model

Kinematic calibration is parameter identification: you measure where the tool actually goes for many known joint configurations, then solve for the kinematic parameters that best explain the measurements. Four steps — **model, measure, identify, compensate** — and the discipline is mostly in steps one and three.

### Step 1: The model

You need a parameterization of the kinematics whose parameters you'll fit. The standard is Denavit–Hartenberg, and you should use **modified-DH (Craig's convention)**, which places the frame at the *near* end of each link and makes the parameter assignment cleaner for identification. Each joint contributes the four parameters from above: `a` (link length), `α` (link twist), `d` (link offset), `θ` (joint angle, with the calibrated offset). For an *n*-joint arm that's 4n nominal parameters plus 6 for the base frame and 6 for the tool — but you will not, and should not, fit all of them.
(See [robot kinematics](/posts/motion-planning-kinematics-ultimate-guide/) for the forward-kinematics machinery these parameters feed.)

There is a famous trap in plain DH: when two consecutive joint axes are **parallel** (or nearly so — think the shoulder and elbow of most arms), the `d` and `θ` parameters become ill-defined and the model is *singular* with respect to small misalignments. A tiny twist between nominally parallel axes produces a huge, unstable change in `d`. The fix is the **Hayati–Mirmirani correction**: for near-parallel joints, replace the `d` parameter with an extra rotation parameter `β` about the *y*-axis. Use modified-DH + Hayati and this whole class of numerical instability disappears.

> **Rule:** Never fit a raw DH model with near-parallel axes. Use modified-DH with the Hayati β correction for the parallel pairs, or your `d` parameters will run away to absurd values and your fit will look great on the training data and terrible everywhere else.

### Step 2: Measure (covered in detail below)

Drive the robot to a set of *m* poses spread across the workspace (typically 30–100). At each, record the commanded joint angles `q_i` and measure the actual tool position (and orientation, if you can) with an external instrument.

### Step 3: Identify (the least-squares solve)

This is the heart of it. The measured tool pose is a function of the joint angles and the true-but-unknown parameters `p`. The nominal model predicts a slightly wrong pose. Linearize the error in the parameters via the **identification Jacobian** and solve for the parameter corrections:

```text
Kinematic identification — linearized least squares:

  measured pose:   x_i^meas                      (from laser tracker)
  predicted pose:  x_i = f(q_i, p_nominal)       (forward kinematics)
  pose residual:   Δx_i = x_i^meas − f(q_i, p_nominal)

For all m poses, stack:

  Δx = J · Δp        (J = identification Jacobian, ∂x/∂p)

  J has 3m (position-only) or 6m (full-pose) rows and
  as many columns as identifiable parameters.

Least-squares correction (overdetermined, m >> #params):

  Δp = (Jᵀ J)⁻¹ Jᵀ Δx          (normal equations)

  or solve via SVD / pinv for stability:
  Δp = pinv(J) · Δx

Update and iterate (it's mildly nonlinear):
  p ← p + Δp (starting from p = p_nominal), recompute Δx and J,
  repeat 2–4× until ||Δx|| stops shrinking.
```

In practice you wrap this in Levenberg–Marquardt rather than raw normal equations — it's more robust when `JᵀJ` is poorly conditioned, which it often is. The output is a corrected parameter set that you load into the controller (or into your offline model).

### Parameter observability

The single most important concept and the one people skip. Not every parameter is **observable** from your measurements — some combinations of parameters produce identical tool motions and cannot be separated, and some produce motions your measurement geometry never sees. If you try to fit an unobservable parameter, the solver invents a value to soak up noise, and that value makes the model *worse* on new poses.

Diagnose it with the **condition number** of the identification Jacobian `J`. A well-conditioned identification has a condition number in the tens to low hundreds; thousands means you have near-unobservable parameters and the solve is amplifying measurement noise. The fixes: (1) use a minimal, observable parameter set (modified-DH + Hayati already drops the classic redundancies); (2) choose measurement poses that *excite* the parameters you want — spread orientations and reach widely, don't cluster; (3) optionally run an observability-optimized pose selection (the O1–O5 observability indices in the literature) to pick the most informative configurations.

> **Rule:** Fit only observable parameters, choose poses that excite them, and always check the condition number. An over-parameterized fit with a great training residual and a terrible validation residual is the textbook symptom of fitting noise.
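The model, measure, identify loop can be sketched end to end on a toy arm. The example below is a deliberately small stand-in, not a real calibration pipeline: it uses a planar two-link arm with four parameters (two link lengths, two joint-zero offsets), synthetic noise-free "tracker" data generated from made-up as-built values, and plain Gauss-Newton on the normal equations instead of Levenberg-Marquardt. All function names (`fk`, `jacobian_rows`, `solve`, `identify`) are hypothetical.

```python
import math


def fk(q, p):
    """Planar 2R forward kinematics; p = [L1, L2, off1, off2] in mm and rad."""
    l1, l2, o1, o2 = p
    a1 = q[0] + o1
    a2 = a1 + q[1] + o2
    return (l1 * math.cos(a1) + l2 * math.cos(a2),
            l1 * math.sin(a1) + l2 * math.sin(a2))


def jacobian_rows(q, p):
    """Two rows of the identification Jacobian d(x, y)/dp for one pose."""
    l1, l2, o1, o2 = p
    a1 = q[0] + o1
    a2 = a1 + q[1] + o2
    x, y = fk(q, p)
    return ([math.cos(a1), math.cos(a2), -y, -l2 * math.sin(a2)],
            [math.sin(a1), math.sin(a2), x, l2 * math.cos(a2)])


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def identify(poses, measured, p_nominal, iterations=5):
    """Gauss-Newton: accumulate J^T J and J^T dx over all poses, solve, update, repeat."""
    p = list(p_nominal)
    n = len(p)
    for _ in range(iterations):
        jtj = [[0.0] * n for _ in range(n)]
        jtr = [0.0] * n
        for q, x_meas in zip(poses, measured):
            pred = fk(q, p)
            residuals = (x_meas[0] - pred[0], x_meas[1] - pred[1])
            for row, res in zip(jacobian_rows(q, p), residuals):
                for i in range(n):
                    jtr[i] += row[i] * res
                    for j in range(n):
                        jtj[i][j] += row[i] * row[j]
        dp = solve(jtj, jtr)
        p = [pi + di for pi, di in zip(p, dp)]
    return p


p_true = [700.4, 599.7, 0.0009, -0.0005]   # made-up "as-built" arm
p_nominal = [700.0, 600.0, 0.0, 0.0]       # controller's nominal model
poses = [(q1, q2) for q1 in (-1.0, -0.5, 0.0, 0.5, 1.0)
         for q2 in (0.3, 0.9, 1.5, 2.1)]
measured = [fk(q, p_true) for q in poses]  # stand-in for tracker data, noise-free
p_fit = identify(poses, measured, p_nominal)
print([round(v, 6) for v in p_fit])
```

Because the data are noise-free and every parameter is excited by the spread of elbow angles, the fit recovers the as-built values. With real measurements you would add noise handling, check the Jacobian's condition number with a proper SVD routine, and validate on poses held out of the fit.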
## The measurement step: trackers, CMMs, photogrammetry Your calibration is only as good as your measurement, and the rule is unforgiving: **the instrument must be ~10× more accurate than your target.** Calibrating to ±0.15 mm means measuring to ~±0.015 mm. That requirement alone rules out most things and points straight at the laser tracker. **Laser trackers** are the default for arm calibration. A tracker (Leica Absolute Tracker AT960/AT930, API Radian, FARO Vantage/ION) sends a laser to a spherically-mounted retroreflector (SMR) on the robot flange and measures range by interferometry/absolute distance meter plus two angles. Volumetric accuracy is around ±15 µm + 6 µm/m, so ~±25 µm at 1.5 m. They measure at high rate, track a moving target, and reach across a whole cell. The 6DoF variants (Leica T-Mac, API STS) measure orientation too, which roughly doubles the information per pose and tightens the fit. This is what RoboDK, Dynalog CalibWare, and the OEM calibration services all use. **Photogrammetry / structured-light (Creaform)** systems (Creaform MetraSCAN/C-Track, GOM/ZEISS, AICON) track coded targets or a probe with stereo cameras. Accuracy is in the 20–60 µm range over volume — slightly behind a tracker but excellent for *large* volumes, multi-robot cells, and when you want to digitize a fixture or work-object surface at the same time. C-Track-style dual-camera systems give 6DoF naturally. **CMM (coordinate measuring machine)** is the most accurate (single-digit µm) but the worst *fit* for robot calibration: it's a fixed-volume gantry, you'd have to put the robot inside it, and the working volume rarely matches a robot's reach. Use a CMM to certify a TCP artifact or a small end-effector, not to calibrate the arm in situ. **Low-cost / on-machine methods** exist and have their place: a calibrated ballbar or telescoping double-ballbar, a fixed reference sphere probed from many orientations, or vision-based methods using a calibrated camera and target. 
They get you to ~0.3–0.5 mm — useful for a sanity check or a budget shop, not for true absolute accuracy. | Instrument | Volumetric accuracy | 6DoF? | Working volume | Best for | Rough cost | |---|---|---|---|---|---| | Laser tracker (Leica AT960, API Radian, FARO) | ±15 µm + 6 µm/m | Optional (T-Mac/STS) | Whole cell, 10s of m | Arm kinematic calibration (the default) | €80k–150k+ | | Photogrammetry (Creaform, GOM, AICON) | ~20–60 µm | Yes (dual-camera) | Large, multi-robot cells | Large volumes + surface digitizing | €60k–120k | | CMM | 1–5 µm | Pose via probing | Fixed, small | TCP artifacts, end-effectors | Fixed asset | | Ballbar / reference sphere | ~30–100 µm | No | Local | Cheap check, partial cal | €5k–20k | | Vision target (camera + checkerboard) | ~0.1–0.5 mm | Yes | Camera FoV | Hand-eye, budget cal | €1k–10k | > **Rule:** If you can't measure ~10× tighter than your accuracy goal, you can't verify whether you hit it — and an unverifiable calibration is a guess. Borrow or rent a tracker for the day rather than calibrate with the wrong tool. ## TCP and tool-frame calibration The Tool Center Point is the working point of whatever the robot holds — the tip of a welding torch, the center of a gripper's jaws, the nozzle of a dispenser. The controller knows the flange pose from kinematics; the **TCP offset** is the rigid transform from the flange frame to the tool's working frame. Get it wrong and every Cartesian motion, every reorientation about the tool, every taught point is wrong by that offset. This is *geometry, not kinematics* — you're finding a fixed 6-parameter transform, not fitting link parameters — but it's done on the robot and it's done constantly, so it deserves its own discipline. ### The 4-point method (position only) The classic. Place a fixed, sharp reference tip somewhere in the workspace. Jog the tool's working point to touch that single fixed point from **four (or more) very different orientations**. 
The flange is in a different pose each time, but the tool tip is at the same world point. The controller solves for the tool offset `(x, y, z)` that makes all four flange poses map the tip to one common point. ```text 4-point TCP — the constraint: For each touch i: p_world = T_flange,i · t_tool where T_flange,i = flange pose (known from kinematics) t_tool = unknown tool offset [x, y, z, 1]ᵀ p_world = the (also unknown) fixed reference point All touches share one p_world ⇒ overdetermined linear system in (t_tool, p_world). Solve by least squares. Quality depends on ORIENTATION SPREAD: four nearly-identical orientations give a near-singular system. Spread them wide (≥ 45° apart, mix all wrist axes) for a good solve. ``` Accuracy is typically ±0.2–0.5 mm and is limited by how precisely a human can jog the tip to the reference and by the *robot's own accuracy* — a calibrated arm gives a better TCP. Use 5–6 points, not the minimum 4; the extra touches average out jogging error. ### Getting orientation (the full tool frame) The 4-point method gives only the tool *position*. For a full tool frame you also need the tool's *orientation* relative to the flange. Methods: - **5/6-point (XYZ + Z, or XYZ + X + Z):** after the 4-point position solve, jog the tool along its intended +Z (and +X) from the reference point to teach the tool's axis directions. - **Reference-object / abc-world:** orient the tool to match a known reference orientation. - **CAD value:** for a precisely machined tool of known geometry, just type the offset from the drawing. This is often better than touch-up for a well-made part, but confirm it with a touch-check. > **Rule:** A bad TCP makes a perfectly calibrated arm look broken — reorienting about the tool will sweep the tip through an arc instead of pivoting in place. If "rotate about TCP" doesn't keep the tip stationary, your TCP is wrong, full stop. That test is the fastest TCP sanity check there is.
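The 4-point constraint reduces to a small linear least-squares problem, which the following sketch works through on synthetic data. It is a minimal illustration, not any controller's actual routine: the flange poses are simulated from an assumed tool offset, there is no jogging noise, and the helper names (`rot_xyz`, `solve_tcp`, `simulate`) are made up for this example.

```python
import numpy as np

def rot_xyz(rx, ry, rz):
    """Rotation matrix Rz @ Ry @ Rx from angles given in degrees."""
    rx, ry, rz = np.radians([rx, ry, rz])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz), np.cos(rz), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def solve_tcp(R_list, p_list):
    """Each touch gives R_i @ t + p_i = p_world,
    i.e. [R_i, -I] @ [t; p_world] = -p_i. Stack all touches and solve."""
    A = np.vstack([np.hstack([R, -np.eye(3)]) for R in R_list])
    b = -np.concatenate(p_list)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:], np.linalg.cond(A)

# Assumed ground truth for the simulation (metres).
t_true = np.array([0.010, -0.020, 0.150])    # tool offset in the flange frame
p_world = np.array([0.600, 0.100, 0.300])    # fixed reference tip
angles = [(0, 0, 0), (45, 0, 0), (0, 45, 0), (0, 0, 60), (30, -40, 20)]

def simulate(scale):
    """Generate flange poses that all put the tool tip on p_world."""
    R_list = [rot_xyz(*(scale * np.array(a))) for a in angles]
    p_list = [p_world - R @ t_true for R in R_list]
    return solve_tcp(R_list, p_list)

t_wide, pw_wide, cond_wide = simulate(1.0)     # wide orientation spread
_, _, cond_narrow = simulate(0.05)             # same touches squeezed to ~2 degrees

print("recovered tool offset:", t_wide)
print("condition number, wide vs narrow:", cond_wide, cond_narrow)
```

The narrow-spread condition number comes out much larger than the wide-spread one, which is the numerical face of the "spread them wide" advice: with real jogging noise, that conditioning gap turns directly into TCP error.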
## Base and work-object frame calibration Two more frames, both essential for any program that references the world rather than the robot. **Base / world frame** locates the robot's base in the cell's coordinate system. You need it whenever coordinates come from outside the robot: a conveyor, a fixture surveyed in CAD, a second robot, or a vision system reporting in world coordinates. Establish it by touching three known points (origin, +X direction, point in the +XY plane) with a calibrated TCP, or far better, by measuring the base frame directly with the laser tracker you already set up for kinematic calibration. Tracker-based base framing removes the human-jog error and is essential in multi-robot cells where two arms must agree on where the world is to better than 0.2 mm. **Work-object / user frame** locates the part or fixture you're working on. You teach points in this frame so that if the fixture moves (or you move the program to a second, slightly different fixture), you re-teach only the frame, not every point. The 3-point method (origin, +X, +XY) is standard. The big payoff: programs become portable. A weld program written in a work-object frame survives the fixture being relocated 10 mm and rotated 1° — you re-survey the frame and every taught point follows. > **Rule:** Build the dependency chain deliberately — world → base → work-object → TCP. Each frame inherits the error of the frames above it. A 0.5 mm base-frame error sits under *every* point in *every* work-object on that robot, so spend your best measurement on the frames nearest the base. ## Mastering, homing & encoder zeroing Before any of the above means anything, the robot has to know what angle each joint is actually at. **Mastering** (a.k.a. homing, zeroing, or "syncing") establishes the correspondence between each joint's encoder reading and its true geometric angle. 
It is the `θ`-offset parameter from the DH model, and as the `e ≈ θ_err · L` math showed, it has the longest lever arm of any error in the machine. Most industrial arms have a mechanical or optical reference per joint — a notch, a dial, a witness mark, or a reference cartridge/EMD that the controller probes — defining the master position. You drive each joint to its reference and tell the controller "this encoder count is the master angle." On absolute-encoder arms this survives power-down; on incremental-encoder arms the robot must home on startup. (The encoder distinction matters a lot here — see [encoders](/posts/encoders-ultimate-guide/).) Why it must be right: - **It's the largest geometric error if wrong.** A 0.1° mastering error on joint 1 of a 1.5 m-reach arm is ~2.6 mm at the tool (`e ≈ θ_err · L`). No link-length fit can recover from a wrong zero — the optimizer will distort *other* parameters trying to compensate, ruining the whole model. - **It changes after service.** Replacing a motor, encoder, or gearbox can shift the master, and so can a hard collision. Always re-master after mechanical service on a joint, and re-run (at least) a quick accuracy check afterward. - **It's a prerequisite, not a step.** Kinematic identification *includes* refining the joint-angle offsets, but it converges far better if you start from a good mechanical master. Garbage mastering in, garbage parameters out. > **Rule:** Re-master after any service that touches a joint's motor, encoder, or gearbox — then re-verify accuracy. A robot that was calibrated to 0.1 mm and then had joint 3's motor swapped is no longer calibrated, regardless of what the controller still claims. ## Hand-eye calibration: the AX=XB problem The moment you bolt a camera to a robot (or aim one at its workspace), you have a new unknown: the rigid transform between the camera's optical frame and the robot's frames.
The camera reports object poses in *its* coordinates; the robot moves in *its* coordinates; nothing useful happens until you know the transform between them. Finding it is **hand-eye calibration**, and it underpins all vision-guided robotics. (For the camera side — intrinsics, lens distortion, stereo, depth — see [machine vision](/posts/machine-vision-ultimate-guide/) and [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/); for the broader sensing context, [robot sensors](/posts/robot-sensors-ultimate-guide/).) ### Two configurations - **Eye-in-hand:** camera mounted on the robot flange/wrist, moving with the arm. You're solving for **X = flange→camera** transform. Common in pick-and-place and inspection where the camera needs to get close. - **Eye-to-hand (eye-to-base):** camera fixed in the cell, watching the workspace. You're solving for **X = base→camera** (equivalently camera→base). Common when one fixed overhead camera serves the whole cell. ### The math: AX = XB The classic formulation. Move the robot between pairs of poses while observing a fixed calibration target. Between two robot poses, the robot's flange moves by a known relative transform **A** (from the robot's forward kinematics) and the camera's view of the target moves by a measured relative transform **B** (from the vision solve). The unknown hand-eye transform **X** satisfies: ```text Hand-eye: A X = X B A = relative robot motion between two poses (from kinematics) B = relative camera-to-target motion (from vision) X = the unknown camera↔flange (or camera↔base) transform Split into rotation and translation: R_A R_X = R_X R_B (rotation: solve first) R_A t_X + t_A = R_X t_B + t_X (translation: solve second) Rotation accuracy needs LARGE, VARIED rotations between poses. Pure translation moves give NO rotation info — X_rot stays unobservable. Use ≥ 10–15 poses with big, diverse orientation changes (tip the camera ≥ 30–45° about different axes). 
``` The rotation part is solved first (it's independent of translation), then translation is solved using the recovered rotation. **Closed-form solvers:** - **Tsai–Lenz (1989):** the workhorse. Solves rotation via an angle-axis (Rodrigues) formulation, then translation linearly. Fast, well-understood, the reference implementation in OpenCV (`CALIB_HAND_EYE_TSAI`). - **Park–Martin (1994):** uses Lie-group / `so(3)` least squares for the rotation, often more robust to noise than Tsai–Lenz. - **Horaud–Dornaika, Daniilidis (dual-quaternion):** Daniilidis solves rotation and translation *simultaneously* using dual quaternions, which can be more accurate when the two are coupled. Modern practice: get a closed-form initial estimate from one of the above, then **refine with nonlinear least squares** (bundle-adjustment-style, minimizing reprojection error over all poses jointly). OpenCV's `calibrateHandEye` offers all the classic methods; the MoveIt hand-eye calibration plugin and ROS pipelines wrap this with a live target (an ArUco/ChArUco board or AprilTag) and pose collection. ### Practical notes The dominant error driver is **rotation diversity**. People collect 12 poses that are all small nudges of position with the camera staring the same way, the rotation system is near-singular, and the result is a translation that looks plausible but a rotation that's off by a couple of degrees — which then throws position errors that grow with target distance. Tip and twist the camera aggressively across poses. Use a **ChArUco board** over a plain checkerboard (it tolerates partial occlusion and gives sub-pixel corners), keep the board flat and rigid, and span the camera's working depth. 
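The rotation-then-translation structure of AX = XB can be sketched in a few lines of NumPy. This is an illustrative Park–Martin-style solve under stated assumptions, not OpenCV's implementation: it aligns the axis-angle vectors of A and B with an SVD (Kabsch) fit, then solves the stacked translation equations by least squares. The data is synthetic and noise-free, and all function names are hypothetical.

```python
import numpy as np

def exp_so3(w):
    """Rodrigues formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def log_so3(R):
    """Rotation matrix -> axis-angle vector (assumes 0 < angle < pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2 * np.sin(theta))

def solve_ax_xb(A_list, B_list):
    # Rotation: R_A = R_X R_B R_X^T implies alpha_i = R_X @ beta_i for the
    # axis-angle vectors, so fit R_X by aligning the betas onto the alphas.
    alphas = np.array([log_so3(A[:3, :3]) for A in A_list])
    betas = np.array([log_so3(B[:3, :3]) for B in B_list])
    U, _, Vt = np.linalg.svd(betas.T @ alphas)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R_X = Vt.T @ D @ U.T
    # Translation: (R_A - I) t_X = R_X t_B - t_A, stacked over all motions.
    M = np.vstack([A[:3, :3] - np.eye(3) for A in A_list])
    rhs = np.concatenate([R_X @ B[:3, 3] - A[:3, 3]
                          for A, B in zip(A_list, B_list)])
    t_X, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R_X, t_X
    return X

def random_motion(rng):
    """Relative motion with a large rotation (0.5 to 1.5 rad) about a random axis."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    T = np.eye(4)
    T[:3, :3] = exp_so3(rng.uniform(0.5, 1.5) * axis)
    T[:3, 3] = rng.uniform(-0.3, 0.3, size=3)
    return T

rng = np.random.default_rng(1)
X_true = np.eye(4)                                     # assumed ground truth
X_true[:3, :3] = exp_so3(np.array([0.1, -0.2, 0.3]))
X_true[:3, 3] = [0.03, -0.01, 0.08]
B_list = [random_motion(rng) for _ in range(12)]       # camera motions
A_list = [X_true @ B @ np.linalg.inv(X_true) for B in B_list]   # so that A X = X B

X_est = solve_ax_xb(A_list, B_list)
print(np.round(X_est, 6))
```

If every motion in `B_list` rotates about the same axis, both the SVD alignment and the stacked translation system lose rank, which is the rotation-diversity problem described above in numerical form. For real data, OpenCV's `calibrateHandEye` followed by a nonlinear refinement is the more practical route.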
| Method | Rotation approach | Solves R,t | Noise robustness | Use when | |---|---|---|---|---| | Tsai–Lenz | Angle-axis (Rodrigues) | Sequentially | Good | Default; well-tested baseline | | Park–Martin | Lie-group / so(3) LS | Sequentially | Better | Noisier data, want robustness | | Horaud–Dornaika | Quaternion / nonlinear | Sequentially or joint | Good | Moderate noise | | Daniilidis (dual-quaternion) | Dual quaternion | Simultaneously | Best when R,t coupled | R and t strongly coupled | | Nonlinear refinement (BA) | Manifold optimization | Jointly, all poses | Best overall | Always, as a final polish | > **Rule:** Hand-eye rotation accuracy lives or dies on orientation diversity between poses. If your poses don't include large, varied rotations, the rotation is unobservable no matter which solver you pick — and a 1° rotation error becomes a position error that grows linearly with how far the target sits from the camera. ## Payload & load identification The controller needs to know the **mass, center of gravity, and inertia tensor** of whatever the robot carries. This isn't just dynamics housekeeping — it bears directly on accuracy and safety. - **Accuracy:** the payload deflects the arm (the compliance term from the error budget). The controller's gravity-compensation and any stiffness model need the correct mass and CoG to predict and cancel that deflection. Wrong payload, wrong compensation, worse accuracy at speed. - **Safety and collision detection:** the controller estimates external forces by comparing expected joint torques (from the dynamic model + payload) against measured torques. If the declared payload is wrong, the residual is wrong, and collision detection either nuisance-trips or — worse — fails to trip. On cobots this is the foundation of force/torque-based safety and hand-guiding ([robot safety](/posts/robot-safety-functional-safety-ultimate-guide/) covers the safety side).
- **Path tracking:** feedforward dynamic compensation needs the inertia tensor to anticipate the torques for accelerations. Wrong inertia, more tracking error during fast moves. Every major OEM ships a **load identification** routine: KUKA *LoadDataDetermination*, ABB *LoadIdentify*, FANUC payload estimation, UR's built-in payload wizard. You mount the load, run a prescribed characterization motion (the robot moves several joints through a sequence while measuring motor torques), and the controller solves for mass, CoG, and inertia from the torque data. Run it whenever the end-effector or grasped part changes significantly — and for variable payloads (e.g., a gripper that sometimes holds a 0.5 kg part and sometimes a 5 kg part), configure multiple payload records and switch in software. > **Rule:** Declare the real payload. A wrong payload silently degrades accuracy, defeats collision detection, and on a cobot corrupts the force estimate the safety case depends on. The auto-identify routine takes two minutes; run it. ## Thermal compensation & drift The error that ambushes people who calibrated perfectly in the morning and find the robot off by 0.2 mm by mid-shift. The arm changes shape as it warms — from ambient swings, from sun on a wall, and most of all from the gearboxes generating heat as they work. The physics is just thermal expansion: steel ~12 µm/m/°C, aluminum ~23 µm/m/°C. A robot's links and gearbox housings warm 5–15 °C from cold start to thermal equilibrium over the first 1–2 hours of operation. Over a 1.5 m arm that's roughly 0.1–0.35 mm of drift — and because the heating is uneven (gearboxes hot, links cooler), it's not a simple uniform scale. For teach-and-repeat work nobody notices (repeatability is unaffected; the whole frame drifts together-ish). For absolute-accuracy work it's a real, time-varying error on top of your calibration. What to do, in order of effort: 1. **Warm up the robot.** The cheapest fix. 
Run a representative motion cycle for 30–60 minutes before precision work, and calibrate when warm. Many shops mandate a warm-up program. 2. **Calibrate at operating temperature.** If the robot runs hot, calibrate hot. A calibration done cold is wrong by the drift amount once the robot warms. 3. **Thermal model + temperature sensors.** High-end systems (and some OEM accuracy packages) put temperature sensors on the joints and apply a thermal-expansion correction to the kinematic model in real time. This is what gets you stable sub-0.1 mm accuracy across a shift. 4. **Control the environment.** Stable ambient temperature, no direct sun, no HVAC blasting one side of the cell. > **Rule:** A calibration is valid at the temperature it was taken. If you need sub-0.1 mm all shift, either warm the robot to a steady state and keep it there, or instrument it with temperature sensors and a thermal model. "We calibrated it once, cold" is not a thermal strategy. ## When calibration pays off Calibration isn't free — instrument time, downtime, expertise — so spend it where accuracy (not repeatability) is the constraint. The tells: - **Offline programming (OLP).** Generating robot programs from CAD in RoboDK, Process Simulate, Delmia, or RobotStudio. The whole point of OLP is to skip manual teach-up; that only works if the real robot matches the simulated model, which means it must be accurate, not just repeatable. **OLP without calibration is the #1 disappointment in this field** — people generate a beautiful program and then spend days touching up every point because the arm is 1 mm off. Calibrate to ~0.15 mm and the touch-up nearly vanishes. - **Vision-guided tasks.** Bin picking, conveyor tracking, any pick from a vision-reported pose. The robot reaches Cartesian coordinates it was never taught — pure accuracy dependence. Garbage accuracy means the gripper misses the part even with a perfect vision solve. 
- **Multi-robot cells / program portability.** When a program must move between "identical" robots (line balancing, replacing a failed arm, deploying the same job to 20 stations), each arm's accuracy must be good enough that one program fits all. Uncalibrated, every arm is uniquely wrong by ~1 mm and programs don't port. Calibrated arms are interchangeable. - **Metrology and inspection.** The robot *is* the measuring instrument (or carries one). Accuracy is the spec. - **CAD-path process tasks.** Drilling, routing, deburring, dispensing, waterjet, additive — anywhere the path comes from CAD and tolerances are tight. Where calibration buys you little: a fixed pick-place-stack cell with hand-taught points and no external coordinates. That's pure teach-and-repeat; repeatability carries it and calibration adds nothing the program uses. Don't calibrate reflexively — calibrate the robots whose programs depend on accuracy. ## Validation per ISO 9283 You haven't calibrated until you've *measured* the result on poses you didn't use to fit the model. The standard for industrial-robot performance is **ISO 9283:1998 (Manipulating industrial robots — Performance criteria and related test methods)**, and it defines exactly what to measure and how. ISO 9283 prescribes a test setup: a cube positioned in the working space (typically the largest cube that fits, tilted to use the workspace), with measurement at the cube's diagonal-plane points (P1–P5). The robot is sent to these poses repeatedly (30 cycles per the standard) at specified speeds and loads, and an external instrument records where it actually lands. Key metrics: - **Pose accuracy (AP):** the distance between the *commanded* pose and the *mean* of the attained poses. This is absolute accuracy — what calibration improves. Split into position (APp) and orientation (APa, APb, APc) components. 
- **Pose repeatability (RP):** the spread (radius of the sphere containing the attained-pose cluster, at a confidence level) of the attained poses about their mean. This is repeatability — calibration does *not* change it. - Plus: distance accuracy/repeatability (AD/RD), path accuracy (AT) and path repeatability (RT) for continuous-path work, cornering, velocity accuracy, and more. > **Rule:** Test at 10%, 50%, and 100% of rated load and rated speed per the standard — not just unloaded and slow. Accuracy degrades with payload (compliance) and speed (dynamics), and a calibration that's only verified at low load and low speed hides exactly the conditions that bite in production. The non-negotiable discipline: **validate on a hold-out set.** Use one set of poses to *fit* the kinematic parameters and a *different*, independent set to *measure* AP and RP. If you report the residual on the fitting poses as your accuracy, you're reporting how well you memorized the noise, not how well the model generalizes. A good calibration shows a fitting residual and a validation residual that are close (e.g., 0.12 mm fit, 0.15 mm validation). A big gap (0.05 mm fit, 0.4 mm validation) is the signature of over-fitting unobservable parameters — go back to the model and the condition number. ## Tools & practical workflow The software and the order of operations. **Calibration software:** - **RoboDK** — popular, affordable OLP suite with a calibration module that drives a laser tracker, runs the identification, and writes corrected kinematics back to the robot or into the OLP model. Strong for the calibrate-then-OLP workflow. - **Dynalog CalibWare / DynaCal** — long-established dedicated robot-calibration package, tracker-driven, used by OEMs and integrators. - **OEM accuracy packages** — ABB *Absolute Accuracy*, KUKA accuracy options, FANUC, Stäubli. 
These are factory-calibrated-at-build options where the robot ships with identified parameters and (sometimes) compliance/thermal compensation. Buy the absolute-accuracy option at order time if your application needs it — retrofitting is more work. - **MoveIt 2 hand-eye calibration** and OpenCV `calibrateHandEye` for the vision side; ChArUco/AprilTag targets for pose collection. - **Metrology software:** Leica Tracker Pilot, SpatialAnalyzer, PolyWorks, Verisurf for the measurement and analysis. **A practical end-to-end workflow:** 1. **Mechanical check first.** Verify mounting is rigid, no loose bolts, gearboxes serviced. Then **master/home** every joint to its reference. Mastering is the foundation — do it right or stop here. 2. **Warm up** the robot to operating temperature (30–60 min representative cycle) so you calibrate hot if it runs hot. 3. **Set up the laser tracker**, mount the SMR/6DoF target on the flange, establish the tracker-to-robot relationship. 4. **Collect calibration poses** — 30–100 configurations spread widely across the workspace and orientation range, ideally observability-optimized. Record commanded joint angles and measured tool poses. 5. **Identify** the kinematic parameters: modified-DH + Hayati for parallel axes, Levenberg–Marquardt least squares, check the **condition number**, fit only observable parameters. 6. **Load** the corrected parameters into the controller (or OLP model). 7. **Calibrate the TCP** (5–6 point + orientation) and the **base / work-object frames**, ideally tracker-measured. 8. **Identify the payload** with the OEM routine. 9. **Validate per ISO 9283** on a hold-out pose set, at 10/50/100% load and speed. Report AP and RP. 10. **Document and schedule re-checks** — re-verify periodically and after any service touching a joint. > **Rule:** Order matters. Master → warm up → kinematic identify → TCP/frames → payload → validate.
Each step assumes the previous ones are correct; doing TCP before mastering, or skipping the warm-up before a precision calibration, quietly poisons everything downstream. ## Frequently asked questions **Why is my robot repeatable to 0.02 mm but misses CAD points by 1 mm?** Because those are different specifications. Repeatability is returning to a *taught* pose — pure hardware. Reaching a CAD point requires the controller to run inverse kinematics on its internal model, and that model is off by manufacturing tolerances, so every computed pose inherits ~1 mm of geometric error. Kinematic calibration fixes the model and typically brings absolute accuracy to ~0.15 mm. **Can calibration improve repeatability?** No. Repeatability is set by encoders, backlash, and structural stiffness — hardware. Calibration corrects the kinematic model, which only affects *accuracy*. Calibrated accuracy can approach but never beat the repeatability floor: if the arm scatters ±0.05 mm, no model makes it accurate to ±0.01 mm. **Do I need a laser tracker, or can I use a cheaper method?** For true absolute accuracy (~0.15 mm) you need to measure ~10× tighter (~0.015 mm), which means a laser tracker or comparable photogrammetry. Cheaper methods (ballbar, reference sphere, vision target) reach ~0.3–0.5 mm — fine for a sanity check or a budget shop, not for verified sub-0.2 mm accuracy. Rent a tracker for the day if buying isn't justified. **What's the difference between DH and modified-DH for calibration?** Both are 4-parameter-per-joint kinematic conventions. Modified-DH (Craig) puts the frame at the near end of each link, which makes parameter assignment cleaner for identification. For calibration, always use modified-DH *plus* the Hayati β correction on near-parallel joint pairs — plain DH is numerically singular for parallel axes and the `d` parameters blow up. 
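The `d` blow-up is easy to reproduce numerically. In plain DH the frame origin sits where the common normal between consecutive joint axes meets the axis, so `d` follows that foot point. The sketch below is a geometric illustration under assumed numbers (a 0.5 m link, a 0.1 mm lateral misalignment, and a small twist between two nominally parallel axes); the function names are hypothetical.

```python
import numpy as np

def foot_on_axis1(p1, u, p2, v):
    """Signed distance along axis 1 (from p1) to the foot of the common
    normal shared with axis 2. Undefined for exactly parallel axes."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    w0 = p1 - p2
    b, d, e = u @ v, u @ w0, v @ w0
    n = np.cross(u, v)                  # |n|^2 = sin^2(twist), tends to 0 when parallel
    return (b * e - d) / (n @ n)

def d_for_twist(twist_deg, lateral_offset=0.1e-3, link=0.5):
    """DH-style offset along axis 1 for a given twist between the two axes."""
    eps = np.radians(twist_deg)
    p1, u = np.zeros(3), np.array([0.0, 0.0, 1.0])       # axis 1: the z-axis
    p2 = np.array([link, lateral_offset, 0.0])           # axis 2 passes through here
    v = np.array([0.0, -np.sin(eps), np.cos(eps)])       # axis 2, twisted about x
    return foot_on_axis1(p1, u, p2, v)

d_values = {t: d_for_twist(t) for t in (1.0, 0.1, 0.01)}
for twist, d in d_values.items():
    print(f"twist {twist:5.2f} deg -> d = {d * 1000:7.1f} mm")
```

In this setup `d` equals the lateral offset divided by the tangent of the twist, so the same 0.1 mm misalignment moves the frame origin by about 6 mm at 1° of twist and by more than half a metre at 0.01°. Replacing `d` with the Hayati `β` rotation removes that division by a near-zero quantity from the model.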
**My hand-eye calibration translation looks right but the rotation seems off — why?** Almost always insufficient rotation diversity in your poses. The AX=XB rotation is only observable if the camera undergoes large, varied rotations between poses. If your poses are mostly translations with the camera pointing the same way, the rotation solve is near-singular. Tip and twist the camera ≥30–45° about different axes across ≥10–15 poses. **Eye-in-hand or eye-to-hand — which should I use?** Eye-in-hand (camera on the flange) when the camera needs to get close to the work, inspect from varied viewpoints, or serve a large workspace from one moving sensor. Eye-to-hand (fixed camera) when one overhead view covers the whole cell and you want the camera out of the way. The AX=XB math is the same; you solve for flange→camera vs base→camera respectively. **How often do I need to re-calibrate?** Re-master and re-verify after any service touching a joint's motor, encoder, or gearbox, or after a hard collision. Otherwise schedule a periodic accuracy check (quarterly to yearly depending on duty) — kinematic parameters are stable in steel, but wear, thermal cycling, and minor crashes drift them over time. **Why does my robot drift during the day even though I calibrated it?** Thermal growth. Links and gearboxes warm 5–15 °C from cold start to equilibrium, expanding 0.1–0.35 mm over a 1.5 m arm. A cold calibration is wrong once the robot warms. Warm the robot before precision work and calibrate hot, or instrument it with temperature sensors and a thermal model for stable sub-0.1 mm all shift. **Does the payload really affect accuracy, or just dynamics?** Both. Payload deflects the arm (compliance), so wrong payload means wrong deflection compensation and worse accuracy — especially at reach and speed. It also corrupts the torque-based collision-detection and force estimate, which is a safety issue on cobots. 
Run the OEM load-identification routine whenever the end-effector or part changes. **My calibration residual is tiny but the robot is still inaccurate on new points. What happened?** Classic over-fitting. You fit unobservable parameters that soaked up measurement noise — great training residual, terrible generalization. Check the identification Jacobian's condition number (should be tens to low hundreds, not thousands), fit only observable parameters (modified-DH + Hayati), and always validate on a hold-out pose set you didn't use to fit. **What accuracy can I realistically expect after calibration?** Kinematic calibration alone: ~0.10–0.20 mm absolute on a quality 6-axis arm (from ~0.5–2 mm uncalibrated). Adding joint-compliance (stiffness) and thermal compensation: ~0.05–0.10 mm. The repeatability floor (~0.02–0.05 mm) is the hard limit you can never beat. **Is ISO 9283 the only standard I need to know?** It's the core for static and path performance (AP, RP, AT, etc.). For service/mobile robots see ISO 18646; for collaborative-robot safety see ISO/TS 15066 and ISO 10218; for the metrology instruments, ASME B89.4.19 / ISO 10360-10 cover laser-tracker performance. For an industrial arm calibration, ISO 9283 is what you validate against. ## Changelog - **2026-06-03** — Initial publication. --- # LiDAR & Depth Cameras for Robots: The Ultimate Guide URL: https://blog.robo2u.com/posts/lidar-depth-cameras-ultimate-guide/ Published: 2026-06-02 Updated: 2026-06-20 Tags: lidar, depth-camera, time-of-flight, stereo-vision, structured-light, point-cloud, perception, slam, robotics-hardware, guide Reading time: 36 min > A working engineer's guide to 3D perception for robots: LiDAR ranging and architectures (mechanical, MEMS, flash, FMCW), stereo vs structured light vs ToF depth cameras, the numbers that matter, point clouds, SLAM, and how to pick a sensor. A camera tells you where something is in the image. A 3D sensor tells you where something is in the world. 
That difference — pixels to metres — is the entire reason a robot can drive through a doorway it has never seen, pick a part out of a bin, or stop before it amputates someone's foot. Take the depth away and you are back to a system that recognizes a coffee mug beautifully and then drives its gripper straight through the table. This guide is about the two sensor families that give robots that metric, three-dimensional view of the world: LiDAR and depth cameras. We will go through how the ranging actually works (time-of-flight, triangulation, frequency-modulated continuous wave), why a 905 nm laser and a 1550 nm laser are not interchangeable, how mechanical spinners differ from MEMS and flash and FMCW, when a stereo pair beats a structured-light projector, and what every one of those technologies does the moment you take it outside into direct sun. Then we get concrete about real hardware — Ouster, Livox, Hesai, Slamtec, Intel RealSense, Stereolabs ZED, Microsoft/Orbbec Femto, Luxonis OAK, Basler — and how to choose. **The take**: there is no "best" 3D sensor, there is only the sensor matched to your range, your lighting, your accuracy budget, and your compute budget — and most failed perception stacks are a sensor-choice mistake made before a single line of code was written. Indoors at 0.5–6 m you almost always want a depth camera; outdoors past 10 m in sun you almost always want LiDAR; the interesting engineering is in the overlap, and in fusing the two so each covers the other's blind spots. Companion reading: [robot sensors](/posts/robot-sensors-ultimate-guide/), [machine vision](/posts/machine-vision-ultimate-guide/), [mobile robots (AMR/AGV)](/posts/mobile-robots-amr-agv-ultimate-guide/), and [ROS 2](/posts/ros2-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why robots need 3D perception](#why-3d) 3. [LiDAR fundamentals: how ranging actually works](#lidar-fundamentals) 4. 
[LiDAR architectures: spinning, MEMS, flash, FMCW](#lidar-architectures) 5. [Depth-camera technologies head-to-head](#depth-tech) 6. [Stereo vision deep-dive](#stereo) 7. [Structured light](#structured-light) 8. [Time-of-flight cameras](#tof-cameras) 9. [The numbers that matter](#numbers) 10. [Point clouds and data](#point-clouds) 11. [Where each sensor fits](#where-it-fits) 12. [SLAM and sensor fusion](#slam) 13. [Selecting a 3D sensor](#selecting) 14. [Frequently asked questions](#faq) ## Key takeaways - **Depth is what turns recognition into action.** A 2D camera classifies; a 3D sensor gives you metric geometry — the (x, y, z) a planner needs to avoid obstacles and a gripper needs to grasp. See the [robot sensors guide](/posts/robot-sensors-ultimate-guide/) for where this fits in the exteroception family. - **LiDAR measures range by timing light.** Direct time-of-flight (dToF) clocks a pulse's round trip; FMCW measures a frequency beat. At `c ≈ 3×10⁸ m/s`, light covers 1 m round-trip in about 6.7 ns, so picosecond timing matters. - **905 nm vs 1550 nm is an eye-safety and range trade.** 905 nm uses cheap silicon detectors but is capped on power by retinal safety; 1550 nm is absorbed by the cornea so it tolerates far higher power (longer range in sun) but needs expensive InGaAs detectors. - **LiDAR architectures trade moving parts for cost and FoV.** Mechanical spinners give 360° but wear out; MEMS/solid-state and flash kill the spin but narrow the field of view; **FMCW** adds per-point velocity and immunity to other LiDARs and sunlight. - **Depth cameras come in three flavours.** Passive/active **stereo** (RealSense, ZED), **structured light** (original Kinect, Orbbec), and **ToF** (Azure Kinect, Femto). Stereo scales range with baseline; structured light is most accurate up close; ToF gives dense depth fast but suffers multipath. - **Stereo error grows with the square of distance.** Depth `Z = f·B / d`; error `ΔZ ≈ Z²·Δd / (f·B)`. 
Double the range and you quadruple the error unless you widen the baseline `B` or lengthen the focal length `f`. - **Structured light dies in sunlight.** Its projected pattern (a few mW) is washed out by ~1000 W/m² of solar irradiance. Indoors at 0.3–2 m it is the accuracy king; outdoors it is useless. - **The numbers that decide everything**: range, accuracy *and* precision vs distance, field of view, angular/spatial resolution, frame or point rate, minimum range, sunlight performance, and power. A sensor good at six of these and bad at the seventh you care about is the wrong sensor. - **Point clouds are expensive.** A 128-line LiDAR at ~2.6 M points/s is real bandwidth and real CPU. Voxel-grid downsampling, pass-through filters, and region-of-interest cropping are not optional — see [real-time control](/posts/real-time-control-systems-ultimate-guide/). - **Sensor choice maps to robot class.** Indoor AMR: 2D LiDAR + a depth cam. Outdoor/AV: 3D LiDAR + cameras + radar. Manipulator: wrist or overhead depth cam. Humanoid: all of the above, fused — see the [humanoid hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/). - **SLAM is the consumer of all this.** LiDAR SLAM is geometric and robust; visual SLAM is cheap and feature-rich; the strong systems fuse both plus IMU and lean on loop closure to kill drift. - **Integrate through ROS 2.** Almost every sensor here ships a ROS 2 driver publishing `sensor_msgs/PointCloud2`, `Image`, and `CameraInfo`. Budget for the driver's quirks as much as the sensor's specs — see the [ROS 2 guide](/posts/ros2-ultimate-guide/). ## Why robots need 3D perception A robot has to answer three questions before it does anything physical: *Where am I? What is around me? Where exactly is the thing I want to touch?* All three are geometry questions, and geometry needs depth. Localization and navigation need depth because a planner reasons in metres, not pixels. 
An obstacle two pixels tall could be a speck of dust on the lens or a forklift 30 m away; only range disambiguates. Manipulation needs depth because a grasp pose is a 6-DoF transform in the robot's frame — you cannot servo a gripper to a 2D bounding box. And safety needs depth because the entire concept of a "protective stop at 0.8 m" is meaningless without a metric distance. > **Rule of thumb**: if a downstream module reasons in metres — planning, grasping, collision checking, safety zones — it needs a sensor that *measures* metres, not one that infers them from appearance. ### The exteroception family 3D sensing is one branch of a robot's **exteroception** — its sensing of the external world. The full family includes contact and force sensors, proximity sensors, 2D cameras, radar, sonar, and the 3D sensors covered here. The [robot sensors guide](/posts/robot-sensors-ultimate-guide/) lays out the whole taxonomy; this article zooms into the depth-producing members. The reason 3D sensors get their own deep treatment is that depth is uniquely hard and uniquely valuable. A 2D camera is a passive, cheap, dense, high-resolution sensor — and it throws away the one dimension a robot's body lives in. Recovering that dimension is what LiDAR and depth cameras exist to do, and they do it by physically different tricks, each with a different failure mode. ### Active vs passive sensing The deepest split is **active** versus **passive**. A passive sensor (a plain camera, a stereo pair with no projector) only collects ambient light. An active sensor (LiDAR, structured light, ToF, active stereo) emits its own light and measures what comes back. Passive sensing is cheap, silent on the spectrum, and works at any range the optics allow — but it fails where the scene gives it nothing to work with (a blank white wall, a dark room). 
Active sensing carries its own illumination, so it works in the dark and on featureless surfaces — but it costs power, can interfere with copies of itself, and fights a losing battle against the sun outdoors. Almost every trade-off in this guide is a consequence of that one split. ## LiDAR fundamentals: how ranging actually works LiDAR — **Li**ght **D**etection **a**nd **R**anging — measures distance by timing or phase-tracking light it emits. Strip away the spinning and the optics and a LiDAR is a laser, a photodetector, and a very fast clock. ### Direct time-of-flight (dToF) The textbook method. Fire a short laser pulse, start a timer, wait for the reflection, stop the timer. Distance is half the round trip: ```text Range: R = (c · t) / 2 c = speed of light ≈ 2.998 × 10⁸ m/s t = round-trip time of flight Example: a target at 100 m round-trip distance = 200 m t = 200 / 2.998e8 ≈ 667 ns Timing resolution needed for 1 cm range resolution: Δt = 2 · ΔR / c = 2 · 0.01 / 2.998e8 ≈ 67 ps ``` That 67 ps figure is the whole engineering challenge of dToF: to resolve centimetres you need picosecond-class timing electronics, typically a time-to-digital converter (TDC) and avalanche photodiodes (APDs) or single-photon avalanche diodes (SPADs). It is also why LiDAR is fundamentally an *interval* measurement, not an intensity one — it does not care how bright the return is, only when it arrives, which is why it is far more robust to surface reflectivity than a camera. ### Amplitude-modulated continuous wave (AMCW / phase) Cheaper short-range LiDARs and most iToF cameras instead modulate the laser amplitude as a continuous sine wave and measure the **phase shift** between emitted and received light. 
Phase wraps every half wavelength of the modulation, which sets an unambiguous range: ```text Phase ToF: R = (c / (4π·f_mod)) · φ f_mod = modulation frequency φ = measured phase shift (radians) Unambiguous range: R_max = c / (2 · f_mod) f_mod = 20 MHz → R_max = 7.5 m f_mod = 100 MHz → R_max = 1.5 m ``` Higher modulation frequency buys precision but shrinks the unambiguous range — beyond `R_max` the phase wraps and a 9 m target reads as 1.5 m. Multi-frequency schemes (combining, say, 20 MHz and 80 MHz) recover a longer unambiguous range while keeping precision. ### FMCW: frequency-modulated continuous wave The newest production approach. Instead of pulses, FMCW sweeps the laser frequency (a chirp) and mixes the return with the outgoing light. The beat frequency encodes range, and any Doppler shift encodes **radial velocity** — you get per-point speed for free. FMCW is coherent detection, so it is nearly immune to sunlight and to other LiDARs (only light correlated with *its own* chirp produces a beat). More on this below; it is the headline architecture of Aeva and the long-range automotive players. ### The laser and the detector The emitter is a laser diode — edge-emitting or, increasingly, a **VCSEL** (vertical-cavity surface-emitting laser) array for flash and solid-state units. The detector is a photodiode: a PIN diode for cheap close range, an **APD** for sensitivity, or a **SPAD** array for photon-counting dToF (the technology behind Ouster's digital LiDAR and many automotive flash units). > **Rule of thumb**: SPAD/CMOS digital LiDAR trades the analog finesse of a tuned APD for the scaling, calibration stability, and cost curve of a semiconductor process. That bet is why Ouster's per-channel cost fell while channel counts climbed. ### 905 nm vs 1550 nm and eye safety Two wavelengths dominate, and the choice cascades through the whole sensor. 
**905 nm** sits at the edge of silicon's sensitivity, so it uses cheap silicon APDs/SPADs — the same process economics as camera sensors. The catch is eye safety: 905 nm passes through the eye and focuses on the **retina**, so Class 1 eye-safe limits cap the optical power, which caps range, especially against low-reflectivity targets in bright sun. **1550 nm** is strongly absorbed by water in the **cornea** and never reaches the retina, so eye-safe limits allow far higher optical power — roughly two orders of magnitude more — translating to longer range and better sun robustness. The price: 1550 nm is invisible to silicon, so you need **InGaAs** detectors and fibre-laser or specialized diode sources, which are expensive. This is the classic automotive long-range trade: 1550 nm for the 200 m+ highway sensor, 905 nm for everything cost-sensitive. > **Rule of thumb**: 905 nm is the cost-and-volume wavelength; 1550 nm is the range-and-sun wavelength. If your spec sheet brags about 250 m at 10% reflectivity, it is almost certainly 1550 nm. ### Beam, divergence, and the "one point is a cone" problem A laser beam is not a line; it is a cone with some **divergence** (often 1–5 mrad). At range, that cone has real width — at 1 mrad, a beam is ~10 cm wide at 100 m. This sets your effective lateral resolution and means a single "point" is actually the centroid of whatever the beam footprint hit. It also produces edge artefacts: a beam straddling a near and a far object returns two echoes, which is why **multi-return** LiDAR (reporting the strongest, last, or several returns) matters for foliage, rain, and dust. ## LiDAR architectures: spinning, MEMS, flash, FMCW Having one laser-and-detector pair only measures one direction. To build a 2D or 3D picture you must steer that beam (or many beams) across the scene. *How* you steer it is the architecture, and it dictates field of view, durability, cost, and resolution. ### Mechanical spinning The original and still the workhorse. 
A stack of laser/detector pairs (the "channels" or "lines") rotates 360° on a motor, typically at 10–20 Hz. Velodyne pioneered it; Ouster, Hesai, and RoboSense ship modern versions. You get a full 360° horizontal field of view and a vertical FoV set by the channel count and spacing (e.g. 32, 64, or 128 lines spanning ~22–45° vertical).

Strengths: full surround coverage, mature, well-understood point clouds. Weaknesses: a spinning motor is a wear item and a vibration source; the units are tall pucks; and per-unit cost historically ran into the thousands. The big shift of the last few years is **digital** spinning LiDAR (Ouster's SPAD-on-CMOS), which keeps the spin but replaces racks of analog channels with a semiconductor sensor: cheaper, more uniform, easier to calibrate.

### Solid-state and MEMS

To kill the big spinning motor, MEMS LiDAR steers the beam with a tiny **micro-mirror** that tilts on silicon hinges. There is still a moving part, but it is microscopic and sealed. The trade is field of view: a MEMS mirror sweeps a *forward* cone (often ~120° horizontal, ~25° vertical), not 360°. Livox's Risley-prism units (e.g. the Avia and HAP), which steer with small rotating prisms rather than a MEMS mirror, and many automotive forward-looking units live here; the Livox Mid-360 is the exception, covering 360° horizontally. They are cheaper, more rugged, and lower profile, at the cost of needing several to cover the surround a single spinner gives you.

Livox in particular uses a **non-repeating scan pattern**: instead of fixed horizontal lines, the beam traces a flower-like pattern that fills in coverage the longer you dwell. This gives very dense clouds with integration time but means a single-frame snapshot is sparser and non-uniform: great for mapping, more awkward for instantaneous obstacle detection.

### Flash LiDAR

No scanning at all. A single wide laser pulse floods the whole scene (like a camera flash) and a 2D SPAD/APD detector array times the return at every pixel simultaneously.
This is mechanically bulletproof (zero moving parts) and captures a full frame in one shot, ideal for fast-moving scenes. The catch is the **range-resolution-FoV** triangle: spreading finite laser energy over a wide field starves each pixel, so flash units are short-to-medium range or narrow FoV. They shine as close-range automotive corner sensors and on spacecraft (where they do terrain-relative navigation and docking).

### FMCW (and the velocity dividend)

FMCW, introduced above, is as much an architecture as a ranging method because coherent detection changes the whole sensor design. Every point carries instantaneous radial velocity (Doppler), which is transformative for tracking moving objects and for ego-motion estimation. It is nearly immune to sunlight and to other LiDARs. The downsides are cost and complexity (coherent optics and 1550 nm components are not cheap) and historically lower point rates, though that gap is closing. Aeva and a handful of automotive suppliers lead here.

### 2D vs 3D, and channel count

A **2D LiDAR** has a single beam swept in one plane, so it returns a slice (a ring of ranges at one height). This is the bread-and-butter indoor AMR/safety sensor: Slamtec RPLidar, SICK, Hokuyo. Cheap, low data rate, perfect for floor-level obstacle avoidance and 2D SLAM.

A **3D LiDAR** stacks many beams (channels/lines) vertically (16, 32, 64, or 128) to sample a volume. More channels means finer vertical resolution and denser clouds, and roughly linear cost and data-rate scaling.

| Architecture | Moving parts | Typical FoV | Range | Velocity? | Relative cost | Best for |
|---|---|---|---|---|---|---|
| Mechanical spinning | Motor (macro) | 360° H × 22–45° V | 50–250 m | No | $$–$$$ | Surround perception, AVs, mapping |
| Digital spinning (SPAD) | Motor (macro) | 360° H × 22–45° V | 50–200 m | No | $$ | Modern surround, lower cost/channel |
| MEMS / solid-state | Micro-mirror | ~70–120° H × ~25° V | 50–300 m | No | $–$$ | Forward-looking, rugged, low profile |
| Flash | None | ~30–120° H, narrow | 10–100 m | No | $$ | Close range, fast scenes, space |
| FMCW | Varies | ~60–120° forward | 200–500 m | **Yes** | $$$$ | Long-range AV, ego-motion, interference-heavy |
| 2D scanning | Motor (small) | 270–360° single plane | 8–40 m | No | $ | Indoor AMR, safety, 2D SLAM |

## Depth-camera technologies head-to-head

A depth camera produces a per-pixel range image (a "depth map") that pairs with the RGB image. There are three fundamentally different ways to compute that depth, and confusingly the marketing for all three says "3D camera."

**Stereo vision** uses two cameras a fixed distance apart and triangulates depth from the disparity between the two views, exactly as human binocular vision does. **Active stereo** adds an infrared projector that throws texture onto blank surfaces so the matcher always has something to lock onto (the Intel RealSense D400 series is active stereo; the Stereolabs ZED, by contrast, works passively).

**Structured light** projects a *known* pattern (dots, stripes, or a coded sequence) and computes depth from how the pattern deforms over the scene's geometry. The original Microsoft Kinect (v1) and Orbbec/PrimeSense sensors are the canonical examples. It is extremely accurate at close range and helpless in sunlight.

**Time-of-flight (ToF)** cameras put a flash-LiDAR-like principle into a camera: an IR emitter floods the scene and a special sensor measures round-trip time (dToF) or phase (iToF) at every pixel.
The Microsoft Azure Kinect and its successor the Orbbec Femto are iToF; some automotive and phone sensors are dToF (with SPAD arrays). | Property | Stereo (passive/active) | Structured light | ToF (iToF/dToF) | |---|---|---|---| | Principle | Triangulation from disparity | Pattern deformation | Light round-trip time/phase | | Active light? | Optional (active stereo) | Yes (IR pattern) | Yes (IR flood) | | Close-range accuracy | Good | **Excellent (sub-mm to mm)** | Good | | Long-range scaling | **Best** (widen baseline) | Poor (pattern fades) | Moderate | | Sunlight outdoors | **Works** (passive especially) | Fails | Degrades badly | | Featureless surfaces | Fails (passive); OK (active) | **Works** | **Works** | | Frame rate | High (limited by matching) | Moderate | **High** | | Resolution | High (= camera sensor) | High | Lower (sensor-limited) | | Multipath / scattering | No | Some | **Yes (its worst flaw)** | | Typical robotics use | Outdoor + indoor, AMR, AGV | Bin-picking, scanning, close manipulation | Indoor mapping, people, gestures | | Example products | RealSense D455, ZED 2i, OAK-D | Orbbec, Photoneo, older Kinect v1 | Azure Kinect, Orbbec Femto | The one-line summary: **stereo for outdoors and range, structured light for close-range accuracy, ToF for fast dense indoor depth.** The rest of this guide explains why each is true and where each breaks. ## Stereo vision deep-dive Stereo is the most camera-like depth technology, which is exactly why robotics people reach for it first: it is passive, uses ordinary image sensors, scales to long range, and works in sunlight. ### Disparity and the depth equation Two cameras separated by a **baseline** `B` see the same point at slightly different horizontal pixel positions. That difference is the **disparity** `d`. 
Depth follows from similar triangles: ```text Stereo depth: Z = (f · B) / d Z = depth (m) f = focal length (pixels) B = baseline (m) d = disparity (pixels) Example: f = 700 px, B = 0.12 m (ZED 2i-ish) d = 40 px → Z = 700 · 0.12 / 40 = 2.10 m d = 10 px → Z = 700 · 0.12 / 10 = 8.40 m d = 4 px → Z = 700 · 0.12 / 4 = 21.0 m ``` Notice that disparity falls off fast with distance: far objects have tiny disparity, and at some point the disparity drops below one pixel and you simply cannot measure it. That is the stereo range ceiling. ### Why error grows with the square of range Differentiate the depth equation and you get the single most important fact about stereo: ```text Depth error: ΔZ ≈ (Z² / (f · B)) · Δd Δd = disparity matching error (≈ 0.1–0.5 px for good matchers) Example: f = 700 px, B = 0.12 m, Δd = 0.2 px at Z = 2 m : ΔZ ≈ (4 / 84) · 0.2 ≈ 0.0095 m (~1 cm) at Z = 8 m : ΔZ ≈ (64 / 84) · 0.2 ≈ 0.152 m (~15 cm) at Z = 20 m: ΔZ ≈ (400 / 84) · 0.2 ≈ 0.95 m (~1 m) ``` Depth error scales with **Z²**. Go twice as far and your error quadruples. This is not a defect to be tuned away — it is geometry — and it dictates how you size a stereo rig: to push usable range out, you widen the baseline `B` or lengthen the focal length `f` (narrower FoV). A robot that needs accurate depth at 15 m needs a wide-baseline rig (the ZED 2i is 120 mm; long-range survey rigs go to a metre or more), not a 50 mm webcam-style pair. > **Rule of thumb**: stereo accuracy is set before runtime by baseline and focal length. No matter how good your matcher is, `ΔZ ∝ Z² / (f·B)`. Choose the rig for the range you need. ### Calibration Stereo lives and dies on calibration. You need each camera's intrinsics (focal length, principal point, distortion) and the extrinsics between them (the exact relative pose), then you **rectify** so corresponding points lie on the same image row — which turns the 2D match into a 1D search and is what makes real-time stereo feasible. 
A rig knocked out of calibration by a thermal cycle or a bump produces depth that is confidently, smoothly wrong. Factory-calibrated, rigid-baseline modules (RealSense, ZED, OAK-D) exist precisely so you do not hand-calibrate two loose cameras and chase drift forever. ### The texture problem and active IR Passive stereo needs **texture** to match — distinct features in both images. Point it at a blank white wall, a glossy panel, or a dim corridor and the matcher has nothing to correlate, so depth comes back full of holes. The fix is **active stereo**: an IR projector (a static dot pattern) sprays artificial texture onto the scene. Crucially the matcher does not need to *decode* the pattern (that is structured light's job) — it just needs the extra contrast. Intel RealSense D400 series is the canonical active-stereo line: it works in the dark, on blank walls, *and* still works in sunlight because if there is enough natural texture it falls back to passive matching. That dual nature is why active stereo is the most versatile indoor/outdoor depth camera family. ## Structured light Structured light projects a **known, coded** pattern — stripes, a pseudo-random dot cloud, or a temporal sequence of patterns — and recovers depth from how that pattern bends over the scene. Because the pattern is known, a single matched feature gives an absolute, high-precision depth, which is why structured light owns the close-range accuracy crown. ### How it achieves accuracy The geometry is triangulation again (projector and camera form the "stereo" pair, one of them replaced by a light source), but the known pattern removes the matching ambiguity that limits passive stereo. With temporally coded patterns (project N shifted patterns, decode per-pixel phase) you can hit **sub-millimetre** depth precision at 0.3–1 m. 
This is why industrial 3D scanners and high-end bin-picking sensors (Photoneo PhoXi, Zivid) are structured-light: when you need to find a 2 mm chamfer on a part in a bin, nothing else is this precise. ### Why it fails in sunlight The projected pattern is a few milliwatts of IR. Direct sunlight delivers roughly **1000 W/m²** across the spectrum, a chunk of it in the near-IR band the sensor uses. The sun simply overwhelms the projected pattern's contrast — the camera sees sun-flooded pixels, the code is unreadable, and depth collapses. No amount of clever coding beats a four-orders-of-magnitude irradiance gap. Structured light is therefore an **indoor** technology, full stop. It also degrades with multiple units in the same space (patterns interfere) unless they are time-multiplexed or use distinct codes. ### Single-shot vs multi-shot **Multi-shot** (temporal coding) is the most accurate but needs a static scene during capture — motion smears the code. **Single-shot** (a spatially coded pattern decoded from one frame, like Kinect v1's dot cloud) tolerates motion and runs at video rate but is less precise. Choose by whether your scene holds still: a scanner on a static part bin can multi-shot; a sensor on a moving conveyor must single-shot. ## Time-of-flight cameras A ToF camera is, loosely, a flash LiDAR packaged as a camera: an IR emitter floods the whole scene and a specialized 2D sensor measures the round trip at every pixel at once. The result is a dense depth image at high frame rate with no baseline-dependent error — depth is measured directly, not triangulated, so accuracy does not blow up with `Z²` the way stereo does. ### iToF vs dToF **Indirect ToF (iToF)** modulates the emitter as a continuous wave and measures **phase shift** per pixel (the AMCW math from the LiDAR section). It is the mainstream camera approach — Microsoft Azure Kinect and Orbbec Femto are iToF — giving good resolution and precision indoors at 0.5–5 m. 
Its weaknesses are phase **wrapping** (handled with multi-frequency) and sensitivity to multipath.

**Direct ToF (dToF)** times individual photons with SPAD arrays, exactly like dToF LiDAR. It is more robust to multipath and ambient light and scales to longer range, but historically at lower pixel resolution. It is the technology in phone LiDAR sensors and an increasing share of automotive flash units. The lines are blurring as SPAD pixel counts climb.

### Multipath: the ToF sensor's signature failure

ToF's worst enemy is **multipath interference**. The emitted light does not only travel straight to a surface and back; it also bounces off other surfaces and arrives late, corrupting the phase/time measurement. The textbook case is a **concave corner**: light bounces wall-to-wall before returning, and the corner reads as rounded or pushed back. Shiny floors, retroreflectors, and translucent objects produce similar errors. This is intrinsic to flood illumination and is the reason a structured-light or stereo sensor can beat a ToF sensor on a geometrically tricky scene even when the ToF sensor has better nominal precision.

### Ambient light, resolution, and frame rate

ToF cameras compete with ambient IR. Indoors they are excellent; in direct sun the IR background eats dynamic range and depth degrades sharply (better than structured light, worse than passive stereo). Resolution is sensor-limited and historically lower than RGB: the Azure Kinect's depth sensor runs up to 1024×1024 in wide-FoV mode and 640×576 in narrow-FoV mode. Frame rates, however, are high (30 fps typical, sometimes more) and latency is low, which is why ToF wins for gesture, people-tracking, and fast indoor mapping.
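The phase wrapping noted under iToF, and the multi-frequency fix for it, can be illustrated with a short sketch. It assumes ideal, noise-free wrapped readings and the 20 MHz / 80 MHz pair mentioned in the LiDAR section; the function names are illustrative and not taken from any sensor SDK. The low frequency supplies a coarse range that is unambiguous over its longer interval, and that coarse value picks the wrap count for the more precise high-frequency reading.

```python
C = 2.998e8  # speed of light in m/s, same value as the formulas in this guide


def unambiguous_range(f_mod_hz):
    """R_max = c / (2 * f_mod): distance at which the phase wraps."""
    return C / (2.0 * f_mod_hz)


def wrapped_range(true_range_m, f_mod_hz):
    """What an ideal single-frequency iToF pixel would report for a target."""
    return true_range_m % unambiguous_range(f_mod_hz)


def unwrap_with_coarse(coarse_range_m, fine_wrapped_m, f_fine_hz):
    """Use a coarse (low-frequency) range to choose the wrap count of the
    fine (high-frequency) reading. Assumes the coarse error is smaller than
    half the fine unambiguous interval."""
    interval = unambiguous_range(f_fine_hz)
    wraps = round((coarse_range_m - fine_wrapped_m) / interval)
    return fine_wrapped_m + wraps * interval


# A 9 m target at 20 MHz wraps past the ~7.5 m limit and reads about 1.5 m.
print(wrapped_range(9.0, 20e6))

# A 5 m target: the 80 MHz reading is wrapped, while the 20 MHz reading is
# coarse (here given a hypothetical 10 cm error) but unambiguous.
fine = wrapped_range(5.0, 80e6)
print(unwrap_with_coarse(5.1, fine, 80e6))
```

With this pair the combined result is still limited to the low frequency's roughly 7.5 m interval; real sensors add noise handling and confidence checks that this sketch omits.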
```text
ToF range from phase (iToF):  R = (c / (4π·f_mod)) · φ
ToF range from time  (dToF):  R = (c · t) / 2

Frame-to-depth budget at 30 fps:
  per-frame time = 1/30 s ≈ 33 ms
  iToF often captures multiple sub-frames (phase steps) within that window
  → fast motion within the 33 ms smears depth ("motion blur" in Z)
```

## The numbers that matter

Spec sheets are written to flatter. Here is the engineer's checklist: the parameters that actually decide whether a sensor works in your application, with what to watch for on each.

### Range (and at what reflectivity)

Maximum range is meaningless without a **target reflectivity**. A LiDAR rated "200 m" usually means against an 80–90% reflective target; the honest number is the range against a **10% reflective** (dark, matte) target, which can be half or less. Always ask "range at 10%." For depth cameras, range is bounded by the technology: structured light to ~2–5 m, ToF to ~5–8 m, stereo to whatever your baseline supports (5–20+ m).

### Accuracy vs precision (vs distance)

These are different and both matter. **Accuracy** is how close the mean measurement is to truth (bias); **precision** (or repeatability) is the spread of repeated measurements (noise). A sensor can be precise but inaccurate (consistent 3 cm offset) or accurate but noisy. Both degrade with distance: for stereo as `Z²`, for ToF more gently, for LiDAR roughly flat until SNR collapses. Demand the curve, not a single headline number.

### Field of view

Horizontal × vertical FoV sets how much of the world you see per frame. Wide FoV (good for obstacle awareness) trades against angular resolution and range (energy spread thinner). A 360° spinner sees everything; a forward MEMS unit sees a cone; a depth camera sees a frustum (commonly 70–90° H). Mounting a wide-FoV sensor solves "I have a blind spot" far more cheaply than adding a second narrow one.

### Resolution: angular and spatial

For LiDAR, **angular resolution** (degrees between adjacent points, e.g.
0.1–0.4° horizontal, with vertical spacing set by the channel count) determines how far away you can resolve a given object. For depth cameras, spatial resolution is the depth-map size (e.g. 640×480, 1280×720). More resolution is more detail and more compute; match it to the smallest feature you must detect at your working range.

### Frame rate / point rate

LiDAR quotes **points per second**; cameras quote **fps**. Both are throughput. A 128-line spinner at 10 Hz over ~1024 horizontal samples and dual return is on the order of:

```text
LiDAR point rate:
  points/s = channels × horizontal_samples × rotation_Hz × returns
  128 ch × 1024 az × 10 Hz × 2 returns = 2,621,440 pts/s ≈ 2.6 M pts/s

Bandwidth (XYZ + intensity, 16 bytes/point):
  2.6e6 × 16 ≈ 42 MB/s sustained
```

That is real load on your bus and CPU; see [point clouds and data](#point-clouds).

### Minimum range (the forgotten spec)

Every active sensor has a **blind zone** up close where the return saturates or the geometry breaks. Structured-light and ToF sensors often cannot measure inside 0.2–0.3 m; a wide-baseline stereo rig loses near objects because they fall outside the overlap of the two camera frustums. For a wrist-mounted manipulation camera, *minimum* range is frequently the binding constraint, not maximum: you cannot grasp what is too close to see.

### Sunlight performance

The great divider. Passive stereo: works (it loves texture and sunlight provides it). LiDAR: 905 nm degrades, 1550 nm and FMCW shrug it off. ToF: degrades significantly. Structured light: fails. If any part of your robot's life is outdoors in daylight, this single row of the spec table eliminates half the candidates before you read anything else.

### Power and thermal

LiDARs draw 8–25 W and run warm; depth cameras draw 1–5 W over USB but the IR projector and the on-board depth ASIC add heat in a sealed enclosure. On a battery robot, sensor power is a real fraction of the budget, and thermal throttling of a depth ASIC in a hot enclosure is a classic field failure.
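The point-rate arithmetic from the frame-rate subsection above can be wrapped in a small helper so candidate sensors are compared on the same basis. This is a minimal sketch: the function names are illustrative, and the 16 bytes/point figure is the XYZ-plus-intensity assumption used above; a real driver's point layout may differ.

```python
def lidar_point_rate(channels, horizontal_samples, rotation_hz, returns=1):
    """Points per second for a spinning LiDAR:
    channels x horizontal samples per turn x turns per second x returns."""
    return channels * horizontal_samples * rotation_hz * returns


def bandwidth_mb_per_s(points_per_s, bytes_per_point=16):
    """Sustained payload bandwidth in MB/s (1 MB = 1e6 bytes).
    bytes_per_point=16 is an assumed XYZ + intensity layout."""
    return points_per_s * bytes_per_point / 1e6


# The 128-channel, 1024-sample, 10 Hz, dual-return example from the text.
rate = lidar_point_rate(128, 1024, 10, returns=2)
print(rate)                                 # 2621440
print(round(bandwidth_mb_per_s(rate), 1))   # 41.9

# Three such sensors on one robot, before any filtering.
print(round(3 * bandwidth_mb_per_s(rate), 1))  # 125.8
```

Running the same helper against each candidate's datasheet values gives a quick sanity check on bus and CPU load before committing to a sensor.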
> **Rule of thumb**: pick the one or two numbers that *bind* your application (often minimum range and sunlight for manipulators; range-at-10% and angular resolution for outdoor mobile) and treat the rest as tie-breakers. A sensor strong everywhere except your binding spec is the wrong sensor. ## Point clouds and data The output of a 3D sensor is a **point cloud**: a set of (x, y, z) points, often with intensity, ring index, timestamp, or RGB. It is the universal currency of 3D perception, and it is heavy. ### Formats The common containers: **PCD** (Point Cloud Library's native format), **PLY** (interchange/scanning), **LAS/LAZ** (geospatial/survey), and in robotics the live wire format is ROS 2's `sensor_msgs/PointCloud2` — a packed binary buffer with a field descriptor. Depth cameras alternatively publish a `depth Image` (a 16-bit-per-pixel range map) plus `CameraInfo`, which you reproject to a cloud only when you need 3D — cheaper to move a depth image than a full cloud. ### Density and the data-rate problem Density is points per unit area at a given range, and it falls off with distance (the beam fan diverges). The earlier 2.6 M points/s, ~42 MB/s figure is per sensor — put three on a robot and you have an internal bandwidth and CPU problem before you have written a single perception algorithm. A naive nearest-neighbour query over a million-point cloud is murder; everything downstream assumes you have **reduced** the cloud first. ### Downsampling, voxels, and cropping The standard toolkit, in order of how often you reach for it: - **Pass-through / ROI crop** — discard points outside a box of interest (e.g. ignore everything above 2 m or beyond 10 m). Cheapest, biggest win. - **Voxel grid** — overlay a 3D grid of cubes (e.g. 5 cm), replace all points in a cube with their centroid. Uniform density, dramatic point reduction, the default first step. 
- **Statistical outlier removal** — drop points whose neighbour distances are anomalous (kills sensor speckle and rain returns). - **Random / uniform subsampling** — when you just need fewer points and do not care which. Doing this on time is a real-time-systems problem: the filters must keep up with the sensor or your buffers back up and latency climbs — see [real-time control](/posts/real-time-control-systems-ultimate-guide/). And the perception that runs on the reduced cloud (segmentation, detection) is the bridge back to 2D methods covered in the [machine vision guide](/posts/machine-vision-ultimate-guide/), increasingly via networks that consume raw points or voxelized clouds directly. > **Rule of thumb**: never run an algorithm on the raw cloud. Crop to your region of interest, then voxel-downsample to the coarsest resolution your task tolerates. A 5 cm voxel grid often cuts points 10–50× with no loss for navigation. ## Where each sensor fits The clean way to choose is by robot class, because the class fixes range, lighting, and the task. ### Indoor AMR: 2D LiDAR (+ a depth camera) An autonomous mobile robot rolling around a warehouse or hospital wants cheap, reliable, floor-level obstacle sensing and 2D SLAM. A single 2D LiDAR (Slamtec RPLidar, SICK, Hokuyo) at 270–360°, 8–25 m, ~10 Hz does the navigation. It is blind to anything off its scan plane — a tabletop, a forklift fork at chest height — so you add a forward-facing depth camera (often a RealSense or OAK-D) to catch overhangs and low obstacles. This 2D-LiDAR-plus-depth-cam pairing is the default AMR stack; the [mobile robots guide](/posts/mobile-robots-amr-agv-ultimate-guide/) covers the navigation side in depth. ### Outdoor / autonomous vehicle: 3D LiDAR + cameras + radar Outdoors, at speed, in sun and weather, you need long range, surround coverage, and redundancy. 
A 3D spinning or solid-state LiDAR (Hesai, Ouster, RoboSense, or FMCW for the long-range channel) provides metric geometry to 100–250 m; cameras add semantics and colour; radar adds velocity and all-weather robustness. No single sensor is trusted alone — the architecture is explicitly **redundant and fused** because the failure modes are uncorrelated (LiDAR struggles in heavy rain/dust, cameras in glare/dark, radar at fine resolution). ### Manipulation: depth camera on the wrist or overhead A robot arm picking parts needs accuracy at 0.3–1.5 m, not range. Mount a depth camera either **eye-in-hand** (on the wrist, moving with the gripper for close inspection and active viewpoint selection) or **eye-to-hand** (fixed overhead, stable world frame). For high-precision bin-picking, structured light (Photoneo, Zivid) wins on accuracy; for general pick-and-place, active stereo or ToF is faster and cheaper. The grasp pose this produces feeds the kinematics and planning covered in the [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/), and the gripper choice in the grippers/end-effector literature. Minimum range and the eye-in-hand calibration (hand-eye transform) are the usual integration headaches. ### Humanoid: multi-sensor, fused A humanoid does all of the above — navigate, perceive obstacles at varying heights, and manipulate — so it carries a suite: a head depth camera or two for manipulation and near-field, often a LiDAR or 360° camera ring for locomotion awareness, plus an IMU for the balance loop. The defining problem is **fusion across a moving, articulated body**: every sensor's pose changes as the robot walks, so the transform tree (and its timing) is as critical as any single sensor. The [humanoid hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/) covers the platform; the takeaway here is that humanoids are the ultimate sensor-fusion problem, not a single-sensor problem. 
## SLAM and sensor fusion A 3D sensor produces geometry in its own frame. **SLAM** (Simultaneous Localization And Mapping) is what turns a stream of those frames into a consistent map and a robot pose within it. It is the dominant consumer of LiDAR and depth data on mobile robots. ### LiDAR SLAM Geometric and robust. Algorithms like LOAM/LIO-SAM (LiDAR-inertial) and point-to-plane ICP variants register successive scans by matching geometry — edges, planes, surfaces. LiDAR SLAM is accurate, works in the dark, and is largely lighting-independent, which is why it dominates outdoor and large-scale mapping. Its weaknesses are geometrically degenerate environments (a long featureless corridor or tunnel where every scan looks the same) and the cost/bulk of the sensor. ### Visual SLAM Cheap and feature-rich. ORB-SLAM, VINS-Fusion, and similar track visual features (or direct pixel intensities) across frames, often fused with an IMU (visual-inertial odometry). Cameras are cheap, light, low-power, and carry semantics LiDAR cannot. The weaknesses mirror cameras': they fail in the dark, in low texture, and under rapid lighting change, and monocular visual SLAM has an inherent **scale ambiguity** (you do not know absolute metres without a second camera, a depth sensor, or an IMU to anchor scale). ### Fusion and loop closure The strong systems fuse: LiDAR for metric geometry, camera for texture/semantics, IMU for high-rate motion between frames. Fusion fills each sensor's blind spots — the IMU bridges the gap when LiDAR sees a featureless wall; the camera resolves which way a symmetric corridor actually goes. Every SLAM system fights **drift**: small per-frame errors accumulate into a map that bends. **Loop closure** is the fix — recognizing a previously visited place and adding a constraint that snaps the accumulated error back into consistency. 
Reliable loop closure (visual bag-of-words, LiDAR scan-context descriptors) is what separates a map that closes neatly when you return to the start from one that shows two offset copies of your office. The pose estimate this produces feeds straight into the planner — see the [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/). > **Rule of thumb**: odometry tells you how far you have moved; loop closure tells you where you actually are. A SLAM system without robust loop closure is just dead reckoning with extra steps. ## Selecting a 3D sensor Choose in this order — each criterion eliminates candidates before the next: **range** → **lighting** → **accuracy** → **field of view** → **budget** → **integration**. ### The decision flow 1. **Range and minimum range.** Indoor close (0.3–2 m)? Depth camera. Indoor mid (2–8 m)? Depth camera or 2D LiDAR. Outdoor or beyond 10 m? LiDAR. Check the *minimum* range against your closest target. 2. **Lighting.** Any direct sun? Eliminate structured light immediately; favour passive/active stereo or 1550 nm/FMCW LiDAR. Dark or featureless indoors? Eliminate passive stereo; use active stereo, ToF, or LiDAR. 3. **Accuracy.** Sub-mm at close range for inspection/bin-picking? Structured light. Centimetres for navigation? Almost anything. Remember stereo's `Z²` error growth. 4. **Field of view.** Need 360°? Spinning LiDAR or a camera ring. A forward cone is enough? MEMS LiDAR or a single depth camera. 5. **Budget and power.** 2D LiDAR and depth cameras are cheap and low-power; 3D and FMCW LiDAR are not. 6. **Integration.** A ROS 2 driver, good documentation, and a stable point-cloud timestamp are worth more than 5% on any spec. 
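The first four elimination steps of the decision flow above can be sketched as a small filter. This is an illustrative sketch only: the function name `shortlist_sensors`, the technology labels, and the treatment of the 8–10 m gap (lumped in with "indoor mid") are assumptions, and the camera-ring option for 360° coverage is left out for brevity.

```python
# Hypothetical sketch of the range -> lighting -> accuracy -> FoV eliminations.
DEPTH_CAMERAS = {"passive stereo", "active stereo", "ToF", "structured light"}
LIDARS = {"2D LiDAR", "3D LiDAR"}


def shortlist_sensors(max_range_m, direct_sun=False, dark_or_featureless=False,
                      sub_mm=False, needs_360=False):
    """Return the technologies that survive each elimination step, sorted."""
    # 1. Range: close work -> depth cameras; mid -> depth camera or 2D LiDAR;
    #    beyond 10 m -> LiDAR. (Assumption: 8-10 m is treated as "mid".)
    if max_range_m <= 2.0:
        candidates = set(DEPTH_CAMERAS)
    elif max_range_m <= 10.0:
        candidates = DEPTH_CAMERAS | {"2D LiDAR"}
    else:
        candidates = set(LIDARS)

    # 2. Lighting: direct sun eliminates structured light;
    #    dark or featureless scenes eliminate passive stereo.
    if direct_sun:
        candidates.discard("structured light")
    if dark_or_featureless:
        candidates.discard("passive stereo")

    # 3. Accuracy: sub-mm work leaves structured light only.
    if sub_mm:
        candidates &= {"structured light"}

    # 4. Field of view: 360 degrees leaves LiDAR (camera rings not modelled).
    if needs_360:
        candidates &= LIDARS

    return sorted(candidates)


if __name__ == "__main__":
    print(shortlist_sensors(1.0, sub_mm=True))
    print(shortlist_sensors(5.0, dark_or_featureless=True))
    print(shortlist_sensors(1.0, direct_sun=True, sub_mm=True))
```

An empty result, as in the last call, is the useful case: it flags requirements that conflict (sub-mm accuracy in direct sun) before any money is spent. Budget and integration are left to human judgment.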
### Real-product comparison

Representative 2026 products with defensible figures (always confirm against the current datasheet — variants differ):

| Product | Type | Range (typ) | FoV (H×V) | Resolution / channels | Rate | Notes |
|---|---|---|---|---|---|---|
| Slamtec RPLidar A3 | 2D LiDAR | ~25 m | 360° | 0.225° ang. | 10–20 Hz | Cheap indoor AMR / 2D SLAM |
| Ouster OS1-128 | 3D digital spinning | ~120–170 m | 360° × 45° | 128 ch | 10–20 Hz | SPAD/CMOS, ~2.6 M pts/s |
| Hesai Pandar XT32 | 3D spinning | ~120 m | 360° × 31° | 32 ch | 10–20 Hz | Robust mid-range mobile |
| Livox Mid-360 | Solid-state (prism) | ~40–70 m | 360° × 59° | non-repeating | 10 Hz | Low cost, dense w/ integration |
| Intel RealSense D455 | Active stereo | 0.6–6 m | ~87° × 58° | up to 1280×720 depth | up to 90 fps | Works in sun + dark; 95 mm baseline |
| Stereolabs ZED 2i | Passive stereo | 0.3–20 m | ~110° | up to 2208×1242 | 15–100 fps | 120 mm baseline; outdoor range |
| Luxonis OAK-D Pro | Active stereo + NPU | 0.3–12 m | ~80° | 1280×800 depth | ~30–60 fps | On-board AI inference |
| Microsoft Azure Kinect / Orbbec Femto | iToF | 0.25–5.5 m | 75°×65° (narrow) to 120°×120° (wide) | up to 1024×1024 depth (wide mode) | up to 30 fps | Dense indoor depth; multipath-prone; range depends on mode |
| Photoneo PhoXi | Structured light | 0.4–2 m | scanner | sub-mm | ~few Hz | Bin-picking accuracy king |

(Figures are nominal and configuration-dependent; "range" for LiDAR is at favourable reflectivity unless noted.)

### Integration notes (ROS 2)

Nearly every sensor above ships a ROS 2 driver. The patterns to know:

- **LiDAR** publishes `sensor_msgs/PointCloud2` (and often a per-point timestamp/ring field crucial for de-skewing motion). Ouster, Hesai, Livox, and Slamtec all maintain ROS 2 drivers; Livox uses its own `CustomMsg` you usually convert.
- **Depth cameras** publish a `depth Image`, a `CameraInfo`, and optionally a `PointCloud2`. The `realsense2_camera`, `zed_ros2_wrapper`, and `depthai-ros` packages are the standard wrappers.
- **Time synchronization** is the silent killer: if your LiDAR, camera, and IMU timestamps are not on the same clock (PTP/hardware sync or careful host-side stamping), fusion and SLAM degrade in ways that look like sensor noise but are really timing. Solve clocking before you blame the algorithm. - **TF tree**: every sensor needs an accurate static (or dynamic, for articulated bodies) transform to the robot base. A 2 cm or 1° error in a sensor mount becomes a systematic depth error downstream. The [ROS 2 guide](/posts/ros2-ultimate-guide/) covers the middleware, QoS, and time-handling that make or break a multi-sensor perception stack. > **Rule of thumb**: budget as much engineering time for the driver, timestamps, and TF tree as for selecting the sensor. The hardware rarely fails; the integration usually does. ## Frequently asked questions **Do I need LiDAR if I already have a depth camera?** Often no, indoors and at short range — a good active-stereo or ToF camera covers 0.3–6 m densely and cheaply. You need LiDAR when you go outdoors in sun, need range beyond ~10 m, need 360° coverage, or need lighting-independent geometry for robust SLAM. Many robots run both: LiDAR for the long/wide picture, depth cam for the close/dense one. **Why does my depth camera have holes in the depth image?** Holes mean the sensor got no usable measurement for those pixels. For passive stereo it is lack of texture (blank walls, glossy surfaces); for structured light or ToF it is sun saturation, an out-of-range surface, a specular reflection bouncing the light away, or a black/absorptive material. Active IR projection, lighting control, or a different technology fixes most of it. **905 nm or 1550 nm LiDAR — which should I buy?** For most robotics (indoor, mobile, mid-range) 905 nm is cheaper and entirely adequate. 
Choose 1550 nm when you need long range (200 m+), strong sun robustness, or higher optical power within eye-safe limits — typically automotive and outdoor long-range applications. You will pay substantially more for the InGaAs detector and laser. **What is the real difference between accuracy and precision for these sensors?** Accuracy is bias — how far the average reading is from truth. Precision is repeatability — how much repeated readings of the same point scatter. A sensor can be precise but biased (consistent 3 cm offset, correctable by calibration) or accurate but noisy (right on average, useless per-frame). Calibration fixes accuracy; averaging or a better sensor fixes precision. Specify both, versus distance. **Why is my ToF camera reading corners as rounded or pushed back?** Multipath. Light bounces between the two walls of the corner and arrives late, corrupting the per-pixel time/phase measurement. It is intrinsic to flood-illuminated ToF. Mitigations: multi-frequency capture, multipath-aware processing, or switching to structured light/stereo for geometrically tricky scenes. **Can stereo or structured light work outdoors?** Passive stereo: yes, and it often prefers sunlight because sun provides the texture it needs to match. Active stereo: yes, falling back to passive matching when the IR projector is washed out. Structured light: no — direct sun (~1000 W/m²) overwhelms the milliwatt projected pattern. ToF: degraded but sometimes usable in shade. **How far can a stereo camera actually measure?** It depends entirely on baseline `B` and focal length `f`, because `Z = f·B/d` and error grows as `Z²`. A 95–120 mm baseline module is good to roughly 6–20 m before error becomes unusable; survey rigs with metre-class baselines reach much further. There is no fixed answer — compute `ΔZ ≈ Z²·Δd/(f·B)` for your rig and your accuracy tolerance. 
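To make the stereo range answer concrete, here is a minimal sketch that evaluates `Z = f·B/d` and `ΔZ ≈ Z²·Δd/(f·B)` directly. The focal length (640 px) and the 0.1 px disparity error are assumed illustrative values, not figures from any datasheet; substitute the numbers for your own rig.

```python
import math


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth from disparity: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(z_m, focal_px, baseline_m, disparity_err_px):
    """First-order depth error: dZ ~= Z^2 * dd / (f * B)."""
    return z_m ** 2 * disparity_err_px / (focal_px * baseline_m)


def max_range_for_tolerance(tol_m, focal_px, baseline_m, disparity_err_px):
    """Invert the error formula: the range at which dZ reaches tol_m."""
    return math.sqrt(tol_m * focal_px * baseline_m / disparity_err_px)


if __name__ == "__main__":
    # Assumed rig: 95 mm baseline, 640 px focal length, 0.1 px matching error.
    f_px, b_m, dd_px = 640.0, 0.095, 0.1
    for z in (1.0, 3.0, 6.0, 10.0):
        err_cm = 100.0 * depth_error(z, f_px, b_m, dd_px)
        print(f"Z = {z:4.1f} m -> dZ ~ {err_cm:.1f} cm")
    z_max = max_range_for_tolerance(0.05, f_px, b_m, dd_px)
    print(f"5 cm tolerance reached at about {z_max:.1f} m")
```

Under these assumptions the error is roughly 6 cm at 6 m and quadruples every time the range doubles, which is why a usable-range figure only means something once an accuracy tolerance is attached to it.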
**What sensor should I put on a robot arm for picking?** A depth camera, mounted eye-in-hand (on the wrist) or eye-to-hand (fixed overhead). For precision bin-picking of small or shiny parts, structured light (Photoneo, Zivid). For general pick-and-place, active stereo (RealSense, OAK-D) or ToF. The binding spec is usually *minimum* range and the hand-eye calibration, not maximum range. **Is FMCW LiDAR worth the premium?** If you need per-point velocity (instant moving-object detection, better ego-motion), strong immunity to sunlight and to other LiDARs, and long range, yes. For an indoor AMR or a short-range manipulator, no — you are paying for capabilities you will not use. It is an automotive and long-range outdoor technology today. **How do I keep point-cloud processing real-time?** Reduce the cloud before you process it: crop to your region of interest, then voxel-downsample (a 5 cm grid commonly cuts points 10–50× for navigation), then run outlier removal. Profile against the sensor's frame period — if a filter takes longer than 1/rate, buffers back up and latency grows. See the [real-time control guide](/posts/real-time-control-systems-ultimate-guide/). **LiDAR SLAM or visual SLAM?** LiDAR SLAM is more robust and lighting-independent — use it outdoors, in the dark, or where geometry is rich. Visual SLAM is cheaper, lighter, and carries semantics — good indoors with texture and on cost/weight-constrained platforms. The best systems fuse both with an IMU and rely on loop closure. Geometrically degenerate spaces (long corridors, tunnels) hurt LiDAR SLAM and favour fusion. **Why do my fused sensors disagree even though each one is calibrated?** Almost always timing or TF. If the sensors are not on a synchronized clock, a moving robot stamps the same world point at slightly different times, and fusion smears it. Likewise a small error in the static transform between sensors becomes a systematic offset. 
Fix clocking (PTP/hardware sync) and the TF tree before suspecting the sensors — see the [ROS 2 guide](/posts/ros2-ultimate-guide/).

## Changelog

- **2026-06-02** — Initial publication.

---

# Robot Power Systems & Batteries: The Ultimate Guide

URL: https://blog.robo2u.com/posts/robot-power-batteries-ultimate-guide/
Published: 2026-05-30
Updated: 2026-06-20
Tags: robot-power, batteries, lithium-ion, lifepo4, bms, power-distribution, dc-dc, energy-density, robotics-hardware, guide
Reading time: 36 min

> A deep, practical guide to robot power systems: Li-ion vs LiFePO4 vs LiPo chemistries, cell/pack configuration, BMS, sizing for peak actuator current, 24/48V bus architecture, DC-DC, regen, charging, and safety — with worked sizing math.

Every robot is, underneath the kinematics and the perception stack, an energy-management problem. You have a finite tank of joules bolted to a moving frame, and a set of loads — actuators, compute, sensors — that drain it at a rate that swings by 20:1 between idle and a hard acceleration. The job of the power system is to deliver the peak when the peak is demanded, survive the average for hours, and never once let the bus sag below the brownout threshold of the thing controlling the motors.

Most field failures of mobile robots are not algorithm failures. They are a connector that browned out, a pack that hit over-discharge cutoff mid-task, or a BMS that tripped on overcurrent during a stall.

This guide is about that system end to end: the chemistry in the cells, how cells become packs, the BMS that keeps the pack alive, how to size the pack from the load rather than from hope, the DC bus and its distribution, DC-DC conversion, regeneration, charging, and the safety envelope you cannot violate. We will keep numbers attached to units and opinions attached to reasons.
**The take**: the battery is not a component you bolt on at the end — it is a system that should be sized from the *peak actuator current* and the *average power budget* simultaneously, and most robots are under-specified on the first and over-specified on the second. Pick the chemistry for the duty cycle, not the spec sheet's headline Wh/kg; size the pack so voltage sag under peak load never browns out your logic rail; and treat the BMS, fusing, and precharge as primary design elements, not afterthoughts. Get the bus voltage and the peak-current path right and everything downstream is easy. Get them wrong and you will chase intermittent resets forever. Companion reading: [robot actuators](/posts/robot-actuators-ultimate-guide/), [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), [mobile robots (AMR/AGV)](/posts/mobile-robots-amr-agv-ultimate-guide/), and [real-time control systems](/posts/real-time-control-systems-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [The power budget mindset](#power-budget) 3. [Battery chemistries head-to-head](#chemistries) 4. [Cells, packs & configuration](#cells-packs) 5. [Battery Management Systems (BMS)](#bms) 6. [Sizing a battery from the load](#sizing) 7. [Power distribution architecture](#distribution) 8. [DC-DC conversion & regulation](#dcdc) 9. [Regeneration & braking](#regen) 10. [Charging](#charging) 11. [Safety: thermal runaway, protection, transport](#safety) 12. [Tethered & alternative power](#tethered) 13. [Selecting & integrating a power system](#selecting) 14. [Frequently asked questions](#faq) ## Key takeaways - A robot power system has two independent design drivers: the **average power** sets your energy (Wh) and therefore runtime; the **peak current** sets your cell chemistry, conductor sizing, and BMS rating. Size both, separately. - **The load is the actuator.** A motor at stall or hard acceleration can draw 5–10× its continuous current for hundreds of milliseconds. 
Your pack, fuse, BMS, and wiring must pass that transient without sagging the bus — see [robot actuators](/posts/robot-actuators-ultimate-guide/) and [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/).
- **Li-ion NMC** (≈200–270 Wh/kg) wins on energy density and dominates legged/humanoid and weight-critical robots. **LiFePO4** (≈90–160 Wh/kg) wins on cycle life (2,000–6,000 cycles), safety, and flat discharge — the default for AMRs and AGVs that cycle daily for years.
- **LiPo pouch** cells deliver the highest C-rates (10–100C+) for drones and combat robots but are the least mechanically and thermally forgiving. **NiMH** and **lead-acid** survive only in legacy and cost-floor applications.
- A **BMS is mandatory** on any multi-cell lithium pack. It does cell balancing, over/under-voltage and overcurrent and over-temperature protection, and ideally SoC/SoH estimation and CAN/SMBus reporting. A pack without per-cell monitoring is a fire waiting for an excuse.
- **Voltage sag under peak load** is the silent killer. `V_load = V_oc − I·R_internal`. A 24 V pack with 30 mΩ internal resistance drops 3 V at a 100 A peak. That leaves 21 V from a nominal 24 V, and it dips below a 20 V undervoltage lockout as soon as the open-circuit voltage falls under 23 V late in discharge.
- **Bus voltage is a top-level architectural choice.** Higher voltage (48 V vs 24 V) means lower current for the same power, thinner cables, lower I²R loss, and smaller connectors — at the cost of more series cells and tighter safety/isolation rules. 48 V is the modern sweet spot for medium robots.
- **Fusing, precharge, and e-stop are primary design elements.** Inrush into bulk capacitance can weld contactors and trip BMSs; a precharge resistor or soft-start is not optional above a few hundred microfarads on a high-voltage bus.
- **DC-DC converters** isolate and regulate rails. Keep a clean, brownout-protected logic/compute rail separate from the noisy motor bus — a motor transient should never reset your real-time controller.
See [real-time control systems](/posts/real-time-control-systems-ultimate-guide/).
- **Regenerative braking** dumps decelerating-motor energy back into the bus. The pack absorbs it if it has headroom (SoC and charge-current limits); otherwise a **brake resistor** must burn it, or the bus voltage rises until something trips or fails.
- **Charging is CC/CV** for lithium. Opportunity charging and auto-docking define AMR uptime far more than raw pack capacity — a robot that charges 10 min/hour at a dock can run a 24/7 duty cycle on a modest pack.
- **What kills packs**: over-discharge below the floor, heat (every 10 °C above 25 °C roughly halves calendar life), overcharge, and mechanical/electrical abuse. Thermal runaway is the worst case and is self-sustaining once started — design to prevent, contain, and vent, not to extinguish.
- **Transport is regulated.** Lithium cells/packs must pass **UN 38.3** testing to ship; this is a legal gate, not a nicety, and it constrains how you package and ship spares.

## The power budget mindset

Before you pick a cell, write the power budget. Not the marketing version — the honest one, with two columns: **average power** and **peak power**, each in watts, for every load. These two numbers drive almost every downstream decision, and they pull in different directions.

**Average power** sets your energy. If your robot averages 150 W and you want 4 hours of runtime, you need 600 Wh of *usable* energy, which after derating for depth-of-discharge, aging, and converter losses means a nameplate pack closer to 1,000 Wh (the worked example under [Sizing a battery from the load](#sizing) arrives at about 1,040 Wh). Average power is dominated by whatever runs continuously: drive motors at cruise, compute, sensors, cooling.

**Peak power** sets everything about the current path. Peak power is dominated by transients: a drive motor accelerating, an arm joint lifting against gravity, a leg catching a fall. These transients are short — tens to hundreds of milliseconds — but they are brutal.
A motor that pulls 5 A continuous can pull 40–50 A at stall, and your pack, BMS, fuse, and wiring all have to pass that without complaint.

### The actuator is the load

In almost every robot, the actuators dominate both columns. A BLDC drive motor, a servo-grade joint, a harmonic-drive-geared arm axis — these are the things that turn electrons into motion, and they are the things that swing your current demand by an order of magnitude. Understand the load before you size the source. See the [robot actuators guide](/posts/robot-actuators-ultimate-guide/) for actuator types and the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/) for how a drive actually pulls current from the bus.

The key actuator fact for power design: **torque is proportional to current**, and a motor will pull whatever current it needs (up to its controller's limit) to make the commanded torque. At stall — zero speed, full torque — there is no back-EMF to limit current, so the only thing standing between the motor and a short circuit is the winding resistance and the controller's current limit. That is your peak.

```
Example peak draw for a single drive axis:
  Motor: Kt = 0.05 N·m/A, R_phase = 0.08 Ω
  Bus: 48 V
  Stall current (controller-limited): 60 A per axis
  Two drive axes accelerating together: 120 A peak from the bus
  Continuous cruise (each axis ~4 A): 8 A total
  Peak/average ratio on the motor bus: 120 / 8 = 15:1
```

That 15:1 ratio is why you cannot size a robot's power system from average power alone. The pack might only need to *store* enough for 8 A of average draw, but it must *deliver* 120 A for half a second without sagging.

### Duty cycle is the bridge

The thing that connects peak and average is **duty cycle**. A robot rarely sits at peak; it spends most of its time near average with brief excursions. RMS current — not average, not peak — is what determines pack heating: `I_rms = sqrt(mean(I²))`.
A pack that spends 5% of its time at 120 A and the rest at 8 A has an RMS current of roughly `sqrt(0.05·120² + 0.95·8²) ≈ 28 A`, which is what you size cooling and continuous cell rating against.

## Battery chemistries head-to-head

There is no universally best chemistry. There is a best chemistry *for a duty cycle*. The headline number everyone quotes — gravimetric energy density in Wh/kg — matters enormously for a flying or legged robot and barely at all for a 200 kg AGV where the battery is also useful ballast.

Here is the practical comparison. Numbers are typical commercial-cell figures as of 2026, not laboratory bests.

| Chemistry | Energy density (Wh/kg) | Energy density (Wh/L) | Nominal cell V | Cycle life (80% DoD) | Continuous C-rate | Safety | Relative cost ($/kWh) | Usable temp (discharge) |
|---|---|---|---|---|---|---|---|---|
| **Li-ion NMC** (18650/21700) | 200–270 | 550–730 | 3.6–3.7 | 500–1,500 | 1–10C | Moderate (flammable electrolyte) | 110–180 | −20 to +60 °C |
| **LiFePO4 (LFP)** | 90–160 | 220–350 | 3.2 | 2,000–6,000 | 1–5C (some 10C) | High (no thermal runaway below ~270 °C) | 90–150 | −20 to +60 °C |
| **LiPo (pouch, NMC/LCO)** | 150–250 | 300–550 | 3.7 | 200–500 | 10–100C+ | Low (mechanically fragile, swells) | 150–300 | −10 to +60 °C |
| **NiMH** | 60–120 | 140–300 | 1.2 | 500–1,000 | 1–5C | High (aqueous, non-flammable) | 250–400 | −20 to +50 °C |
| **Lead-acid (AGM)** | 30–50 | 60–110 | 2.0 | 200–500 | 0.2–1C (high peak) | High (but H₂ venting, acid) | 100–150 | −20 to +50 °C |

### Li-ion NMC: the energy-density default

NMC (nickel-manganese-cobalt) is the chemistry behind almost every weight-critical robot. At 200–270 Wh/kg it stores more per kilogram than anything else you can buy in volume, which is why humanoids, quadrupeds, and long-endurance drones live on it. The cost is moderate cycle life (500–1,500 cycles to 80% capacity) and a flammable electrolyte that will sustain thermal runaway if abused.
Real cells: **Molicel INR21700-P42A** (4200 mAh, 45 A continuous), **Samsung INR21700-50S** (5000 mAh, 25 A), **LG INR21700-M50LT** (5000 mAh, high energy, lower current). The P42A has become the default high-power 21700 for robotics because it balances ~210 Wh/kg with a genuine 45 A continuous rating. ### LiFePO4: the cycle-life and safety default LFP trades roughly 40% of NMC's energy density for two things robotics fleets care about deeply: cycle life and safety. A good LFP cell does 3,000+ cycles to 80% and will not thermally run away under normal abuse — its olivine cathode is chemically stable and does not release oxygen the way NMC does. Its discharge curve is also famously flat (≈3.2 V across most of the SoC range), which is great for the bus and terrible for SoC estimation (more on that under [BMS](#bms)). For an AMR or AGV that cycles once or twice a day for five years — that is 1,800–3,600 cycles — LFP is the obvious choice. NMC would be worn out; LFP is barely warmed up. Real cells: **CATL / EVE / Lishen prismatic LFP** (100–300 Ah prismatic cells dominate stationary and AGV packs), and cylindrical LFP in 32700 format for smaller robots. ### LiPo pouch: the high-C-rate specialist LiPo (a packaging format more than a distinct chemistry — usually NMC or LCO in a foil pouch) exists for one reason: extreme C-rate. A 10C–100C+ discharge means a small, light pack can dump enormous instantaneous current, which is exactly what a racing drone or a combat robot needs. The price is fragility: no rigid can to resist puncture, a tendency to swell when abused or aged, and a hard requirement for careful charging and storage. In a serious robot, LiPo shows up where power density (W/kg) matters more than energy density (Wh/kg) and where you accept a maintenance and safety burden. ### NiMH and lead-acid: legacy and cost-floor NiMH survives in low-cost consumer robots and where lithium's transport/safety overhead is unwanted. 
It is robust and non-flammable but heavy and self-discharges. Lead-acid persists only where weight is genuinely free (large AGVs as ballast, scrubbers) or where the cost floor and field-replaceability dominate. At 30–50 Wh/kg it is a non-starter for anything mobile and weight-sensitive, and its usable depth of discharge is shallow (50–60%) to preserve life. > **Safety rule**: Never mix chemistries, cell ages, or capacities within a series string. Cells in series must be matched — the weakest cell defines when the whole string hits cutoff, and an imbalanced string is exactly how you over-discharge or overcharge an individual cell into failure. ## Cells, packs & configuration A pack is cells wired in **series (S)** to set voltage and **parallel (P)** to set capacity and current. The shorthand is `nSmP`: a `13S4P` pack is 13 cells in series, 4 such strings in parallel. ### Series sets voltage, parallel sets capacity Each series cell adds its nominal voltage. NMC at 3.6 V nominal: a 13S pack is `13 × 3.6 = 46.8 V` nominal, with a 4.2 V/cell full charge giving `13 × 4.2 = 54.6 V` and a 3.0 V/cell floor giving `13 × 3.0 = 39 V`. This is the ubiquitous "48 V" robot/e-bike pack. LFP at 3.2 V nominal needs more cells for the same voltage: 16S LFP is `16 × 3.2 = 51.2 V` nominal — the standard "48 V" LFP configuration. Each parallel string adds capacity and current capability. Four 5,000 mAh cells in parallel give 20,000 mAh (20 Ah) and four times the current rating. 
Capacity in **amp-hours (Ah)** times pack voltage gives energy in **watt-hours (Wh)**, the number that actually matters:

```
Energy: Wh = V_nominal × Ah

Example: 13S4P of Samsung 50S (5.0 Ah, 3.6 V nominal)
  V_nominal = 13 × 3.6 = 46.8 V
  Capacity  = 4 × 5.0 = 20 Ah
  Energy    = 46.8 × 20 = 936 Wh
  Mass      = 52 cells × 0.0685 kg ≈ 3.56 kg (cells only)
  Density   = 936 / 3.56 ≈ 263 Wh/kg (cells only; pack-level ~75–85% of this after BMS, busbars, case)
```

### Cell formats: 18650 vs 21700 vs pouch

The two dominant cylindrical formats are **18650** (18 mm × 65 mm, 2.5–3.5 Ah, the legacy workhorse) and **21700** (21 mm × 70 mm, 4.0–5.0 Ah, the modern default). The 21700 packs more energy per cell and has better thermal mass and a higher current ceiling, which is why new designs default to it. Pouch cells trade the rigid can for packaging flexibility and the highest C-rates, at the cost of needing external mechanical support and swell management.

| Format | Typical capacity | Typical max continuous current | Mass | Energy/cell | Best for |
|---|---|---|---|---|---|
| 18650 | 2.5–3.5 Ah | 10–30 A | ~45–48 g | 9–12 Wh | Legacy packs, compact robots |
| 21700 | 4.0–5.0 Ah | 25–45 A | ~68–70 g | 14–18 Wh | Modern default, high-power mobile |
| Pouch (LiPo) | 1–20 Ah | 10C–100C | varies | varies | Drones, combat, high-C transients |
| Prismatic LFP | 50–300 Ah | 1–5C | 1–6 kg | 160–960 Wh | AGVs, stationary, large AMRs |

### Voltages you must track

Four voltages matter per cell, and confusing them is how packs die:

- **Nominal**: the rated average (3.6–3.7 V NMC, 3.2 V LFP). Used for labeling and energy math.
- **Charge (max)**: 4.2 V/cell NMC, 3.65 V/cell LFP. Never exceed — overcharge is a runaway path.
- **Cutoff (min)**: 3.0 V/cell NMC (2.5 V absolute), 2.5 V/cell LFP. Going below damages the cell; deep over-discharge can plate copper and create internal shorts.
- **Storage**: ~3.7–3.8 V/cell NMC (≈50–60% SoC) for minimum calendar aging.
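The per-cell voltages above scale directly to the pack, so the bus voltage window for any `nSmP` layout is a few multiplications. The sketch below is illustrative: the `pack_summary` helper and its dictionary keys are hypothetical names, and the per-cell limits are the guide's typical figures (3.6 V taken as the NMC nominal), not values from a specific datasheet.

```python
# Typical per-cell voltages from the list above (assumed, not datasheet-exact).
CELL_LIMITS_V = {
    "NMC": {"nominal": 3.6, "charge_max": 4.2, "cutoff_min": 3.0},
    "LFP": {"nominal": 3.2, "charge_max": 3.65, "cutoff_min": 2.5},
}


def pack_summary(chemistry, series, parallel, cell_ah):
    """Voltage window, capacity, and nominal energy for an nSmP pack."""
    limits = CELL_LIMITS_V[chemistry]
    v_nominal = series * limits["nominal"]   # series cells add voltage
    capacity_ah = parallel * cell_ah         # parallel strings add capacity
    return {
        "v_nominal": v_nominal,
        "v_max": series * limits["charge_max"],   # fully charged bus voltage
        "v_min": series * limits["cutoff_min"],   # bus voltage at cutoff
        "capacity_ah": capacity_ah,
        "energy_wh": v_nominal * capacity_ah,     # Wh = V_nominal x Ah
    }


if __name__ == "__main__":
    for label, args in (("13S4P NMC, 5.0 Ah cells", ("NMC", 13, 4, 5.0)),
                        ("16S1P LFP, 100 Ah cells", ("LFP", 16, 1, 100.0))):
        s = pack_summary(*args)
        print(f"{label}: {s['v_min']:.1f}-{s['v_max']:.1f} V window, "
              f"{s['v_nominal']:.1f} V nominal, {s['energy_wh']:.0f} Wh")
```

The number to carry forward is the window, not the nominal: every DC-DC converter and motor controller on the bus has to tolerate the full span from `v_min` to `v_max`.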
### C-rate C-rate normalizes current to capacity. **1C** is the current that discharges the pack in one hour: for a 20 Ah pack, 1C = 20 A. A "2C continuous, 5C peak" cell in a 20 Ah pack can sustain 40 A and burst to 100 A. C-rate is how you translate a cell datasheet into whether your pack can deliver your peak current — and it is the single most common sizing mistake, because designers size for energy (Wh) and forget to check that the same pack can deliver the peak amps. ## Battery Management Systems (BMS) A multi-cell lithium pack without a BMS is not a product; it is an incident report waiting to be filed. The BMS is the embedded system that monitors every series cell, enforces the safe operating envelope, and (on good ones) reports state over a bus. ### Cell balancing Series cells drift. Manufacturing tolerance, temperature gradients across the pack, and differing self-discharge mean that after a few cycles the cells are at slightly different SoC. Because the *weakest* cell hits cutoff first on discharge and the *strongest* hits full first on charge, an unbalanced pack loses usable capacity and risks driving individual cells outside their limits. Balancing fixes this: - **Passive balancing** bleeds charge off the highest cells through a resistor during charge. Cheap, simple, wasteful (the energy becomes heat), and slow (typically 50–200 mA of balance current). Fine for most robots. - **Active balancing** shuttles charge from high cells to low cells (capacitor or inductor based). Efficient and faster, more expensive, found in high-end and large packs. Worth it on big LFP packs where passive balancing would take days. ### Protection: the non-negotiables Every BMS must enforce, in hardware where it counts: - **Overvoltage (OV)** per cell — stops charge when any cell hits the ceiling. - **Undervoltage (UV)** per cell — disconnects load before any cell drops below the floor. - **Overcurrent (OC)** charge and discharge — trips on sustained over-limit current. 
- **Short-circuit** — a fast (microsecond-to-millisecond) hardware trip independent of the slower OC. - **Over/under-temperature** — disables charge below 0 °C (charging a cold lithium cell plates lithium metal — a runaway path) and disables everything above the cell's limit. > **Safety rule**: Charging lithium cells below 0 °C causes lithium plating and permanent damage with an internal-short risk. A BMS must block sub-freezing charge, or the pack must be heated before charging. This is a hard rule, not a guideline. ### SoC and SoH estimation - **State of Charge (SoC)** — how full, 0–100%. The simplest method is **coulomb counting** (integrate current in/out), which drifts and needs periodic recalibration at the voltage endpoints. Better BMSs fuse coulomb counting with a voltage/OCV model and, increasingly, a Kalman filter. **LFP's flat discharge curve makes voltage-based SoC nearly useless in the 20–80% band** — there is barely 0.1 V of slope across 60% of capacity — so coulomb counting carries the load for LFP, which is its real practical drawback. - **State of Health (SoH)** — capacity and internal-resistance degradation versus new. Tracked over many cycles; it tells you when a fleet pack is due for retirement (typically at 70–80% of original capacity). ### Communication A "dumb" BMS just protects and disconnects. A "smart" BMS reports cell voltages, temperatures, current, SoC/SoH, and fault state over a bus, and accepts charge/discharge enable commands. The common interfaces: - **CAN bus** — the robotics and automotive standard, deterministic, robust, integrates cleanly with motor controllers and the vehicle controller. **Orion BMS 2 / Jr** and **Daly smart BMS** with CAN are common. - **SMBus** — the laptop/portable heritage, found in smart-battery packs. - **RS-485 / UART / Bluetooth** — common on budget BMSs (Daly, JBD/JK) for configuration and telemetry. 
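Returning to the SoC estimation described above: coulomb counting with endpoint recalibration is simple enough to sketch in a few lines. This is an illustrative toy, not how any particular BMS does it. The function names, the sign convention (positive current means discharge) and the "snap to 0% or 100% at the voltage endpoints" rule are assumptions of this sketch; a real BMS would also wait for the charge current to taper or for the cells to rest before trusting an endpoint.

```python
def coulomb_count(soc: float, currents_a, dt_s: float, capacity_ah: float) -> float:
    """Integrate current samples into a state-of-charge estimate (0.0 to 1.0).

    Assumed sign convention: positive current is discharge, negative is charge.
    """
    for current in currents_a:
        delta_ah = current * dt_s / 3600.0       # amp-seconds -> amp-hours
        soc -= delta_ah / capacity_ah
        soc = min(1.0, max(0.0, soc))            # clamp to the physical range
    return soc


def recalibrate(soc: float, cell_voltages, v_full: float, v_empty: float) -> float:
    """Reset accumulated drift at the endpoints, where voltage is informative."""
    if max(cell_voltages) >= v_full:
        return 1.0
    if min(cell_voltages) <= v_empty:
        return 0.0
    return soc                                    # mid-band: keep the integral


# One hour of a steady 10 A discharge from a full 20 Ah pack, sampled at 1 Hz.
soc = coulomb_count(1.0, [10.0] * 3600, dt_s=1.0, capacity_ah=20.0)
print(f"SoC after 1 h at 10 A: {soc:.3f}")

# An LFP pack whose highest cell has reached the 3.65 V charge ceiling.
print(recalibrate(0.93, [3.40, 3.65, 3.41], v_full=3.65, v_empty=2.5))
```

The integral is only as good as the current sensor: a small constant offset accumulates without bound, which is why the recalibration step matters most on LFP, where the mid-band voltage gives no correction.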
For a robot that needs to fold pack state into its health monitoring and behave deterministically, a CAN BMS (Orion-class, or an automotive-derived unit) is worth the premium. For a tool or a simple AMR, a Daly/JBD smart BMS over UART/Bluetooth is fine. The BMS interacts with your control system, so its reporting latency and failure behavior matter — see [real-time control systems](/posts/real-time-control-systems-ultimate-guide/) for why a BMS that drops off the bus or trips silently can wreck a control loop. ## Sizing a battery from the load Now the worked method. You size from two independent constraints — energy for runtime, current for peak — and the pack must satisfy both. Then you check voltage sag, then you account for the weight spiral. ### Step 1 — Energy for runtime Start from average power and target runtime: ``` Required usable energy: E_usable = P_avg × t_runtime Example: P_avg = 150 W, t_runtime = 4 h E_usable = 150 × 4 = 600 Wh Derate to nameplate (don't use the whole pack): - Depth of discharge limit (preserve life): use 80% → ÷ 0.80 - End-of-life capacity (design for aged pack): ÷ 0.80 - DC-DC + wiring efficiency: ÷ 0.90 E_nameplate = 600 / (0.80 × 0.80 × 0.90) ≈ 1,042 Wh ``` So a 600 Wh task needs a ~1,040 Wh nameplate pack if you want it to still hit runtime when the pack is aged and you are protecting cycle life. Designers who size the nameplate to the task and skip the deratings get a robot that meets spec for three months and then quietly stops finishing its shift. ### Step 2 — Peak current from actuator stall Independently, find the worst-case instantaneous current. 
Sum the peak draws of everything that can peak simultaneously: ``` Peak bus current: 2 drive axes @ 60 A stall each = 120 A Compute + sensors (continuous) = ~6 A Worst-case simultaneous peak ≈ 126 A on a 48 V bus Check against pack C-rate: Pack = 13S4P of Molicel P42A (16.8 Ah, 45 A/cell continuous, parallel ×4 = 180 A continuous) 126 A < 180 A continuous → OK on cells But verify the BMS continuous + peak rating covers 126 A! ``` The pack's deliverable current is `cells_in_parallel × per_cell_current_limit`. A 4P arrangement of 45 A cells gives 180 A continuous — comfortably above the 126 A peak. If your energy-sized pack happens to be only 2P, you would have 90 A continuous and your 126 A peak would force the cells past their limit, sag the bus, and likely trip the BMS. **This is exactly the case where the current constraint forces a bigger pack than the energy constraint asked for** — and you take the larger of the two. ### Step 3 — Voltage sag Every cell and conductor has internal resistance. Under a current pulse, the bus voltage drops by `I × R_internal`: ``` Voltage sag: V_load = V_oc − I_peak × R_total Pack internal R (13S4P): per-cell ≈ 15 mΩ Series adds: 13 × 15 mΩ = 195 mΩ Parallel ÷ 4: 195 / 4 = 48.75 mΩ Plus wiring + connectors: ~10 mΩ R_total ≈ 59 mΩ At I_peak = 126 A: Sag = 126 × 0.059 ≈ 7.4 V V_load = 46.8 − 7.4 = 39.4 V (and lower if pack is near-empty) ``` A 7.4 V sag on a 48 V bus is survivable — but check it against the undervoltage lockout (UVLO) of your DC-DC converters and motor controllers. If your logic-rail DC-DC has a 36 V minimum input and your pack is at 80% depth of discharge (≈20% SoC, ≈44 V open-circuit, about 3.4 V/cell) when this peak hits, you are at `44 − 7.4 = 36.6 V` — uncomfortably close. Near end-of-discharge (39 V open-circuit) the same peak drops you to 31.6 V and your logic rail browns out, your controller resets, and your robot drops mid-motion.
**This is the single most common silent failure mode in mobile robots**, and it is invisible on a multimeter because it only happens during the transient. The fixes, in order of preference: lower internal resistance (more parallel cells, fatter wire, better connectors), bulk capacitance on the bus to ride through the pulse, a separate non-sagging source for logic (a small DC-DC fed before the sag point, or a dedicated logic battery), or a higher bus voltage so the same power needs less current. ### Step 4 — The weight spiral On legged, flying, and arm robots, the battery you add to extend runtime adds mass, which raises the power needed to move, which shortens runtime, which tempts you to add more battery. This is the **weight spiral**, and it has a hard limit set by your chemistry's energy density: past a point, adding battery *reduces* runtime. The escape is higher energy density (NMC over LFP), lighter structure, or accepting the runtime. Ground robots that roll mostly dodge this — rolling resistance is low and battery mass is nearly free, which is why AGVs cheerfully carry heavy LFP. ## Power distribution architecture Once you have a pack, you have to get its energy to the loads safely. This is the **DC bus** and its distribution: the busbars, the fuses, the e-stop, the precharge, and the connectors. ### The bus voltage choice Choosing the bus voltage is one of the highest-leverage decisions in the whole design, because power is `P = V × I`. For a fixed power, doubling voltage halves current, which quarters I²R loss and lets you use thinner, lighter, cheaper conductors and smaller connectors. 
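The "doubling voltage quarters I²R loss" claim is easy to check numerically. The sketch below is illustrative only: the 5 kW load and the 10 mΩ round-trip path resistance are assumed values chosen for the example, and the function names are hypothetical.

```python
def bus_current(power_w: float, bus_v: float) -> float:
    """Current needed to deliver a given power at a given bus voltage."""
    return power_w / bus_v


def conduction_loss(power_w: float, bus_v: float, resistance_ohm: float) -> float:
    """I^2 * R heating in the distribution path for that power and voltage."""
    current = bus_current(power_w, bus_v)
    return current * current * resistance_ohm


POWER_W = 5000.0       # assumed load for the comparison
R_PATH_OHM = 0.010     # assumed wiring + connector resistance (round trip)

for bus_v in (12.0, 24.0, 48.0):
    amps = bus_current(POWER_W, bus_v)
    loss = conduction_loss(POWER_W, bus_v, R_PATH_OHM)
    print(f"{bus_v:>4.0f} V bus: {amps:6.1f} A, {loss:7.1f} W lost in the path")
```

Holding the path resistance fixed understates the benefit slightly: in practice the lower current also lets you pick a lighter conductor for the same loss, which is the trade most designs actually take.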
| Bus voltage | Typical robot class | Pros | Cons | |---|---|---|---| | **12 V** | Small rovers, hobby, sensors | Ubiquitous parts, safe, simple | High current for any real power; big I²R loss | | **24 V** | Light AMRs, small arms, cobots | Common industrial standard, safe (SELV), wide part support | Still high current at multi-kW; thick cables | | **36 V** | Mid e-mobility, mid robots | Good middle ground | Less standard part ecosystem than 24/48 | | **48 V** | Medium AMRs, humanoids, quads | Below 60 V SELV ceiling, low current, dense, efficient | More series cells, precharge needed | | **>60 V (HV)** | Large AGVs, heavy arms, vehicles | Lowest current, highest power density | Crosses into hazardous-voltage territory; isolation, certified components, safety interlocks | The modern sweet spot for medium robots is **48 V**: it sits just under the 60 V DC ceiling that most safety standards treat as the boundary of Safety Extra-Low Voltage (SELV), so you avoid the heavy regulatory and isolation burden of true high-voltage systems while getting most of the efficiency benefit. A 5 kW robot draws `5000/48 ≈ 104 A` on a 48 V bus versus `5000/24 ≈ 208 A` on 24 V — the difference between 8 AWG and 4 AWG cable, and between a 120 A connector and a 250 A one. ### Busbars and conductors For high-current distribution, **copper busbars** beat cables: lower resistance, better heat dissipation, mechanical rigidity, and clean fanout to multiple loads. Size conductors for both the continuous RMS current (heating) and the peak (voltage drop). A rule of thumb for copper: ~4–6 A/mm² continuous in free air with insulation, derated in bundles or enclosures. ### Fusing Every source and major branch needs overcurrent protection. The fuse protects the *wiring* primarily (so a fault can't start a fire) and the source secondarily. 
Size the fuse so that its time-current curve rides through the legitimate peak current, while its continuous rating sits above the RMS load and below the conductor's and connector's rating: ``` Fuse sizing: I_continuous_max = 28 A (RMS, from duty cycle) I_peak = 126 A for ~0.5 s Choose a fuse whose time-current curve passes 126 A for 0.5 s but opens on a sustained fault well below the wire rating. e.g. a 100–125 A slow-blow / time-delay fuse on a circuit rated for 150 A continuous wiring. ``` Use slow-blow / time-delay fuses on motor branches (they must survive the inrush and stall transients) and fast fuses where you want quick fault isolation. Class-T or ANL fuses are common for high-current DC robot buses. ### E-stop, contactors, and the cut path A robot needs a way to remove power *now*. The e-stop chain typically drives a **contactor** (a high-current relay) that disconnects the motor bus, while ideally leaving compute powered so the robot can log the event and brake controllably. On legged and high-energy machines, cutting motor power abruptly can be more dangerous than a controlled stop, so the e-stop often commands a fast controlled brake *and* drops the contactor. Anderson Powerpole / SB connectors are the de facto standard for the high-current disconnect and battery interface in this class of robot. ### Precharge and inrush Motor controllers have large bulk capacitors on their DC input — hundreds of microfarads to millifarads. Connect a discharged capacitor bank directly across a battery and you get an inrush current limited only by parasitic resistance: hundreds to thousands of amps for a few milliseconds. That inrush welds contactor contacts, blows fuses, trips BMS overcurrent, and pits connectors. The fix is a **precharge circuit**: a resistor (and a smaller relay or MOSFET) in parallel with the main contactor that charges the bus capacitance gently before the main contactor closes.
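A precharge circuit is a plain RC charge, so its timing and resistor stress follow from two component values. The sketch below is a rough sizing aid under stated assumptions: it treats the pack as an ideal voltage source and the bus as a single ideal capacitor starting at 0 V, the function name is hypothetical, and the 4,700 µF / 48 V / 22 Ω values are illustrative.

```python
import math


def precharge(c_bus_f: float, v_bus: float, r_ohm: float,
              target_fraction: float = 0.99) -> dict:
    """Timing and resistor stress for an ideal RC precharge from 0 V."""
    tau = r_ohm * c_bus_f
    return {
        "tau_s": tau,
        "i_initial_a": v_bus / r_ohm,                 # worst-case current at t = 0
        "p_peak_w": v_bus ** 2 / r_ohm,               # instantaneous resistor power at t = 0
        "t_target_s": -tau * math.log(1.0 - target_fraction),
        "e_resistor_j": 0.5 * c_bus_f * v_bus ** 2,   # heat per full charge (upper bound)
    }


result = precharge(c_bus_f=4700e-6, v_bus=48.0, r_ohm=22.0)
for key, value in result.items():
    print(f"{key:>13}: {value:.3f}")
```

Two things fall out of this. The time to 99% is about 4.6 time constants, which is where the usual "wait roughly 5τ" rule comes from. And the resistor dissipates about as much energy as ends up stored in the capacitor on every precharge, regardless of its resistance, so its pulse-energy rating matters more than its continuous wattage.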
``` Precharge: C_bus = 4,700 µF, V_bus = 48 V Precharge resistor R = 22 Ω Initial precharge current = 48 / 22 ≈ 2.2 A (safe) Time constant τ = R·C = 22 × 0.0047 = 0.103 s Wait ~5τ ≈ 0.5 s, bus reaches ~99% of V, then close main contactor. ``` > **Safety rule**: Above a few hundred microfarads of bus capacitance on a 24 V+ bus, treat precharge as mandatory. Hot-plugging a high-voltage pack into uncharged bulk capacitance is how you weld a contactor closed — which then *cannot* open on the next e-stop. ## DC-DC conversion & regulation The pack gives you one sagging, noisy voltage. The robot needs several clean ones: 5 V and 3.3 V for logic, 12 V for sensors and fans, maybe 19–24 V for a compute module, and a high-current motor rail. **DC-DC converters** make those rails. ### Buck, boost, and buck-boost - **Buck (step-down)** — the workhorse. Efficiently drops the bus to a lower rail (48 V → 12 V, 24 V → 5 V). Efficiencies of 90–97% are routine. - **Boost (step-up)** — raises voltage; used when a rail must exceed the (sagging) bus or to stabilize a falling pack voltage. - **Buck-boost / SEPIC** — maintains a regulated output whether the input is above or below it. Useful for a logic rail that must stay at 12 V even as a 3S pack sags from 12.6 V to 9 V across discharge. ### Point-of-load and rail topology Modern practice is **point-of-load (POL)** regulation: distribute one fairly high intermediate voltage (the bus or a 12/24 V intermediate) and place small, efficient buck converters right next to each load that needs a specific rail. This minimizes the current in the distribution wiring and keeps high-current low-voltage runs short. TI (the TPS family) and Vicor (their bricks and ChiP/PI modules) are common silicon and module vendors; Vicor in particular is favored where power density and isolation matter at high power. ### Isolation An **isolated** DC-DC has no electrical connection between input and output (a transformer in between). 
You want isolation when you need to break ground loops, when one side is a hazardous voltage and the other is touch-safe, or when noise on the motor ground must not couple into sensitive analog/logic ground. Non-isolated buck/boost is cheaper and more efficient and is fine when input and output share a ground reference safely. ### Keep logic and motor rails apart This is the rule that prevents the most field failures: **the logic/compute rail must not brown out when the motors transient.** A motor acceleration sags the bus (we computed 7.4 V earlier); if your compute board draws straight from that bus through a converter with a high UVLO, the sag resets your controller mid-motion. Defend against it with: - A DC-DC for logic with a low UVLO and enough input bulk capacitance to ride through the sag. - A separate feed for logic taken upstream of the worst sag, or even a small dedicated logic battery / supercap holdup. - Explicit **brownout protection**: a supervisor that detects the rail dipping and either asserts a clean reset or, better, holds the rail up through a holdup cap long enough for the transient to pass. A real-time controller that resets because a motor sneezed is unacceptable — see [real-time control systems](/posts/real-time-control-systems-ultimate-guide/) and the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/) for why the drive's DC-link behavior and the controller's power integrity are tightly coupled. ## Regeneration & braking When a motor decelerates — slowing a drive wheel, lowering an arm against gravity, a leg absorbing impact — it acts as a generator. The kinetic or potential energy has to go somewhere. Where it goes is a design decision with safety consequences. ### Regen into the pack A motor controller doing field-oriented control can run current *backward*, pushing energy from the decelerating motor into the DC bus, where the battery absorbs it as charge. 
This is **regenerative braking**, and it recovers real energy — on a heavy AMR doing frequent stops, regen can return 5–20% of drive energy. See the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/) for how the drive sources reverse current. The catch: the pack must be *able* to accept the charge. Two limits bite: - **SoC headroom** — a full pack cannot absorb more charge. Regen into a 100% pack pushes the bus voltage up until the BMS trips on overvoltage (or worse). - **Charge-current limit** — cells accept charge more slowly than they deliver discharge, and lithium charge current must be limited (especially when cold). The BMS enforces a charge-current ceiling; exceed it and you trip. ``` Regen energy from a stop: AMR mass m = 120 kg, decelerating from v = 1.5 m/s to 0 KE = ½·m·v² = 0.5 × 120 × 1.5² = 135 J Over a 0.5 s stop → P_regen ≈ 270 W into the bus On a 48 V bus → ~5.6 A of regen current the pack must accept (or burn). ``` ### The brake resistor When the pack cannot or should not absorb the regen energy, it must be burned as heat in a **brake (dump) resistor** switched across the bus by a "brake chopper" — a transistor that PWMs the resistor to clamp the bus voltage at a set ceiling (say 56 V on a 48 V system). This is mandatory on: - High-inertia or gravity-loaded axes (an arm that can be back-driven, a hoist). - Systems that may regen into a full or cold pack. - Any drive where uncontrolled bus voltage rise could exceed component ratings. > **Safety rule**: If a fault disconnects the pack (e-stop, BMS trip, blown fuse) while a motor is still spinning and regenerating, the regen energy has nowhere to go and the bus voltage spikes — potentially destroying the controller. A brake resistor on the controller side of the disconnect, or a controller that detects bus overvoltage and stops actively braking, is how you survive this. ## Charging A robot that charges badly has poor uptime regardless of pack size. 
Charging strategy is as much a part of the power system as the pack. ### CC/CV is the lithium charge profile All lithium chemistries charge with **constant current, then constant voltage (CC/CV)**: 1. **CC phase** — charge at a constant current (e.g. 0.5C) until cell voltage reaches the max (4.2 V NMC / 3.65 V LFP). This is most of the energy and most of the time. 2. **CV phase** — hold the voltage at the max and let current taper. Charge is complete when current falls to ~0.05C (the termination current). Faster charging means higher CC current, which means more heat and more stress; **1C charging is aggressive but common, 0.5C is gentle, and anything above 1–2C demands active cooling and shortens life.** Charging below 0 °C is forbidden without heating. ### Opportunity charging and uptime For an AMR fleet, the metric that matters is *availability*, and the lever is **opportunity charging**: instead of a long charge after a full discharge, the robot tops up briefly and frequently at a dock between tasks. A robot that grabs 10 minutes of 1C charge every hour can sustain a 24/7 duty cycle on a pack far smaller than one sized for a full shift. This is why AMR fleets favor LFP — its cycle life shrugs off the thousands of partial cycles that opportunity charging creates, where NMC would wear out. See the [mobile robots (AMR/AGV) guide](/posts/mobile-robots-amr-agv-ultimate-guide/) for how charging strategy shapes fleet sizing and throughput. ### Contact vs inductive - **Contact charging** — exposed contacts (often spring pins or a blade) mate with a dock. Cheap, efficient (>95%), high current, and the dominant method for AMRs. Needs reliable alignment and contact cleaning; arcing and wear are the failure modes. - **Inductive (wireless)** charging — no exposed contacts, so no wear, arcing, or ingress path. Lower efficiency (85–92%), lower power density, more expensive, and adds an air gap that complicates alignment. 
Worth it in wet, dusty, or hygienic (food, pharma) environments where exposed contacts are a liability. ### Hot-swap and docking For robots that cannot afford charge downtime at all, **hot-swap** packs let an operator (or a robot) exchange a depleted pack for a charged one in seconds. This needs: a connector rated for blind-mate and inrush, a way to keep the robot's logic alive during the swap (a small holdup battery or supercap), and packs with onboard BMS so each pack is independently safe. Auto-docking — the robot drives itself onto a charger — is what makes fleets truly autonomous; the docking mechanism's reliability sets the fleet's effective uptime. ## Safety: thermal runaway, protection, transport Lithium packs store a lot of energy in a small, flammable package. Respect that and the failure modes are manageable; ignore it and you get a fire that you cannot put out. ### Thermal runaway **Thermal runaway** is the worst case: a cell's internal temperature rises (from overcharge, internal short, external heat, or mechanical damage) to where exothermic reactions become self-sustaining, generating more heat than can be dissipated. The cell vents, then ignites, and the heat can propagate cell-to-cell through the pack — **propagation** — turning one bad cell into a pack fire. Key facts: - NMC runs away around 150–210 °C and releases oxygen from its cathode, so it sustains its own combustion. LFP is far more stable (onset ~270 °C, little oxygen release) — this is LFP's headline safety advantage. - Once started, a lithium fire is largely self-oxidizing; you cannot smother it. The strategy is **prevent** (don't abuse), **contain** (cell spacing, intumescent barriers, steel partitions to stop propagation), and **vent** (let gas and heat escape away from people and electronics). > **Safety rule**: Design the pack to *contain* a single-cell failure without propagating to neighbors — physical spacing, thermal barriers between cells, and a vent path. 
Assume one cell *will* fail eventually; the design question is whether it takes the pack and the robot with it. ### What kills packs (and how to not do it) In rough order of how often they kill robot packs: - **Over-discharge** — running a cell below its floor. Causes copper dissolution and internal shorts. Prevented by BMS UV cutoff and by not designing the robot to run the pack to empty. - **Heat** — the slow killer. Every ~10 °C above 25 °C roughly halves calendar life. A pack that lives at 45 °C ages ~4× faster than one at 25 °C. Cool the pack, and never charge a hot pack. - **Overcharge** — exceeding the per-cell max. A direct runaway path; prevented by BMS OV cutoff and a charger that respects CV termination. - **Cold charging** — charging below 0 °C plates lithium metal, permanently reducing capacity and creating internal-short risk. - **Mechanical/electrical abuse** — puncture, crush, external short. Prevented by mechanical protection, fusing, and short-circuit-rated BMS. ### Fusing and protection layers Defense in depth: cell-level (BMS short-circuit and OC trip), pack-level (a main fuse sized to the wiring), and branch-level (per-branch fuses so one shorted load doesn't take the whole bus down). The BMS protects the cells; the fuse protects the wiring; the contactor provides the commanded disconnect. None of the three replaces the others. ### Transport and UN 38.3 Shipping lithium cells and packs is legally regulated. **UN 38.3** is the United Nations testing standard (altitude, thermal cycling, vibration, shock, external short, impact, overcharge, forced discharge) that every lithium cell and pack must pass to be shipped by air, sea, or road. There are also state-of-charge limits for air freight (typically ≤30% SoC for standalone packs) and packaging/labeling requirements. 
This is not optional and it constrains how you ship spares, returns, and product — design for it early, because a pack that can't pass UN 38.3 is a pack you can't sell or service across borders. ## Tethered & alternative power Not every robot should carry its energy. Sometimes the right power system is no battery at all, or a battery plus something else. ### AC mains for fixed arms A stationary industrial arm bolted to a floor has no reason to carry a battery. It runs off **AC mains**, rectified to a DC bus that feeds the servo drives. This gives effectively unlimited energy, no weight penalty, and no charging logistics. The trade is a tether (the power cable) and the need for mains-grade safety, isolation, and possibly three-phase input for big arms. See the [industrial robot arms guide](/posts/industrial-robot-arms-ultimate-guide/) for how mains-fed servo drives and their shared DC bus (with regen sharing between axes) are architected. ### Power-over-tether A tethered mobile robot (inspection crawlers, ROVs, some drones) can take power down a cable instead of carrying a battery. Sending power at high voltage down a thin tether and stepping it down at the robot minimizes tether weight and I²R loss — the same `P = V·I` logic as the bus-voltage choice, applied to a long thin conductor. The tether buys unlimited runtime at the cost of range, snag risk, and the mass/drag of the cable itself. For an ROV at depth or a drone that must loiter for hours, the tether wins decisively. ### Fuel cells and range extenders **Hydrogen fuel cells** offer high energy density by mass (the fuel is light) and fast refueling, which is attractive for long-endurance outdoor robots and some heavy AGVs. The catch is system complexity, cost, hydrogen logistics, and poor transient response — a fuel cell can't follow a fast load step, so it is always paired with a buffer battery or supercap that handles the peaks while the fuel cell supplies the average. 
That hybrid (fuel cell for average power, battery/supercap for peaks) is the practical architecture. ### Supercapacitors **Supercapacitors** store little energy (5–10 Wh/kg) but deliver and absorb enormous power (thousands of W/kg) over hundreds of thousands of cycles with no chemical wear. In a robot they shine as a **peak buffer**: parallel a supercap bank across the bus and it sources the brief stall/acceleration peaks and absorbs regen spikes, letting a smaller, lower-C battery handle the average. This decouples the energy sizing from the power sizing — exactly the tension we opened with — and is increasingly common on legged robots whose peak-to-average ratio is brutal. ## Selecting & integrating a power system Pull it together with an ordered method and a worked example. ### The selection order 1. **Write the power budget** — average and peak watts for every load, plus duty cycle. (Section 2.) 2. **Pick the bus voltage** — from peak power and the SELV/efficiency tradeoff. 24 V light, 48 V medium, HV only when forced. (Section 7.) 3. **Pick the chemistry** — from duty cycle and weight sensitivity: NMC for weight-critical and moderate cycling, LFP for high-cycle/safety-critical fleets, LiPo only for extreme C-rate. (Section 3.) 4. **Size the pack** — take the *larger* of the energy-driven and current-driven sizes; check voltage sag against every UVLO. (Section 6.) 5. **Spec the BMS** — cell count, continuous + peak current, balancing type, comms (CAN for integrated robots). (Section 5.) 6. **Design distribution** — busbars, fusing, e-stop/contactor, precharge, connectors. (Section 7.) 7. **Spec DC-DC and rail isolation** — keep logic brownout-proof and separate from the motor bus. (Section 8.) 8. **Handle regen and charging** — brake resistor if needed; charge profile and docking strategy. (Sections 9–10.) 9. **Close the safety loop** — propagation containment, fusing layers, UN 38.3 for transport. (Section 11.) 
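Step 4 of that list, taking the larger of the energy-driven and current-driven sizes, can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, the derating defaults are the 80% DoD / 80% end-of-life / 90% conversion figures used in the sizing section, and the cell in the example (5 Ah, 3.6 V, 10 A continuous) is an invented energy-type cell chosen to show the current constraint winning.

```python
import math


def size_parallel(e_usable_wh: float, i_peak_a: float, series: int,
                  cell_v_nominal: float, cell_ah: float, cell_i_cont_a: float,
                  dod: float = 0.80, eol: float = 0.80,
                  conversion_eff: float = 0.90) -> dict:
    """Pick the parallel count that satisfies BOTH energy and peak current."""
    e_nameplate_wh = e_usable_wh / (dod * eol * conversion_eff)
    string_wh = series * cell_v_nominal * cell_ah      # energy of one series string
    p_energy = math.ceil(e_nameplate_wh / string_wh)   # strings needed for runtime
    p_current = math.ceil(i_peak_a / cell_i_cont_a)    # strings needed for peak amps
    parallel = max(p_energy, p_current)
    return {
        "e_nameplate_wh": e_nameplate_wh,
        "p_energy": p_energy,
        "p_current": p_current,
        "parallel": parallel,
        "limited_by": "current" if p_current > p_energy else "energy",
        "pack_wh": parallel * string_wh,
    }


# 600 Wh usable, 126 A peak, 13S of a hypothetical 5 Ah / 10 A energy cell.
result = size_parallel(e_usable_wh=600.0, i_peak_a=126.0, series=13,
                       cell_v_nominal=3.6, cell_ah=5.0, cell_i_cont_a=10.0)
for key, value in result.items():
    print(f"{key:>15}: {value}")
```

With these assumed numbers the energy constraint asks for 5 strings and the current constraint asks for 13, so the pack ends up carrying far more energy than the runtime needs. That gap is the signal to look at a higher-power cell or a supercap buffer before accepting the bigger pack. The function does not check voltage sag or BMS ratings; those remain separate steps.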
### Worked example: a 4-hour, 150 W AMR with peaky drive Bringing the earlier numbers together into one decision: ``` LOADS Average power: 150 W (compute 60 W, sensors 20 W, drive avg 70 W) Peak power: ~6 kW (two drive axes @ 60 A on 48 V during accel) Target runtime: 4 h Duty cycle: drive at peak <5% of time Weight: ground robot, rolling — battery mass nearly free BUS VOLTAGE 6 kW peak → 24 V would mean 250 A (impractical cabling). Choose 48 V → 125 A peak, 8 AWG-class cabling, SB175 connectors. CHEMISTRY Daily cycling for years, opportunity-charged, safety-sensitive site. → LiFePO4. Cycle life and safety beat NMC's density here. PACK SIZING Energy: E_usable = 150 × 4 = 600 Wh E_nameplate = 600 / (0.85 DoD × 0.85 EoL × 0.90 conv) ≈ 922 Wh (LFP tolerates deeper DoD, so 85% used.) Current: peak 125 A; choose cells/parallel so continuous ≥ 125 A. Config: 16S (51.2 V nominal LFP) ; pick a cell + P count giving ~1,000 Wh and ≥125 A continuous → e.g. 16S2P of 100 Ah-class? Too much energy. Better: 16S of a high-power 32700 LFP (6 Ah, 18 A) → need 7P for 126 A → 16S7P (42 Ah), ~2,150 Wh: more than double the energy needed. RESOLUTION: current constraint (need 126 A) drives a wider pack than the 922 Wh energy constraint. Either accept the larger pack (16S7P, ~2.15 kWh, to hit both), or add a SUPERCAP buffer to cover the <5% peak and size cells for the 28 A RMS — then a 16S3P of 32700 (18 Ah, ~920 Wh, 54 A cont.) essentially meets the energy target and covers RMS, and the supercap covers the 126 A bursts. The supercap route cuts the cell count from 112 to 48, saving over half the pack mass/cost here. VOLTAGE SAG (with supercap on bus): negligible during burst — supercap sources it. Without supercap, check 16S LFP sag at 126 A. BMS: 16S LFP, CAN-reporting (Orion Jr or Daly CAN), ≥60 A cont / 150 A peak rating, passive balance, temp sensors, <0 °C charge inhibit + heater enable. DISTRIBUTION: 16S LFP main contactor + 100 A class-T fuse + 22 Ω precharge; SB175 battery connector; busbar fanout; e-stop drops contactor + commands controlled brake.
DC-DC: 48 V → 19 V (compute, isolated, low UVLO + holdup cap), 48 V → 12 V (sensors/fans), 48 V → 5 V (logic). Logic fed with brownout supervisor; supercap holds bus during peaks. REGEN: Light (rolling robot). Pack accepts most regen; brake chopper clamps bus at 58 V as backstop for full/cold pack. CHARGING: Contact dock, 0.5C opportunity charge, auto-docking. LFP cycle life absorbs the partial cycles. ``` That example shows the central tension in action: the **energy** constraint wanted ~922 Wh, but the **peak-current** constraint wanted enough parallel cells to deliver 126 A — and reconciling them either inflates the pack or argues for a supercap buffer. There is no spec sheet that resolves this for you; it falls out of the budget. ### Final comparison: matching architecture to robot class | Robot class | Chemistry | Bus | Notable power-system feature | |---|---|---|---| | **Drone / UAV** | LiPo (high C) | 22–52 V | Power density rules; minimal protection mass; tight thermal/charge discipline | | **AMR / AGV** | LiFePO4 | 24–48 V | Opportunity/auto-charging; long cycle life; regen on stops; fleet uptime focus | | **Humanoid / quadruped** | Li-ion NMC | 48 V | Energy density rules; supercap peak buffer; brutal peak/average; weight spiral | | **Industrial arm (fixed)** | None (AC mains) | rectified DC bus | Tethered; shared DC bus + regen sharing across axes; mains safety | | **Combat / racing** | LiPo (very high C) | 22–48 V | Extreme peak current; minimal mass; accepted abuse and short life | | **Inspection crawler / ROV** | Tether or small Li-ion | varies | Power-over-tether for unlimited runtime; HV down a thin tether | Match the architecture to the class first, then refine with the worked numbers. The robots that fail in the field are almost never the ones that did this budgeting; they are the ones that picked a pack by its Wh rating and discovered the peak-current and voltage-sag truths the hard way. 
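The voltage-sag check that paragraph warns about takes only a few lines to run against every UVLO in the system. The sketch below reuses the 13S4P figures from the sizing section (15 mΩ per cell, about 10 mΩ of wiring and connectors, 126 A peak, a 36 V converter minimum). The function names are hypothetical, and the model is deliberately crude: a single lumped resistance with no temperature, SoC or ageing dependence, all of which raise real-world resistance.

```python
def pack_resistance(series: int, parallel: int,
                    r_cell_ohm: float, r_wiring_ohm: float) -> float:
    """Lumped source resistance: cells in series add, parallel strings divide."""
    return series * r_cell_ohm / parallel + r_wiring_ohm


def loaded_voltage(v_oc: float, i_peak_a: float, r_total_ohm: float) -> float:
    """Bus voltage during the current pulse: V_load = V_oc - I * R."""
    return v_oc - i_peak_a * r_total_ohm


R_TOTAL = pack_resistance(series=13, parallel=4, r_cell_ohm=0.015, r_wiring_ohm=0.010)
I_PEAK_A = 126.0
UVLO_V = 36.0   # minimum input of the logic-rail DC-DC in the example

for v_oc in (46.8, 44.0, 39.0):   # nominal, partly discharged, near cutoff
    v_load = loaded_voltage(v_oc, I_PEAK_A, R_TOTAL)
    status = "ok" if v_load > UVLO_V else "BROWNOUT"
    print(f"V_oc {v_oc:4.1f} V -> {v_load:5.2f} V under load "
          f"(margin {v_load - UVLO_V:+.2f} V) {status}")
```

Running the check across the whole open-circuit range, not just at nominal, is the point: the same pulse that leaves margin on a fresh charge crosses the lockout near the bottom of the pack.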
## Frequently asked questions **Why is my robot resetting only when the motors accelerate hard?** Voltage sag. The motor transient pulls a large current, the pack's internal resistance drops the bus voltage (`V = V_oc − I·R`), and that dip crosses the undervoltage lockout of the DC-DC feeding your logic/compute. It is invisible on a multimeter because it lasts milliseconds. Fix it with lower pack/wiring resistance, bus bulk capacitance, a logic rail with a low UVLO plus holdup capacitance, or a separate non-sagging logic supply. See [Sizing](#sizing) and [DC-DC](#dcdc). **LiFePO4 or Li-ion NMC for my mobile robot?** If it cycles daily for years and weight is not critical (a rolling AMR/AGV), choose **LFP** for its 2,000–6,000-cycle life and superior safety. If weight is critical (legged, flying, humanoid) and the pack won't see thousands of cycles, choose **NMC** for its 200–270 Wh/kg. The deciding axes are cycle count and weight sensitivity, not the headline energy density alone. **Do I really need a BMS, or can I just charge carefully?** You need a BMS on any multi-cell lithium pack. Even with perfect charging, series cells drift in SoC, and the BMS is what balances them and what enforces per-cell over/under-voltage, overcurrent, and temperature limits. Careful charging cannot prevent one cell going out of bounds inside a string. A pack without per-cell monitoring is a fire risk, full stop. **What bus voltage should I use?** Driven by peak power. Under ~1 kW peak, 24 V is fine and keeps you safely in SELV territory. From ~1–6 kW, **48 V** is the sweet spot — below the 60 V hazardous-voltage line, low enough current for reasonable cabling. Above that, you may be forced into true high voltage (>60 V) with its isolation and certification burden. Higher voltage = lower current = thinner cable and lower I²R loss for the same power. 
**How do I size the fuse?** Pick a rating whose time-current curve rides through the legitimate peak current (so it doesn't nuisance-trip on a stall) and that sits below the wiring/connector continuous rating (so it protects the wire). Use the fuse's time-current curve: it must pass your peak (e.g. 126 A for 0.5 s) but open on a sustained fault below the cable rating. Use slow-blow/time-delay fuses on motor branches that see inrush; Class-T and ANL are the common high-current DC formats. See [Distribution](#distribution). **What is precharge and when do I need it?** A resistor (with a small relay/MOSFET) that gently charges the motor controller's bulk capacitance before the main contactor closes, limiting inrush. Without it, connecting a discharged capacitor bank across the pack draws hundreds of amps for milliseconds, which welds contactors, blows fuses, and trips the BMS. Treat it as mandatory above a few hundred microfarads of bus capacitance on a 24 V+ bus. **Where does regenerative braking energy go, and is it dangerous?** Into the pack as charge, if the pack has SoC headroom and the BMS's charge-current limit allows it. If the pack is full, cold, or disconnected, the energy has nowhere to go and the bus voltage rises until something trips or fails. A **brake (dump) resistor** with a chopper clamps the bus by burning excess energy as heat. Mandatory on gravity-loaded or high-inertia axes and anything that might regen into a full or disconnected pack. See [Regen](#regen). **Why does my LFP pack report inaccurate state of charge?** Because LFP's discharge curve is nearly flat — barely 0.1 V of slope across the middle 60% of capacity — so any voltage-based SoC estimate is nearly useless in that band. LFP BMSs rely on coulomb counting (integrating current), which drifts and needs periodic recalibration at the full/empty voltage endpoints. This is LFP's real practical drawback versus NMC's sloped curve. **Can I charge my robot's battery in the cold?** Not below 0 °C without heating.
Charging a sub-freezing lithium cell plates metallic lithium on the anode, permanently reducing capacity and creating an internal-short and runaway risk. A proper BMS blocks charge below 0 °C; cold-climate robots add pack heaters that warm the cells before charging. Discharge in the cold is fine (with reduced capacity and higher resistance); charge is the hard limit. **How long will my pack last?** Cycle life depends on chemistry (NMC 500–1,500, LFP 2,000–6,000 cycles to 80% at 80% DoD) and *how* you use it. Heat is the dominant aging factor — every ~10 °C above 25 °C roughly halves calendar life. Shallower depth of discharge, avoiding 100% and 0% dwell, and keeping the pack cool can multiply real-world life. Plan fleet retirement around 70–80% remaining capacity. **Do supercapacitors replace the battery?** No — they store far too little energy (5–10 Wh/kg). They *supplement* it: a supercap bank across the bus sources brief peak currents and absorbs regen spikes, letting a smaller, lower-C-rate battery handle the average. This is genuinely useful on robots with a brutal peak-to-average ratio (legged machines), where it can cut pack mass and cost meaningfully. **Why do industrial arms not have batteries?** Because they don't move themselves around — they're bolted down. They run off AC mains rectified to a DC bus that feeds the servo drives, giving unlimited energy, no weight penalty, and no charging logistics, at the cost of a tether. Mains-fed arms also share a common DC bus across axes so that one decelerating joint's regen can power another accelerating joint. See the [industrial robot arms guide](/posts/industrial-robot-arms-ultimate-guide/). ## Changelog - **2026-05-30** — Initial publication. 
--- # End Effectors & Robotic Grippers: The Ultimate Guide URL: https://blog.robo2u.com/posts/end-effectors-grippers-ultimate-guide/ Published: 2026-05-28 Updated: 2026-06-20 Tags: end-effectors, grippers, vacuum-gripper, parallel-gripper, soft-robotics, robot-hand, end-of-arm-tooling, robotics-hardware, guide Reading time: 36 min > A working engineer's guide to robotic end effectors — parallel jaw, vacuum, adaptive, soft, and dexterous hands — with real grip-force and payload numbers, the sizing math, and a selection cheat-sheet. A six-axis arm with a perfect controller and no end effector does exactly nothing useful. The end-of-arm tooling — EOAT — is where all that motion converts into work: a part picked, a box stacked, a connector mated, a weld held. Everything upstream is in service of what happens in the last 100 mm. And yet EOAT is routinely the most under-budgeted, under-engineered part of a cell, bolted on as an afterthought after the robot is already chosen. This guide is the long version. We'll go family by family — parallel jaw grippers, vacuum, angular and adaptive, magnetic and specialty, soft, and full dexterous hands — and for each give real numbers with units, real products you can buy, and opinions with reasons attached. Then we'll do the sizing math properly: required grip force as a function of friction and acceleration, vacuum force from cup area and pressure, payload with a real safety factor. The goal is that you finish able to size and select tooling for a specific part, not just recite a taxonomy. **The take**: Grasping is *not* a solved problem at the general level — there is no gripper that picks arbitrary objects from arbitrary poses reliably, which is exactly why dexterous hands stay in research labs. But the *specific* problem is almost always solved, and solved cheaply. Know your part — its mass, geometry, surface, and variability — and the gripper chooses itself. 
For 80% of industrial picks the answer is a parallel jaw gripper or a vacuum cup, and the engineering effort belongs in the fingertips and the cup selection, not in exotic mechanisms. Companion reading: [robot actuators](/posts/robot-actuators-ultimate-guide/), [robot sensors](/posts/robot-sensors-ultimate-guide/), [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/), and [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [The end effector as the business end](#business-end) 3. [Grasp fundamentals](#grasp-fundamentals) 4. [Parallel jaw grippers — the workhorse](#parallel-jaw) 5. [Vacuum & suction grippers](#vacuum) 6. [Angular, 3-finger & adaptive grippers](#adaptive) 7. [Magnetic, needle, Bernoulli & specialty grippers](#specialty) 8. [Soft & compliant grippers](#soft) 9. [Dexterous robot hands](#dexterous) 10. [Actuation & sensing in grippers](#actuation-sensing) 11. [Tool changers & multi-tool](#tool-changers) 12. [Sizing & selecting a gripper](#sizing) 13. [Integration: mounting, I/O, control](#integration) 14. [Frequently asked questions](#faq) ## Key takeaways - The end effector is where the robot meets the world. Grasping is unsolved at the general level but almost always solved for a *specific, known* part — and that's the only kind of part most cells ever see. - Two questions decide most tooling: is the part's top face flat, clean, and sealable (→ vacuum), or does it need to be gripped from the sides with a defined geometry (→ jaws)? Everything else is refinement. - **Parallel jaw grippers** are the workhorse: 2 fingers, symmetric centering, grip forces from ~20 N (small electric, e.g. Robotiq Hand-E) to several thousand newtons (large pneumatic). Electric for control and data, pneumatic for cheap speed and force. 
- **Vacuum dominates pick-and-place and logistics** because most picked objects have one flat, accessible, sealable face: cartons, sheets, bags, glass, panels. A single 60 mm cup at −60 kPa holds roughly 170 N of theoretical force; derate by 2–4× in practice.
- **Adaptive / underactuated grippers** (Robotiq 3-Finger, OnRobot, Schunk) trade peak force and stiffness for the ability to envelop varied geometry with one program. They are great for mixed parts and mediocre for high-force or high-precision work.
- **Soft grippers** (silicone bellows fingers, fin-ray, granular jamming) win on delicate, variable, or food-grade objects where compliance beats force and you can't afford to crush or scratch the part.
- **Dexterous hands** (Shadow, Allegro, humanoid hands) have roughly 11–24 DoF and cost €20k–€100k+. They are hard because of tendon routing, sensing density, control, and durability, and they exist almost entirely in research and a few humanoid programs.
- Grip force scales with the inverse of friction and grows with acceleration: a part you can hold statically at 10 N may need 40–80 N once the arm slews. Always size against the *worst* point in the trajectory.
- Electric grippers give you position, force, and current as data over a fieldbus, which is invaluable for part presence, sorting by size, and process verification. Pneumatic grippers give you on/off and brute force for less money and faster cycles.
- Tactile sensing and slip detection are maturing (GelSight-class optical tactile, capacitive arrays) but remain rare in production; most "force control" in industrial grippers is open-loop current limiting, not true closed-loop force.
- **Automatic tool changers** (ATI, Schunk SWS, OnRobot) pay off when one robot must run multiple tools per cycle or per product; they cost payload, stack height, and a few hundred milliseconds per change.
- Size payload with dynamics and a safety factor: account for the gripper's own mass and inertia at the flange, and keep a factor of 2× on grip force and ~2× on rated payload after dynamics. - Integration is mostly plumbing and protocol: ISO 9409-1 flange, the right I/O (digital, IO-Link, or fieldbus), clean dry air for pneumatics, and a controller that can command and read the tool. ## The end effector as the business end Strip the marketing and the job of an end effector is simple to state and brutal to execute: form a controlled physical connection to an object, hold it through whatever the robot does, and release it on command. That connection is a grasp, and a grasp has to survive gravity, the arm's own accelerations, process forces (insertion, deburring, the part snagging on a fixture), and time. **EOAT is the whole tool, not just the gripper.** End-of-arm tooling includes the gripper or cups, the mounting bracket and any compliance device, the fingers or fingertips, sensors, the pneumatic or electrical interface, cable management, and often a tool changer. In a real cell the gripper is maybe a third of the EOAT engineering. The fingertips — the custom jaws shaped to your specific part — are frequently where success or failure actually lives, and they're the part no catalog sells you. ### Grasping is unsolved at the general level Here is the uncomfortable truth that the humanoid hype cycle keeps eliding: there is no gripper, and no hand, that can reliably pick an *arbitrary* object from an *arbitrary* pose. Humans do it with 20-some degrees of freedom, dense tactile sensing, and a lifetime of learned manipulation priors, and even we fumble. Robots do far worse. Bin-picking of mixed, unknown objects — the canonical "general grasp" — still has failure rates that would be unacceptable in most processes without retries and recovery logic. 
What *is* solved, and solved cheaply, is the **specific** grasp: a known part, known mass, known geometry, presented in a known (or vision-estimated) pose. That's what factories and warehouses overwhelmingly have. The art of EOAT is reframing a scary-sounding manipulation problem into the specific grasp you actually face, then choosing the simplest mechanism that handles it. > **Rule of thumb:** If you find yourself reaching for a dexterous hand to solve an industrial pick, you have almost certainly mis-stated the problem. Re-examine the part presentation first. ### Where this fits in the system The end effector lives at the end of a kinematic chain you've already read about: the arm provides reach and pose (see [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/)), the actuators provide the motion ([robot actuators](/posts/robot-actuators-ultimate-guide/)), and increasingly the cell provides perception ([robot sensors](/posts/robot-sensors-ultimate-guide/)). The gripper is the last link, and it inherits all the constraints of the links above it: payload budget, flange interface, available I/O, and cycle time. ## Grasp fundamentals Before any product, understand the physics of holding. Two concepts do most of the work: **form closure** and **force closure**. ### Form closure vs force closure **Form closure** holds an object by geometry alone — the contacts surround it such that no motion is possible without deforming something, regardless of friction. A part dropped into a perfectly matched nest, or a peg captured in a slot, is form-closed. Form closure is robust and doesn't depend on clamping force, but it requires the gripper geometry to match the part, which is why custom fingertips matter so much. **Force closure** holds an object by friction at the contacts — the gripper squeezes hard enough that friction resists slipping. A parallel gripper pinching a smooth block is force-closed. 
Force closure is general (works on many shapes with the same jaws) but depends entirely on grip force and the coefficient of friction, and it fails the instant either drops. Most real grasps are a blend: a V-groove fingertip on a round shaft gives partial form closure (the V locates the shaft) plus force closure (the clamp resists axial pull-out). Designing fingertips is mostly about adding form closure so you can lower the force-closure demand — which lets you use a smaller, gentler, faster gripper. ### Friction is the whole game in force closure The force you need to hold a part by friction is set by the coefficient of friction μ between fingertip and part. Steel on dry steel is μ ≈ 0.15–0.3; steel on oily steel can drop below 0.1; nitrile or urethane fingertips on most surfaces give μ ≈ 0.5–1.0. That range spans a 5–10× difference in required grip force. The cheapest performance upgrade in all of EOAT is a soft, high-friction fingertip facing. > **Rule of thumb:** Before adding clamp force, add friction. Doubling μ halves the grip force you need, and high-μ facings cost a few dollars. ### Centering and the part dictates the gripper A parallel gripper with both jaws driven by one symmetric mechanism is **self-centering**: it pulls the part to the gripper's centerline regardless of where the part started (within stroke). That's enormously valuable — it removes part-position error and presents the part to the next station in a repeatable pose. Vacuum, by contrast, picks the part *where it is* and does not center it, which is why vacuum cells lean harder on vision. The single most important design input is the part itself. Write down: mass, dimensions and tolerances, surface (flat? curved? porous? oily? hot?), how it's presented (oriented in a fixture, jumbled in a bin, on a moving belt), how it must be released and into what, and how much the part *varies*. Nine times out of ten, that sheet of paper picks the gripper family before you've looked at a single catalog. 
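
The friction relationship described earlier in this section can be sketched in a few lines of Python. This is a simplified textbook model, not necessarily the convention used in this guide's sizing section: it assumes a pure friction (force-closure) grip with identical contact faces, and a worst-case arm acceleration that adds directly to gravity. The function name, the default safety factor of 2, and the 1 kg / 10 m/s² example are illustrative assumptions.

```python
G = 9.81  # gravitational acceleration, m/s^2


def required_grip_force(mass_kg, mu, accel_ms2=0.0, n_contacts=2, safety=2.0):
    """Estimate the clamp force per jaw (N) for a friction-only grasp.

    Assumes the load m * (g + a) is carried purely by friction and shared
    equally by n_contacts identical contact faces (simplified model).
    """
    if mu <= 0 or n_contacts <= 0:
        raise ValueError("mu and n_contacts must be positive")
    load_n = mass_kg * (G + accel_ms2)
    return safety * load_n / (mu * n_contacts)


# Hypothetical 1 kg part while the arm accelerates upward at 10 m/s^2.
for label, mu in (("oily steel", 0.1), ("dry steel", 0.2), ("urethane pad", 0.8)):
    force_n = required_grip_force(1.0, mu, accel_ms2=10.0)
    print(f"{label:12s} mu={mu:.1f} -> {force_n:6.1f} N per jaw")
```

Because μ sits in the denominator, swapping a bare steel jaw for a high-friction pad cuts the required clamp force by the same ratio as the friction gain, before any change to the gripper itself.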
## Parallel jaw grippers — the workhorse If you buy one type of gripper in your career, it'll be this one. A parallel (two-finger) gripper moves two jaws toward and away from each other along a common axis, usually self-centering, to pinch a part. Simple, robust, repeatable, and available from a dozen vendors in hundreds of sizes. ### Anatomy and the numbers that matter The specs that decide a parallel gripper: - **Stroke** (per jaw or total): how far the jaws open. Small electric grippers offer ~5–16 mm per jaw; pneumatic units range from a few mm to 100+ mm total. Your stroke must exceed part size variation plus clearance for approach and release. - **Grip force**: the clamp force at the jaws. Spans roughly 20 N for a small electric gripper up to several thousand newtons for large pneumatic units. This is the headline number for force-closure holding. - **Repeatability**: typically ±0.01–0.05 mm on jaw position for quality units — relevant when you use jaw position to measure or sort parts. - **Closing/opening time**: tens of milliseconds for small pneumatic grippers; electric grippers are often slower (50–500 ms) because they ramp force under control. - **Allowable finger length and moment**: long fingers multiply the moment on the guide bearings. Vendors publish max finger length vs force; exceed it and you wear out or jam the guide. ### Electric vs pneumatic — the real tradeoff This is the decision that matters most, and it's not close once you know the application. **Pneumatic parallel grippers** (SMC MHZ2 series, Festo DHPS/HGPC, Schunk PGN-plus) are cheap, fast, and strong for their size. A piston drives a wedge or rack-and-pinion that converts air pressure into clamp force. At 6 bar (600 kPa) a mid-size pneumatic gripper delivers hundreds of newtons in a compact body, opens and closes in 30–80 ms, and costs a few hundred dollars. 
The downsides: you get on/off (open/closed), not graded force or position; you need clean dry compressed air and valves; force is set by regulator pressure, not commanded per-pick; and feedback is limited to magnetic reed/Hall switches that confirm end positions. **Electric parallel grippers** (Robotiq Hand-E and 2F-85/2F-140, OnRobot RG2/RG6/2FG7, Schunk EGK/EGU, SMC LEHZ) put a servo or stepper behind a screw or linkage. You command position, speed, and force, and you read all three back over a fieldbus or IO-Link. That data is the point: you can detect part presence (did the jaws close on something or all the way?), sort parts by measured width, verify a grasp by gripping current, and adjust force per product without changing hardware. Robotiq's Hand-E offers a 50 mm total stroke, 20–130 N adjustable grip force, and IP67 sealing; the 2F-85 opens to 85 mm with up to ~235 N. The OnRobot RG6 reaches ~160 mm stroke and up to 120 N. Electric units cost more (often €1,500–€5,000), are slower under controlled force, and have lower peak force per kilogram than pneumatic — but on a cobot or a data-hungry process they win easily. > **Rule of thumb:** Pneumatic when the pick is fixed, fast, and high-force and you already have air. Electric when force or stroke must vary by product, when you want grasp data, or when you're on a cobot with no air and limited I/O. ### Fingertip design — where the work really is The gripper body is a commodity; the fingertips are bespoke and they make or break the cell. Principles: - **Add form closure.** V-grooves locate cylinders; pockets and steps locate prismatic parts; a contoured pocket can index a complex casting in one axis. - **Increase friction where you can't add form.** Nitrile, urethane, or knurled facings raise μ and cut required grip force. - **Mind the moment.** Keep the grip point close to the gripper face; long fingers amplify loads on the guide and reduce allowable force. - **Design for the release**, not just the grab. 
A part that's hard to let go (sticks to a soft facing, jams in a tight pocket) costs you cycle time and reliability. - **Make them swappable** if you run a family of parts — quick-change finger blanks beat reprogramming. 3D-printed fingertips (often in nylon or TPU) have become standard for prototyping and even production of low-force jaws; for high-force or abrasive work, machined aluminum or steel with bonded urethane pads is the durable answer. ## Vacuum & suction grippers If parallel jaws are the workhorse, vacuum is the *volume leader*. Walk any modern fulfillment center, printing plant, packaging line, or sheet-metal shop and you'll see far more suction cups than mechanical jaws. The reason is structural: most objects worth picking at high volume have at least one flat, clean, accessible, sealable face — a carton top, a glass sheet, a bagged product, a metal panel, a label. Vacuum exploits that face directly. See where this sits in a full cell in the [industrial robot arms guide](/posts/industrial-robot-arms-ultimate-guide/). ### How vacuum holding works A suction cup seals against the part; you evacuate the volume under it; atmospheric pressure on the outside of the part now pushes it against the cup with a force equal to the pressure difference times the effective sealed area. That's it — you're not "sucking" the part, the atmosphere is *pushing* it. Maximum theoretical force is about 101 kPa (one atmosphere) times the cup's effective area, but you never reach full vacuum and you must derate heavily for seal quality, surface, and dynamics. 
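
The pressure-times-area relationship in the paragraph above is easy to turn into a quick sizing helper. The Python sketch below is a rough estimate under stated assumptions: the load acts along the cup axis, all cups seal equally, and a single safety factor covers seal quality and dynamics. The helper names `cup_force_n` and `cups_needed` and the 10 kg example part are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def cup_force_n(diameter_m, vacuum_pa):
    """Theoretical holding force (N) of one round cup at a given gauge vacuum."""
    area_m2 = math.pi * (diameter_m / 2) ** 2
    return abs(vacuum_pa) * area_m2  # use the magnitude of the negative gauge pressure


def cups_needed(mass_kg, diameter_m, vacuum_pa, safety=2.0, accel_ms2=0.0):
    """Number of identical cups needed to hold a part, after derating by `safety`."""
    load_n = mass_kg * (G + accel_ms2)
    usable_per_cup_n = cup_force_n(diameter_m, vacuum_pa) / safety
    return math.ceil(load_n / usable_per_cup_n)


print(f"60 mm cup at -60 kPa: {cup_force_n(0.060, -60e3):.0f} N theoretical")
print("cups for 10 kg, safety 2:", cups_needed(10.0, 0.060, -60e3, safety=2.0))
print("cups for 10 kg, safety 4:", cups_needed(10.0, 0.060, -60e3, safety=4.0))
```

Rounding up to whole cups is deliberate: a fractional cup count means the next larger array, not a deeper vacuum setting.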
```text
Vacuum holding force:

  F_vac = ΔP × A_eff

where:
  ΔP     = pressure difference (vacuum level), Pa   [negative gauge → use magnitude]
  A_eff  = effective sealed area of the cup, m²

Example (one 60 mm round cup at −60 kPa vacuum):
  A_eff ≈ π × (0.030)² = 2.83e-3 m²   (≈ 28.3 cm²)
  ΔP    = 60,000 Pa
  F_vac = 60,000 × 2.83e-3 ≈ 170 N    (theoretical, vertical lift, perfect seal)

Apply a safety factor S for orientation and dynamics:
  - vertical lift, smooth handling:  S ≈ 2
  - horizontal/shear or fast moves:  S ≈ 4

So usable hold for this cup: ~40–85 N depending on conditions.
```

The takeaway: cup *area* drives force, and you reach for **more cups or bigger cups**, not deeper vacuum, when you need more hold. Vacuum deeper than about −60 to −70 kPa buys little for porous or imperfect surfaces and risks marking delicate parts.

### Venturi (ejector) vs vacuum pump

Two ways to make the vacuum, and the choice matters for energy and reliability.

**Venturi / ejector** (compressed-air-driven, e.g. Piab piCLASSIC/piGREEN, Schmalz SCPi/SEP) blows compressed air through a nozzle; the Venturi effect drops pressure and evacuates the cup. Pros: no moving parts, instant response, mounts right at the cup (short evacuation volume = fast pick), tolerant of dust, cheap to buy. Cons: they consume compressed air continuously while gripping unless you add an air-saving (blow-off-and-hold) circuit, and compressed air is the most expensive utility in the plant per joule delivered. Multi-stage ejectors (COAX-class) improve efficiency. Best for fast cycles, distributed cups, and dirty environments.

**Electric vacuum pump / blower** (central rotary-vane or claw pump, or a regenerative blower) generates vacuum centrally and distributes it. Pros: far more energy-efficient for sustained high flow; enough flow to handle porous/leaky parts (cardboard, wood, fabric) that ejectors can't keep up with; no compressed air needed.
Cons: capital cost, central plumbing, slower response unless valved locally, maintenance on the pump. Best for high-flow porous handling (corrugated, textiles) and energy-conscious continuous duty.

> **Rule of thumb:** Sealable, low-leak parts on fast cycles → ejectors at the cup. Porous, leaky, or high-duty handling → an electric pump sized for *flow*, not just vacuum level.

### Cups, sealing, and surfaces

Cup choice is its own discipline. Variables: diameter (drives force), shape (flat for rigid flat parts; bellows for uneven surfaces, height compensation, and gentle compliance; oval for narrow parts), and material/durometer (nitrile for general use and oil resistance; silicone for food and high temp but watch marking; urethane for abrasion; HNBR and special compounds for hot or aggressive parts). Bellows cups (1.5, 2.5, or multi-fold) self-level on tilted parts and add stroke for height variation, which is invaluable in depalletizing mixed cartons. Sealing is everything: a cup that seals 90% still leaks, and on a leaky part an ejector simply can't hold vacuum. Mark-off (residual ring on glossy or painted parts) and ESD requirements drive material and surface treatment choices.

### Multi-cup arrays and zoning

For large or variable parts you use arrays. Two patterns:

- **Fixed multi-cup tools** with spring-loaded cup mounts so each cup self-levels and only sealing cups contribute (common for sheet metal and glass).
- **Zoned / foam vacuum grippers** (Schmalz FXP/FMP foam plates, Piab piCOBOT layout) where a porous foam face or a grid of many small cups covers a large area and a high-flow pump simply tolerates the unsealed cells. This is how a single tool picks cartons of many sizes without retooling, and it is the basis of much robotic depalletizing and order picking.

Zoned vacuum (valving the array into independently controlled regions, each with its own check valve) lets you pick a small part with a few cups and a large part with all of them, without losing vacuum through the open cells.

## Angular, 3-finger & adaptive grippers

Between the rigid parallel gripper and the soft hand sits a family that trades some force and stiffness for **shape adaptability**.

### Angular grippers

Instead of translating, the jaws pivot about a hinge and swing open and shut like pliers. Angular (and the related radial) grippers are mechanically simple and compact, good where there's no room for linear travel or where a wide swing-clear is useful. The catch: contact geometry changes through the stroke, so they suit a narrow part-size range and are less common than parallel types.

### Three-finger and centric grippers

A **3-finger centric gripper** drives three jaws inward symmetrically, which gives excellent self-centering and suits round or hexagonal parts (shafts, bottles, flanges) because three contacts at 120° resist tilt far better than two. Schunk's PZN-plus and many machine-tool loaders use this layout. Three rigid fingers give strong, well-centered grasps on rotationally symmetric parts but are no more general than two when the part is prismatic.

### Adaptive / underactuated grippers

The interesting class is **underactuated adaptive** grippers, where one or two motors drive multiple linked finger joints through compliant couplings so the fingers *conform* to the object. The Robotiq 3-Finger Adaptive Gripper is the reference: three articulated fingers, each with multiple phalanges, driven so they automatically switch between **encompassing** (wrapping around an object, power grasp) and **fingertip/pinch** (precise grasp of small parts) modes depending on contact. Grip force is on the order of 15–60 N per finger, payload is up to ~10 kg, and it handles a remarkable variety of shapes with one program.
OnRobot (the 3FG15 three-finger centric gripper, ~10–240 N, up to ~15 kg payload) and various Schunk adaptive units occupy similar ground. The pitch is real: mixed-part handling, machine tending across a family of workpieces, and applications where you can't justify a custom tool per part. The honest limitations: adaptive grippers have lower peak force and lower stiffness than a rigid jaw of the same size, the underactuated compliance means grasp pose is less precisely controlled, and they cost more and weigh more. They're a fine answer for variety; they're the wrong answer for high force, high precision, or fast fixed picks. ## Magnetic, needle, Bernoulli & specialty grippers Plenty of parts don't suit jaws or cups. The specialty families: **Magnetic grippers.** For ferrous parts (steel sheet, stampings, tools), an electromagnet or a switchable permanent magnet (e.g. Schunk EMH, Goudsmit) holds with high force per area and tolerates oil, dirt, and rough surfaces that defeat vacuum. Switchable permanent ("electro-permanent") magnets hold with zero power and only need power to switch — fail-safe against power loss. Watch for: residual magnetism left in the part, picking *two* sheets at once (use fanners/destackers), and the obvious — non-ferrous parts need not apply. **Needle (pin) grippers.** Fine needles drive at opposing angles into porous or fibrous material (textiles, carbon-fiber preforms, foam, leather) and interlock mechanically. They're the standard answer for limp fabric handling, where neither cups nor jaws get a grip. The trade is small visible needle marks and limited force per gripper. **Bernoulli (non-contact) grippers.** A high-velocity radial air flow under a flat head creates a low-pressure region (Bernoulli effect) that lifts the part toward the head while the air film keeps it from touching — near-contactless holding with side pins for centering. 
Used for delicate, thin, or contamination-sensitive parts: silicon wafers, solar cells, thin films, food slices. They consume a lot of air and hold relatively gently, but the non-contact, shear-tolerant grip is unique. (The same physics is sometimes called a "cyclone" or "vortex" gripper.) **Electrostatic and gecko-inspired grippers.** Electroadhesive pads hold non-magnetic, flat, even non-sealable items (PCBs, fabrics, films) with modest force; gecko-inspired microstructured adhesives (dry adhesion, as developed for space/solar handling) hold smooth surfaces without residue. Both are niche but growing in clean and delicate handling. **Ice / cryogenic and adhesive grippers.** For irregular soft food (fish fillets, dough), freezing a thin contact layer or using a controlled adhesive can grip where nothing mechanical will. Rare, process-specific, but real. ## Soft & compliant grippers Soft robotics tackles the opposite end from the rigid jaw: objects that are delicate, deformable, irregular, slippery, or biological — produce, baked goods, soft consumer products, living tissue, anything that varies part-to-part and can't tolerate a hard clamp. ### Pneumatic silicone fingers (bellows actuators) The dominant commercial form is the **pneumatic bending finger**: a molded silicone or elastomer chamber with an asymmetric wall that curls when inflated, wrapping around an object with gentle, distributed force. Soft Robotics Inc.'s mGrip/SuperPick tooling is the reference — food-safe, washdown-rated modules that pick irregular produce and proteins at line rates. Grip is gentle (a few newtons distributed), compliance is automatic (the finger conforms to whatever shape it meets), and one tool handles wide part variation. The cost: low force, finite fatigue life of the elastomer, and air supply. For food and delicate variable handling, nothing else is as turnkey. 
### Fin-ray effect fingers The **Fin Ray** structure (inspired by fish fins, commercialized by Festo as the basis of many adaptive fingers, and now made by many vendors including in 3D-printed TPU) is a triangular rib structure that *bends toward* a force applied to its flank — so when it presses on an object, it wraps around it passively, no extra actuation needed. Fin-ray fingers bolt onto an ordinary parallel gripper and instantly give it shape-adaptive, gentle, self-conforming jaws. They're cheap, passive, printable, and a genuinely good upgrade for handling mixed or fragile rigid parts. Limits: low stiffness and force, and they wear. ### Granular jamming (the "universal gripper") The famous **jamming gripper**: a flexible membrane filled with granular material (coffee grounds, glass beads) presses down over an object to conform around it, then a vacuum is applied to the membrane, jamming the grains into a rigid solid that grips by a combination of friction, suction, and geometric interlock. One tool grips an enormous variety of shapes with no per-part programming. The catches are real, though: it needs to press *onto* the part (top access, some force), the grip is modest and not precisely controlled, cycle time includes jam/unjam, and the membrane wears. It's a clever, well-publicized mechanism that stays mostly in research and a few niche cells. > **Rule of thumb:** Reach for soft grippers when the part is delicate, deformable, or highly variable and force precision doesn't matter. Don't reach for them when you need stiffness, high force, fast fixed picks, or tight grasp-pose control. ## Dexterous robot hands At the far end of the spectrum are anthropomorphic, multi-fingered **dexterous hands** — the things that make humanoid renders look magical and that, in reality, remain among the hardest hardware in robotics. They tie directly into the [humanoid robot hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/). 
### What "dexterous" means in DoF A human hand has roughly 21–27 functional degrees of freedom. Research hands approximate this: - **Shadow Dexterous Hand** — ~20 actuated DoF (24 joints), tendon-driven from a forearm of actuators, with tactile fingertips. The most anthropomorphic widely cited hand; price is on the order of €100k+ and it's a research instrument, not a production tool. - **Allegro Hand** (Wonik Robotics) — 16 DoF, 4 fingers, direct-drive-ish geared motors in the fingers, a popular research platform at roughly €20k–€30k because it's far simpler and more robust than a Shadow. - **Humanoid hands** — Tesla Optimus, Figure, Sanctuary, 1X and others have iterated hands in the ~11–22 DoF range, mixing tendon drive (motors in the forearm pulling cables) with some in-hand actuation, and they're a major focus precisely because the hand gates what a humanoid can actually *do*. ### Tendon drive vs in-hand direct drive The central design fork: **Tendon-driven** hands put the motors in the forearm and route cables (tendons) through the fingers, like biology. This keeps finger mass and size low (slim, fast fingers) but brings tendon friction, stretch, routing wear, and the control headache of cable dynamics. Most highly anthropomorphic hands (Shadow, many humanoids) are tendon-driven for the form factor. **In-hand / direct-geared** hands put small motors at or near the joints (Allegro-style). Simpler control and no cable maintenance, at the cost of bulkier, heavier fingers and lower DoF density. ### Why dexterous hands are hard It's worth being blunt about the failure modes, because they explain the price tags and the absence from factories: - **Actuation density.** Packing 16–20 controllable, force-capable joints into a hand-sized envelope is brutal — every gram and millimeter fights you. - **Sensing.** Real manipulation needs dense tactile and force sensing on every fingertip and ideally the whole surface; that sensing is fragile, expensive, and hard to wire. 
- **Control.** Coordinating 20 DoF for stable grasps and in-hand reorientation is an unsolved-in-general control and learning problem; teleoperation and imitation learning are the current crutches. - **Durability.** Tendons stretch and fray, soft fingertips wear, and a hand takes more impacts than any other part of the robot. - **Cost.** All of the above puts capable hands at €20k–€100k+, which no industrial pick can justify when a €400 gripper does the job. The honest verdict: dexterous hands are a research and humanoid-development tool, justified when general manipulation is the *product* (humanoids, prosthetics, telepresence in hazardous environments), and almost never the right answer for a known industrial task. ## Actuation & sensing in grippers A gripper is itself a little actuator-plus-sensor system, and the same tradeoffs from the [actuators guide](/posts/robot-actuators-ultimate-guide/) and [servo motors guide](/posts/servo-motors-ultimate-guide/) apply in miniature. ### Electric servo vs pneumatic, again — at the mechanism level Pneumatic grippers convert air pressure to clamp force via a piston and a force-multiplying linkage (wedge, cam, rack-and-pinion). Force is set by supply pressure and the mechanism's mechanical advantage; you change force by changing the regulator. Fast, strong, cheap, binary. Electric grippers put a brushless or stepper motor behind a screw (ball or lead) or a linkage; a small drive — often running field-oriented control, see [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/) — commands position and current. Because **motor current is roughly proportional to torque, and torque maps to clamp force through the mechanism**, you can set and read grip force by controlling current. That's how an electric gripper offers "adjustable force" without a load cell — it's current-based force estimation, not a true force sensor. 
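
As a rough illustration of that current-to-force chain, here is a minimal Python sketch for a screw-driven gripper. The torque constant, screw lead, efficiency, and gear ratio are hypothetical placeholder values, not figures from any datasheet, and the ideal screw relation F = 2π·η·τ / lead ignores the stiction and hysteresis that make real estimates coarse.

```python
import math


def estimate_clamp_force(current_a, kt_nm_per_a, lead_m, efficiency, gear_ratio=1.0):
    """Estimate clamp force (N) from motor current for a screw-driven gripper.

    Ideal chain: current -> motor torque -> screw torque -> axial force.
    """
    motor_torque = kt_nm_per_a * current_a        # tau = Kt * I
    screw_torque = motor_torque * gear_ratio      # after any gear stage
    return 2.0 * math.pi * efficiency * screw_torque / lead_m


def current_for_force(target_force_n, kt_nm_per_a, lead_m, efficiency, gear_ratio=1.0):
    """Invert the estimate: the current limit (A) matching a target force."""
    screw_torque = target_force_n * lead_m / (2.0 * math.pi * efficiency)
    return screw_torque / (gear_ratio * kt_nm_per_a)


# Hypothetical values, for illustration only.
KT = 0.03      # N*m/A, assumed motor torque constant
LEAD = 0.002   # m of travel per screw revolution (2 mm lead), assumed
ETA = 0.4      # assumed overall efficiency of a lead screw

force = estimate_clamp_force(1.0, KT, LEAD, ETA)
limit = current_for_force(50.0, KT, LEAD, ETA)
print(f"1.0 A -> about {force:.1f} N; 50 N target -> about {limit:.2f} A limit")
```

The inverse function is what a current-limit setpoint amounts to: pick a target force, back out the current, and accept the uncertainty hidden in the efficiency term.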
### Force control — what's real and what's marketing Be precise about "force control": - **Open-loop / current-limited** (most electric grippers): the drive limits motor current to a setpoint, which *estimates* clamp force through the (friction-laden, sometimes nonlinear) mechanism. Good enough to avoid crushing parts and to grade force by product; not metrologically accurate. - **Closed-loop force** (rare in production grippers, common in research hands): an actual force or torque sensor in the loop, controlling contact force directly. This is what you need for true delicate manipulation and what most dexterous hands aim for. For most industrial picks, current-limited "force control" is entirely adequate — you just need to know that's what you're buying. ### Tactile feedback and slip detection The frontier sensing, mostly still emerging in production: - **Force/torque at the wrist** — a 6-axis F/T sensor (ATI, OnRobot HEX, Bota) above the gripper measures contact forces for assembly, insertion, and polishing. Mature and widely used, though it senses at the wrist, not the fingertip. See [robot sensors](/posts/robot-sensors-ultimate-guide/). - **Tactile arrays** — capacitive, resistive, or MEMS pressure arrays on fingertips give a contact pressure map. Useful for grasp quality and centering; durability and wiring are the obstacles. - **Optical tactile (GelSight-class)** — a camera images a soft, marked gel as it deforms against the object, recovering a high-resolution surface and shear map. Spectacular data density, used heavily in research manipulation; bulky and still maturing for the field. - **Slip detection** — sensing incipient slip (via vibration, shear measurement, or tactile flow) so the gripper can increase force *just enough*. This is how humans grip with minimal force, and it's the holy grail for gentle, energy-minimal grasping. A few products exist; most cells still just clamp harder. 
> **Rule of thumb:** For industrial picks, put your sensing budget into a wrist F/T sensor and grip-current monitoring. Fingertip tactile and slip detection are worth it only when the manipulation itself is the hard part. ## Tool changers & multi-tool One robot, several jobs: a cell may need to grip a part, set it down, then deburr it; or run product A with a vacuum tool and product B with jaws. The answer is an **automatic tool changer (ATC)**. ### How they work An ATC is two halves: a **master** bolted to the robot flange and a **tool** plate on each end effector, with a locking mechanism (pneumatic piston driving balls into a locking ring is the common ATI/Schunk design) and pass-throughs for air, electrical signals, fieldbus, and sometimes fluid or high power. The robot drives the master into a tool sitting in a dock, locks, and carries it away; reverse to drop it. Vendors: ATI Industrial Automation (the QC series is the reference), Schunk SWS, OnRobot Quick Changer for the lighter cobot world. ### When they pay off — and what they cost ATCs earn their place when: - one robot must use **multiple distinct tools per cycle or per product**, and - the alternative (a separate robot per tool, or a giant combination tool) is more expensive, or - you need **tool maintenance/swap without re-teaching** (changers are highly repeatable, ±0.01–0.02 mm). They cost you real things: **payload and reach** (the changer adds mass at the flange and stack height that pushes the tool further from the wrist, hurting your moment budget), **time** (a change is typically 1–5 seconds including the move to the dock), **complexity** (docks, more I/O, more pneumatics), and **money**. A combination tool (vacuum *and* jaws on one bracket, selected by program) is often the better answer when you only need two simple tools and have the payload — no docking move, no change time. 
> **Rule of thumb:** If you'd change tools more than a few times an hour and the tools are heavy or numerous, use a changer. If it's two light tools you switch rarely, build a combo tool and skip the changer.

## Sizing & selecting a gripper

Now the math. This is where most EOAT goes wrong — by sizing on static weight and ignoring dynamics.

### Step 1 — required grip force (force closure)

To hold a part by friction against gravity *and* the arm's accelerations:

```text
Required grip force (two opposing jaws, friction grip):

  F_grip ≥ (m × (g + a) × S) / (2 × μ)

where:
  m = part mass, kg
  g = 9.81 m/s²
  a = worst-case acceleration of the part from robot motion, m/s²
  μ = coefficient of friction, fingertip–part
  S = safety factor (≥ 2 typical)
  2 = two friction surfaces (one per jaw)

Example — 2 kg steel part, urethane fingertips (μ ≈ 0.6),
robot peak accel a ≈ 20 m/s² (~2 g), S = 2:

  F_grip ≥ (2 × (9.81 + 20) × 2) / (2 × 0.6)
         = (2 × 29.81 × 2) / 1.2
         = 119.24 / 1.2
         ≈ 99 N per ...
  → need a gripper rated ≥ ~100 N grip force
```

Two things jump out. First, **acceleration roughly tripled the demand** versus the static 33 N you'd get with a=0 at the same S=2. Second, **friction is a divisor** — drop μ to 0.15 (oily steel) and the same part needs ~400 N. Worst-case acceleration includes the part being flung in a slew, not just lifted; for shear/horizontal holds the geometry changes and you size against the worst orientation in the path.

### Step 2 — vacuum sizing (if vacuum)

Use the `F_vac = ΔP × A_eff` relation from the vacuum section, derate by S = 2 (vertical, gentle) to 4 (shear, fast), and pick cup count and diameter so the *sum* of usable cup forces beats the demand. Size the **flow** (ejector or pump) for the part's leakage, not just the vacuum level — porous parts are flow-limited, not pressure-limited.
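
The Step 1 formula and the Step 2 cup count can be sketched in a few lines of Python. The grip-force calls reuse the numbers from the worked example above. The vacuum figures (60 kPa differential, 30 mm effective cup diameter, 5 m/s² acceleration) are assumed purely for illustration, and applying S to each cup's force while including acceleration in the load is one reasonable reading of the sizing rule, not the only one.

```python
import math

G = 9.81  # m/s^2


def required_grip_force(mass_kg, accel_ms2, mu, safety=2.0, friction_surfaces=2):
    """Step 1: F_grip >= m * (g + a) * S / (n * mu), in newtons."""
    return mass_kg * (G + accel_ms2) * safety / (friction_surfaces * mu)


def cup_force(delta_p_pa, effective_diameter_m):
    """Step 2: F_vac = dP * A_eff for one cup, in newtons."""
    a_eff = math.pi * (effective_diameter_m / 2.0) ** 2
    return delta_p_pa * a_eff


def cups_needed(mass_kg, accel_ms2, delta_p_pa, effective_diameter_m, safety):
    """Smallest cup count whose derated forces sum past the load."""
    load = mass_kg * (G + accel_ms2)
    usable_per_cup = cup_force(delta_p_pa, effective_diameter_m) / safety
    return math.ceil(load / usable_per_cup)


# Worked example from Step 1: 2 kg part, mu = 0.6, a = 20 m/s^2, S = 2.
dynamic = required_grip_force(2.0, 20.0, 0.6)
static = required_grip_force(2.0, 0.0, 0.6)
oily = required_grip_force(2.0, 20.0, 0.15)
print(f"dynamic {dynamic:.1f} N, static {static:.1f} N, oily {oily:.0f} N")

# Assumed vacuum case: 60 kPa differential, 30 mm effective cup diameter.
print(cups_needed(2.0, 5.0, 60_000.0, 0.030, safety=2.0), "cups")
```

Sweeping `mu` and `accel_ms2` over the plausible range for a cell is a quick way to see which of the two dominates before committing to a gripper size.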
### Step 3 — payload at the flange (dynamics included)

The robot's rated payload must cover **part mass + gripper mass + tool-changer/sensor mass**, and the *moment* those create at the wrist matters as much as the mass. A 3 kg part on a 2 kg gripper 150 mm off the flange can exceed a "5 kg" robot's allowable wrist moment even though 3 + 2 kg stays within the 5 kg rating. Check the robot's payload-vs-inertia chart, not just the headline number. Keep ~2× margin on rated payload after you've added everything and accounted for acceleration.

### Step 4 — stroke, cycle time, variability

- **Stroke** ≥ part size variation + approach/release clearance + fixture clearance.
- **Cycle time**: budget the gripper's open/close time (pneumatic ~30–80 ms; electric 50–500 ms; vacuum pick/release depends on volume and flow). On fast lines the gripper, not the arm, can be the bottleneck.
- **Variability**: if the part varies a lot in shape, you're pushed toward adaptive, soft, or zoned-vacuum tools — at the cost of force and precision.

### The decision tree

> **The 30-second selector:**
> 1. **Is there one flat, clean, sealable, accessible face?** → Vacuum (ejector if sealable/fast, pump if porous/high-duty). Add cups for force, zone them for variety.
> 2. **No good vacuum face, part is rigid with defined sides?** → Parallel jaw (electric for data/variable force, pneumatic for cheap fast force). Engineer the fingertips.
> 3. **Rigid but round/symmetric or family of sizes?** → 3-finger centric or adaptive gripper.
> 4. **Delicate, deformable, food, or highly variable?** → Soft (silicone bellows, fin-ray, jamming).
> 5. **Ferrous and flat?** → Magnetic (electro-permanent for fail-safe).
> 6. **Limp fabric / porous sheet?** → Needle. **Thin, fragile, contamination-sensitive?** → Bernoulli/non-contact.
> 7. **General manipulation is the product (humanoid/research)?** → Dexterous hand — and budget accordingly.

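
Before moving on to the tables, the Step 3 payload check is worth a quick numeric sketch, because mass and moment can give different answers. The Python below is a simplified point-mass model: the centre-of-gravity offsets, the 10 N·m allowable moment, and the 10 m/s² acceleration are hypothetical, and real datasheets specify allowable moment and inertia per wrist axis, which this does not capture.

```python
G = 9.81  # m/s^2


def wrist_moment_nm(items, accel_ms2=0.0):
    """Sum m * (g + a) * d over (mass_kg, cog_offset_m) pairs.

    Treats the acceleration as acting in the worst direction, i.e. adding
    directly to gravity, which is a conservative simplification.
    """
    return sum(mass * (G + accel_ms2) * offset for mass, offset in items)


def payload_check(items, rated_payload_kg, allowable_moment_nm, accel_ms2=0.0):
    """Compare total mass and wrist moment against the robot's limits."""
    total_mass = sum(mass for mass, _ in items)
    moment = wrist_moment_nm(items, accel_ms2)
    return {
        "total_mass_kg": total_mass,
        "moment_nm": moment,
        "mass_ok": total_mass <= rated_payload_kg,
        "moment_ok": moment <= allowable_moment_nm,
    }


# Hypothetical tool stack: 2 kg gripper with its CoG 80 mm from the wrist
# axis, 3 kg part with its CoG 150 mm out, on a "5 kg" robot with an
# assumed 10 N*m allowable wrist moment.
stack = [(2.0, 0.08), (3.0, 0.15)]

print(payload_check(stack, 5.0, 10.0))                  # static case
print(payload_check(stack, 5.0, 10.0, accel_ms2=10.0))  # with acceleration
```

With these assumed numbers the mass check passes in both cases while the moment check fails once acceleration is included, which is the failure the headline payload figure hides.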
### Comparison tables

**Gripper-type comparison**

| Gripper type | Typical grip/hold force | Payload range | Best for | Weakness | Representative cost |
|---|---|---|---|---|---|
| Parallel jaw, pneumatic | ~50–3,000+ N | 0.1–20+ kg | Fast fixed picks, high force | On/off only, needs air | $200–$1,500 |
| Parallel jaw, electric | ~20–400 N | 0.1–10 kg | Data, variable force, cobots | Slower, lower N/kg | $1,500–$5,000 |
| Vacuum, single/array | ~20–2,000+ N (cup-dependent) | 0.1–50+ kg | Flat/sealable faces, logistics | Needs sealable face | $100–$3,000 |
| 3-finger centric | ~30–300 N | up to ~15 kg | Round/symmetric parts | Less general than it looks | $1,000–$8,000 |
| Adaptive/underactuated | ~15–240 N | up to ~15 kg | Mixed-part variety | Low force/stiffness | $5,000–$20,000 |
| Soft (silicone/fin-ray) | a few N, distributed | up to ~a few kg | Delicate/variable/food | Low force, wear | $500–$10,000 |
| Magnetic | high per area | up to 100s kg | Ferrous, dirty surfaces | Ferrous only, double-pick | $300–$5,000 |
| Dexterous hand | per-finger, low–moderate | task-dependent | General manipulation R&D | Cost, durability, control | $20k–$100k+ |

**Vacuum vs mechanical decision table**

| Factor | Favors vacuum | Favors mechanical (jaws) |
|---|---|---|
| Part face | One flat, clean, sealable face | No sealable face; gripped from sides |
| Surface | Smooth, non-porous (or pump for porous) | Any; rough/oily fine with right fingertips |
| Centering needed | No (or vision handles it) | Yes — self-centering jaws fix part pose |
| Cycle speed | Very fast (ejector at cup) | Fast (pneumatic), slower (electric) |
| Part variety | High (zoned/foam tool) | Low–moderate (per-part fingertips) |
| Force/shear demand | Low–moderate, mostly normal | High, including shear |
| Cleanliness/marking | Risk of mark-off on glossy parts | Can mar with hard jaws; soft facings help |
| Utility cost | Air-hungry (ejector) or pump capex | Air (pneumatic) or none (electric) |

**Real-product spec snapshot**

| Product | Type | Stroke / cup | Grip / hold force | Payload | Interface | Notes |
|---|---|---|---|---|---|---|
| Robotiq Hand-E | Electric parallel | 50 mm total | 20–130 N (adj.) | ~5 kg | IO-Link/fieldbus | IP67, cobot-focused |
| Robotiq 2F-85 | Electric parallel | 85 mm | up to ~235 N | ~5 kg | fieldbus | Wide opening |
| OnRobot RG6 | Electric parallel | up to ~160 mm | up to 120 N | ~6 kg | OnRobot tool I/O | Long stroke |
| Schunk PGN-plus-P | Pneumatic parallel | size-dependent | up to several kN | up to 10s kg | air + reed/Hall | Industrial workhorse |
| SMC MHZ2 | Pneumatic parallel | a few–30+ mm | ~10s–100s N | small parts | air | Compact, cheap |
| Robotiq 3-Finger | Adaptive 3-finger | encompass/pinch | ~15–60 N range | ~10 kg | fieldbus | Mode-switching |
| Piab piCOBOT | Ejector vacuum | cup-dependent | cup-dependent | ~10–12 kg sys | IO-Link | Cobot vacuum kit |
| Schmalz FXP/FMP | Foam vacuum plate | full-area foam | area-dependent | up to 10s kg | pump + valves | Mixed-carton picking |
| Soft Robotics mGrip | Soft silicone fingers | conforming | a few N, gentle | up to a few kg | air | Food/washdown |
| Allegro Hand | Dexterous (16 DoF) | — | per-finger | task | CAN/EtherCAT | Research platform |

*(Figures are representative of catalog values circa 2024–2026; always confirm against the current datasheet for your exact size and revision.)*

## Integration: mounting, I/O, control

A gripper that's right on paper still has to bolt on, get power and signals, and be commanded. Integration is mostly plumbing and protocol — and it's where schedule slips hide. This ties into the broader cell picture in the [cobots guide](/posts/collaborative-robots-cobots-ultimate-guide/) and [industrial automation (PLC/SCADA/fieldbus) guide](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/).

### The mechanical interface — ISO 9409-1

Most robot wrists present an **ISO 9409-1** circular flange (e.g.
a 50-4-M6 or 63-4-M6 pattern: a bolt circle, a locating boss, and a dowel hole for repeatable angular alignment). Match your gripper's mounting plate to the robot's flange code, or machine an adapter. Use the dowel — bolts alone let the tool rotate over time. Account for stack height: every adapter, sensor, and changer pushes the gripper further from the wrist and eats moment budget. ### Electrical / signal I/O Three common levels: - **Discrete digital I/O** — simplest, for pneumatic grippers and basic sensors: a couple of outputs to drive solenoid valves, a couple of inputs from reed/Hall position switches. The robot's tool-side connector usually breaks out a handful of 24 V lines. - **IO-Link** — a point-to-point digital link to a single device that carries parameters and diagnostics over the same wire; increasingly standard for smart grippers (set force/position, read status) without a full fieldbus drop at the tool. - **Fieldbus** (EtherCAT, PROFINET, EtherNet/IP, Modbus) — full data exchange for electric/adaptive grippers and F/T sensors: command position/speed/force, read back everything. This is where the gripper becomes a data source for the line. Plan the tool-side cabling and a robust connector at the wrist; cable flex and chafe at the wrist is a leading cause of intermittent EOAT faults. ### Pneumatics — get the air right For pneumatic grippers and ejectors: supply **clean, dry, regulated air** (a filter-regulator, ideally with coalescing filtration and a dryer upstream — moisture and oil kill seals and foul ejectors). Size tubing for flow, not just pressure — a starved ejector won't reach vacuum level. Mount solenoid valves close to the tool to cut response delay (dead volume slows both clamping and vacuum pickup). Add flow controls to tune jaw speed and reduce impact. 
### Control and the robot program Finally, the robot has to *use* the gripper: drivers/URCaps/plugins for the controller, a grip/release in the program with the right dwell (let pneumatics seat, let vacuum build, confirm before moving), and feedback handling — check part-present before transit, handle a failed grasp with a retry or fault. The best cells treat grasp confirmation as a first-class signal, not an afterthought; a dropped part detected at the gripper is cheap, a dropped part discovered three stations later is expensive. > **Rule of thumb:** Budget grasp *confirmation* into the cycle — gripper position/current, vacuum-on feedback, or a presence sensor. Verifying the grasp before you move is the single highest-leverage reliability investment in EOAT. ## Frequently asked questions **What's the difference between an end effector and a gripper?** The end effector is anything mounted at the robot's wrist to do work — a gripper, a vacuum tool, a welding torch, a screwdriver, a dispenser. A gripper is the subset of end effectors that grasps and holds objects. "EOAT" (end-of-arm tooling) is the whole assembly: gripper plus bracket, fingers, sensors, and interface. **Electric or pneumatic gripper — which should I choose?** Pneumatic if the pick is fixed, fast, and high-force and you already have compressed air: cheaper, faster, stronger per kilogram, but on/off only. Electric if force or stroke must vary by product, if you want grasp data (position, force, current) over a fieldbus, or if you're on a cobot with no air: more controllable and informative, but pricier and slower under controlled force. **How much grip force do I actually need?** Size it as F ≥ m·(g+a)·S / (2·μ): part mass times gravity-plus-acceleration, times a safety factor (≥2), divided by twice the friction coefficient. Acceleration often doubles or triples the static demand, and low friction (oily steel, μ≈0.1) can multiply it several-fold. 
Add friction (soft, high-μ fingertips) before adding force — it's the cheapest fix. **When is vacuum the right choice over mechanical jaws?** When the part has one flat, clean, accessible, sealable face — cartons, sheets, glass, panels, bags. Vacuum is fast and handles huge part variety with zoned/foam tools, which is why it dominates logistics and packaging. Use jaws when there's no sealable face, when you need to grip from the sides, when you need self-centering, or when forces are high and include shear. **Venturi ejector or electric vacuum pump?** Ejectors (compressed-air-driven) for sealable, low-leak parts on fast cycles — instant response, mount at the cup, cheap, but air-hungry. Electric pumps/blowers for porous, leaky, or high-duty handling (corrugated, fabric) where you need high *flow*, and for energy efficiency in continuous duty. Size vacuum tools for flow on leaky parts, not just vacuum level. **Why are dexterous robot hands so expensive and so rare in factories?** Because packing 16–24 controllable, sensed, durable joints into a hand-sized envelope is extraordinarily hard — actuation density, fragile tactile sensing, unsolved general control, and tendons that wear all stack up. The result costs €20k–€100k+, and no industrial pick can justify that when a €400 gripper does the specific job. They make sense only when general manipulation is the actual product (humanoids, prosthetics, hazardous telepresence). **What is the difference between form closure and force closure?** Form closure holds a part by geometry — the contacts surround it so it can't move regardless of friction (a part in a matched nest). Force closure holds by friction — the gripper squeezes hard enough that friction resists slipping. Good fingertip design adds form closure (V-grooves, pockets) so you can lower the force-closure demand and use a smaller, gentler gripper. **Are soft grippers strong enough for real production?** For the right parts, yes — but "strong" isn't their point. 
Pneumatic silicone fingers, fin-ray jaws, and jamming grippers deliver gentle, distributed, conforming grasps for delicate, deformable, or highly variable objects (produce, proteins, soft goods). They're in real food and consumer-goods production. Don't use them where you need stiffness, high force, fast fixed picks, or precise grasp-pose control. **Do I need a tool changer?** Only if one robot must run multiple distinct tools per cycle or per product and a combo tool or separate stations can't do it more cheaply. Changers are highly repeatable (±0.01–0.02 mm) but cost payload, stack height, ~1–5 s per change, and complexity. For two light tools you switch rarely, build a combination tool and skip the changer. **How does an electric gripper "control force" without a force sensor?** Through motor current. In a servo gripper, current is roughly proportional to torque, and torque maps to clamp force through the mechanism — so limiting current sets an estimated grip force, and reading current estimates the actual force. It's current-based estimation, not metrology: good enough to avoid crushing parts and grade force by product, but not a true closed-loop force measurement. Hands that need real delicate manipulation add actual fingertip force sensors. **What sensing should I add to a gripper?** For most industrial picks, prioritize a wrist 6-axis force/torque sensor (for assembly, insertion, polishing) and grip-current/position monitoring for grasp confirmation. Fingertip tactile arrays, optical tactile (GelSight-class), and slip detection are powerful but mostly worth it only when the manipulation itself is the hard part — research, dexterous hands, and delicate variable handling. **What flange and interface will the gripper bolt to?** Most robot wrists use an ISO 9409-1 circular flange (a coded bolt pattern with a locating boss and dowel). Match the gripper plate to the robot's flange code or make an adapter, and use the dowel for repeatable alignment. 
For signals, expect discrete 24 V I/O for simple pneumatic tools, IO-Link for smart single devices, or a fieldbus (EtherCAT/PROFINET/EtherNet/IP) for full data exchange with electric and adaptive grippers. ## Changelog - **2026-05-28** — Initial publication. --- # Industrial Robot Arms: 6-Axis, SCARA & Delta — The Ultimate Guide URL: https://blog.robo2u.com/posts/industrial-robot-arms-ultimate-guide/ Published: 2026-05-26 Updated: 2026-06-20 Tags: industrial-robots, robot-arm, 6-axis-robot, scara, delta-robot, payload, repeatability, manufacturing-automation, guide Reading time: 38 min > A working engineer's guide to industrial robot arms — 6-axis articulated, SCARA, and delta — with real FANUC/ABB/KUKA/Yaskawa specs, the math behind payload and cycle time, repeatability vs accuracy, and a selection cheat-sheet. An industrial robot arm is the most general-purpose motion machine ever mass-produced. Bolt one to the floor, give it a tool and a program, and it will weld a car body this year, palletize cartons next year, and tend a CNC the year after — same hardware, different software and tooling. There are roughly four million of these installed and working worldwide, and the global fleet grows by something like half a million units a year. That is not hype; that is the installed base doing the unglamorous work of modern manufacturing. This guide is the long version, written for the people who actually specify, integrate, and commission these machines. We'll go configuration by configuration — articulated 6-axis, SCARA, delta, and the cartesian and cylindrical also-rans — and for each give real numbers with units, real products you can buy, and opinions with the reasons attached. 
Then we'll do the parts engineers get wrong: payload sized with dynamics rather than catalog headline, repeatability versus accuracy (they are not the same thing and the difference will cost you), cycle-time estimation, and the controller and safety realities that decide whether a cell ships on time. **The take**: The robot arm is almost never the hard part of a cell, and it is almost never where projects fail. The mechanism is mature, the big four vendors are all excellent, and repeatability of ±0.02–0.05 mm is a commodity. Projects fail on the *system* around the arm — tooling, part presentation, cycle-time math done optimistically, safety designed last, and a payload budget that ignored the gripper's inertia. Pick the configuration from the task, size against the worst point in the trajectory, and spend your engineering where it actually lives: the cell, not the catalog. Companion reading: [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/), [harmonic & cycloidal gearboxes](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), [robot actuators](/posts/robot-actuators-ultimate-guide/), [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/), and [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What an industrial robot arm is](#what-it-is) 3. [Kinematic configurations compared](#configurations) 4. [The 6-axis articulated arm, anatomy](#six-axis-anatomy) 5. [SCARA deep-dive](#scara) 6. [Delta & parallel robots deep-dive](#delta) 7. [The specs that actually matter](#specs) 8. [Repeatability vs accuracy](#repeatability-accuracy) 9. [Controllers & programming](#controllers) 10. [End-of-arm tooling & integration](#eoat) 11. [Motion: trajectory, singularities, TCP](#motion) 12. [Safety & guarding](#safety) 13. [Selecting & deploying an arm](#selecting) 14. 
[Frequently asked questions](#faq) ## Key takeaways - An industrial robot arm is a programmable, multi-axis manipulator — most commonly a serial chain of rigid links and rotary joints ending in a tool flange. The articulated 6-axis arm is the default because six degrees of freedom is the minimum to reach an arbitrary position *and* orientation in 3D space. - The market is dominated by the "big four" — **FANUC, ABB, KUKA, and Yaskawa (Motoman)** — with strong specialists like **Stäubli** (precision/cleanroom), **Epson, Mitsubishi, and Omron** (SCARA), and **Kawasaki, Nachi, Comau, and Hyundai/Hanwha** filling out the field. Roughly 4 million units are installed worldwide. - **Configuration follows task.** 6-axis articulated for arbitrary orientation and reach (welding, assembly, machine tending, painting); **SCARA** for fast planar pick-place and vertical insertion; **delta** for the highest-rate lightweight picking and sorting; cartesian/gantry for long-stroke, high-payload, large-envelope work. - **Six axes, six reasons.** J1–J3 (the "arm") set the wrist's position; J4–J6 (the "wrist") set orientation. Each joint is a motor plus a precision reducer — usually an **RV cycloidal** gear on the heavy lower axes and a **harmonic drive** on the lighter wrist axes. - **SCARA dominates high-speed assembly** because its selective compliance — stiff vertically, compliant horizontally — is exactly what peg-in-hole insertion wants, and 4 axes is all a flat-world pick-place needs. Cycle times of ~0.3–0.5 s for a standard 25/305/25 mm move are routine. - **Delta robots are insanely fast** because the motors stay on the fixed base and only thin carbon arms move — minimal moving mass means accelerations of 100–150 m/s² and rates above 150–200 picks/min, at the cost of small payloads (typically ≤3 kg) and a domed work envelope. 
- **Payload is not the catalog number.** Rated payload includes the end-effector *and* the part *and* the dynamic loads from acceleration, and it is constrained by the allowable moment of inertia about the wrist axes. A 20 kg-rated arm with a heavy offset gripper may only safely carry a 12 kg part. - **Repeatability ≠ accuracy.** A typical 6-axis arm repeats to ±0.02–0.05 mm but may be *accurate* only to ±0.5–1 mm out of the box. Repeatability lets you teach points; accuracy (after calibration) is what offline programming needs. - **Controllers are the real moat.** Teach pendant plus vendor language — KUKA **KRL**, ABB **RAPID**, FANUC **KAREL**/TP, Yaskawa **INFORM** — plus offline tools like **RoboDK** and vendor sims. The cabinet, not the arm, is where motion quality and integration live. - **Safety is standards-driven, not optional.** Traditional industrial arms run fenced under **ISO 10218** with light curtains, interlocked gates, and safety-rated monitored stops. Cobots (ISO/TS 15066) trade speed and payload for fenceless operation — a different tool for a different job. - **Cycle time, not peak speed, sells the cell.** Headline "2000 mm/s" tool speeds are never sustained; real throughput is dominated by acceleration, deceleration, settling, and the dwell for the actual process (grip, weld, dispense). - **Buy the configuration the task needs, then size with margin.** Keep ~20–30% headroom on payload after dynamics, confirm reach to the *furthest* point with the tool's offset, and validate cycle time in the vendor sim before signing the PO. ## What an industrial robot arm is Strip the marketing and an industrial robot arm is a **programmable, reprogrammable, multi-purpose manipulator** with three or more axes — that's essentially the ISO 8373 definition, and it's a good one. The "reprogrammable, multi-purpose" part is what separates a robot from a dedicated piece of machinery. A cam-driven assembly machine does one thing forever. 
A robot does whatever you teach it, and you can re-teach it next quarter. The dominant form is the **articulated serial manipulator**: a chain of rigid links connected by rotary joints, anchored to a base at one end and terminating in a mechanical interface (the tool flange) at the other. Each joint is independently driven, almost always by a servo motor through a precision gear reducer, with a feedback device (an [encoder](/posts/encoders-ultimate-guide/)) closing the loop. The controller solves the kinematics — given a desired flange pose, what joint angles get you there — and coordinates all axes so the tool follows the path you programmed. ### The big four, and the rest The industrial robot business is unusually concentrated. Four vendors own the majority of the articulated-arm market between them: - **FANUC** (Japan) — yellow arms, enormous installed base, legendary reliability and uptime, deep CNC/automation integration. The default in automotive and a safe bet anywhere. - **ABB** (Sweden/Switzerland) — the **IRB** series, strong in welding, painting, and pick-place; the IRC5/OmniCore controllers and **RAPID** language are widely liked. - **KUKA** (Germany, now owned by Midea) — orange **KR** arms, strong in automotive body-in-white, the **KRL** language and the well-regarded KR C controllers. - **Yaskawa Motoman** (Japan) — huge in arc welding and handling; **INFORM** language, the YRC1000 controller, and a massive servo heritage (Yaskawa is also a top servo-drive maker). Beyond the big four, the specialists matter when the task is specialized. **Stäubli** (Switzerland) builds the precision and cleanroom arms you reach for in medical, semiconductor, and pharma — tighter repeatability, fully enclosed for washdown and ISO Class cleanrooms. **Epson, Mitsubishi, Omron (Adept heritage), and Yamaha** dominate SCARA. **ABB (FlexPicker), Fanuc, and Codian** lead delta. 
And **Kawasaki, Nachi, Comau, Hyundai/Hanwha, Denso, and Doosan** round out a field where, frankly, all the major players build good machines. There are no bad big-vendor arms; there are only mismatches between arm and task.

### The installed base context

The International Federation of Robotics tracks a global operational stock of industrial robots in the low-to-mid millions — on the order of **4 million units** working in factories worldwide as of the mid-2020s, with annual installations of roughly half a million units. Automotive and electronics are the two biggest consumers; metal, plastics, food, and logistics follow. China is by far the largest single market and the fastest-growing. The point for an engineer: this is mature, high-volume technology with deep supply chains, abundant spares, and a large pool of trained integrators. You are not pioneering.

## Kinematic configurations compared

Before specs, configuration. The mechanical arrangement of axes — the **kinematic structure** — determines the shape of the work envelope, the achievable speed and payload, the stiffness, and what kinds of tasks the arm is good at. Get this choice right and everything downstream is easier. (For the underlying math of forward/inverse kinematics, see [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/).)

There are five configurations worth knowing:

- **Articulated (6-axis)** — the human-arm analog: serial rotary joints. Maximum dexterity and orientation freedom. The general-purpose default.
- **SCARA** — Selective Compliance Assembly Robot Arm: two parallel rotary joints in a horizontal plane plus a vertical (Z) and a rotation (theta). Fast and stiff in Z, compliant in the horizontal plane.
- **Delta / parallel** — three (or four) arms driven from a fixed base move a small platform. Light, blisteringly fast, limited payload and envelope.
- **Cartesian / gantry** — three linear axes (X, Y, Z) at right angles.
Simple kinematics, huge envelope, very high stiffness and payload, but bulky.
- **Cylindrical** — a rotary base plus a vertical and a radial (prismatic) axis. Largely legacy now, occasionally seen in machine tending and dispensing.

| Configuration | Axes / DoF | Work envelope shape | Typical payload | Typical repeatability | Top tasks | Weak at |
|---|---|---|---|---|---|---|
| **Articulated 6-axis** | 6 (rotary) | Spherical-ish, large | 3–800+ kg | ±0.02–0.06 mm | Welding, assembly, machine tending, painting, palletizing | Highest pick rates; envelope per footprint |
| **SCARA** | 4 (3 rotary + Z) | Cylindrical annulus | 1–20 kg | ±0.01–0.02 mm | Planar pick-place, assembly, screwdriving, vertical insertion | 3D orientation; tilted approaches |
| **Delta / parallel** | 3–4 | Shallow dome | 0.1–8 kg | ±0.05–0.1 mm | High-speed picking, sorting, packaging | Payload; reach; complex orientation |
| **Cartesian / gantry** | 3+ (linear) | Rectangular box | 5–2000+ kg | ±0.01–0.1 mm | Large-area dispensing, CNC, palletizing, machine tending | Footprint; orientation; agility |
| **Cylindrical** | 3–4 | Cylindrical | 5–100 kg | ±0.05 mm | Simple tending, dispensing (legacy) | Flexibility; mostly superseded |

> **Rule of thumb:** If the task needs arbitrary tool *orientation* in 3D, you need 6 axes. If the work is essentially flat (parts arrive and leave on horizontal surfaces) and you mostly move and press down, SCARA is faster and cheaper. If you're picking small light things very fast off a belt, delta wins. Everything else is a refinement of these three.

The remainder of this guide concentrates on the three configurations that dominate new installations — articulated, SCARA, and delta — because cartesian/gantry and cylindrical are either special-purpose (long-stroke, heavy) or legacy.

## The 6-axis articulated arm, anatomy

The articulated arm is the one most people picture when they hear "industrial robot."
Six revolute joints in series, each adding a degree of freedom, ending in a tool flange. Why six? Because **six degrees of freedom is the minimum needed to place a rigid body at an arbitrary position (X, Y, Z) and an arbitrary orientation (roll, pitch, yaw) anywhere within the envelope.** Three DoF buy you position; the next three buy you orientation. Fewer than six and you lose the ability to reach some poses; more than six (a 7-axis "redundant" arm) buys you the ability to reach the same pose multiple ways — useful for dodging obstacles and singularities, common on cobots, rarer on heavy industrial arms.

### The joints, J1 through J6

Vendors number the axes J1–J6 (FANUC, Yaskawa) or A1–A6 (KUKA) or axis 1–6 (ABB). The roles are universal:

- **J1 — base rotation.** The whole arm swivels about a vertical axis. Biggest moment arm, biggest gear, often the slowest in deg/s but it moves the most mass.
- **J2 — shoulder.** Pitches the lower arm fore/aft. Carries the entire arm's weight as a cantilever; the highest-torque joint, frequently with a counterbalance (gas spring or mechanical) to offload gravity.
- **J3 — elbow.** Pitches the upper arm. Together J1–J3 position the wrist center in space.
- **J4 — wrist roll.** Rotates the forearm about its own axis.
- **J5 — wrist pitch/bend.** The joint that lets the tool point up, down, or sideways. The classic site of the wrist singularity (more below).
- **J6 — tool roll.** Final rotation of the flange about its axis; spins the tool.

The clean mental model: **J1–J3 are "the arm" and set *where* the wrist is; J4–J6 are "the wrist" and set *how* the tool is oriented.** Most modern wrists are "in-line" or "hollow-wrist" designs where J4–J6 axes intersect at (or near) a point — the wrist center — which makes the inverse kinematics solvable in closed form and keeps dress packs routed cleanly through the arm.
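
To make the arm/wrist split concrete: when the J4–J6 axes meet at a point, the wrist center can be recovered from the commanded flange pose alone by backing off along the flange's approach axis. The sketch below is a minimal illustration under stated assumptions: the approach axis is taken as the third column of the flange rotation matrix, and `d6` (the flange-to-wrist-center distance) and all numbers are hypothetical rather than taken from any vendor's arm.

```python
def wrist_center(flange_pos, flange_rot, d6):
    """Return the wrist-center position for a given flange pose.

    flange_pos: [x, y, z] of the flange in the base frame (metres)
    flange_rot: 3x3 rotation matrix (list of rows) of the flange frame
    d6:         assumed distance from wrist center to flange along the approach axis
    """
    # Approach axis = third column of the rotation matrix (assumed convention).
    approach = [flange_rot[row][2] for row in range(3)]
    # Step back from the flange along the approach axis by d6.
    return [p - d6 * a for p, a in zip(flange_pos, approach)]


# Tool pointing straight down: flange z-axis is world -z (180 deg about x).
R_down = [[1, 0, 0],
          [0, -1, 0],
          [0, 0, -1]]

wc = wrist_center([0.8, 0.2, 0.5], R_down, d6=0.10)
print(wc)  # the wrist center sits 0.10 m above the flange
```

With the wrist center known, J1–J3 can be solved for position and J4–J6 for the remaining orientation, which is the decoupling that a closed-form solution relies on.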
### Each joint is a motor and a reducer

Every axis is a servo motor driving the link through a high-ratio precision gearbox. The gearbox is doing the heavy lifting — literally. Direct-drive servos can't produce the torque these joints need at a reasonable size, so you multiply torque (and divide speed) with a reducer that must also have near-zero backlash, because backlash at a joint becomes positional error at the tool, amplified by the link length.

Two gear technologies dominate, and the split is consistent across vendors:

- **RV (cycloidal) reducers** — typically Nabtesco RV-series — on the heavy lower axes (J1, J2, J3). They handle high torque, high shock load, and high moment loads with excellent rigidity. This is why they live where the loads are.
- **Harmonic (strain-wave) drives** — Harmonic Drive LLC and others — on the lighter wrist axes (J4, J5, J6). They're compact, light, and have zero backlash, ideal where you want low inertia and high precision but don't need to survive a tank running over them.

The why and the trade-offs are a guide of their own — see [harmonic & cycloidal gearboxes](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/) — and the motors and drives behind them are covered in [robot actuators](/posts/robot-actuators-ultimate-guide/). The short version: backlash and torsional stiffness of these reducers are the single biggest mechanical contributor to an arm's repeatability and its dynamic accuracy under load.

### The wrist singularity

A **singularity** is a configuration where the arm loses a degree of freedom — two joint axes line up, and the inverse kinematics demands an impossible (infinite) joint velocity to maintain the commanded tool path. The most infamous is the **wrist singularity**: when J5 approaches 0°, the J4 and J6 axes become collinear.
Both joints now do the same thing, you've effectively lost an axis, and if the tool tries to pass straight through that alignment at speed, J4 and J6 are asked to flip 180° instantaneously. The controller faults out or the arm lurches.

There are three classic singularity types on a 6-axis arm: **wrist** (J4/J6 align), **shoulder** (wrist center crosses the J1 axis), and **elbow** (arm fully extended, J2/J3 nearly straight). You design around them — keep the wrist center away from the J1 axis, don't program paths that drive through full extension or J5=0 — and modern controllers offer singularity-avoidance modes that reroute or slow through the danger zone. More on handling these in [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/).

## SCARA deep-dive

SCARA stands for **Selective Compliance Assembly Robot Arm** (sometimes "Articulated"), and the name is the whole design philosophy. It has four axes: two parallel revolute joints rotating about vertical axes (J1 shoulder, J2 elbow) that move the arm in a horizontal plane, a third axis that translates a Z (vertical) shaft up and down, and a fourth that rotates that shaft (theta). The arm is **rigid in the vertical direction and selectively compliant in the horizontal plane** — exactly the property you want for assembly.

### Why selective compliance matters

Consider peg-in-hole insertion, the canonical assembly task. You press the peg straight *down* with high force and stiffness — the SCARA's Z axis is stiff, so it pushes hard and tracks vertical position precisely. But the peg is never *perfectly* centered over the hole; there's always some lateral misalignment. A fully rigid machine would jam or shear. The SCARA's horizontal compliance lets the arm deflect slightly sideways, letting chamfers guide the peg into the hole. Stiff where you need force, compliant where you need forgiveness.
That's selective compliance, and it's why the SCARA was invented (at Yamanashi University in the late 1970s) specifically for assembly.

### Why it dominates high-speed planar work

A flat-world task — pick a component from a tray, move it across the bench, insert or place it — needs exactly four degrees of freedom: X, Y, Z, and rotation about Z. A 6-axis arm doing this job is carrying two extra wrist axes it doesn't need, with their mass and inertia, for no benefit. The SCARA carries only what the task requires, so it's lighter, stiffer, and faster.

The numbers are real. A standard SCARA cycle-time benchmark is the **25-305-25 mm move**: lift 25 mm, traverse 305 mm horizontally, lower 25 mm, and return — a round trip representing a typical pick-place. Good SCARAs (Epson G-series, Stäubli TS2, Yamaha YK, Omron eCobra) do this in roughly **0.30–0.45 s**, with repeatability around **±0.01–0.02 mm**. That translates to throughput on the order of:

```
Cycle-time / throughput (SCARA pick-place)
------------------------------------------
Standard move (25-305-25 mm round trip):  t_cycle = 0.35 s (typical)
Add process dwell (grip + place):         t_proc  = 0.15 s
Effective cycle:                          t = 0.35 + 0.15 = 0.50 s

Throughput = 3600 / t = 3600 / 0.50 = 7200 parts/hour
           = 120 parts/min
```

Add screwdriving, dispensing, or vision and the dwell grows, but the headline is clear: for fast, repetitive, planar pick-place-and-press work, the SCARA is the cost-effective and high-throughput answer. Reaches typically run **120–1200 mm** radius; payloads **1–20 kg** (most in the 3–10 kg band).

> **Rule of thumb:** If your parts arrive and depart on roughly horizontal surfaces and the task is move-and-press, choose SCARA before a 6-axis. You'll get more throughput per dollar and the programming is simpler. Reach for 6 axes only when the approach must be tilted or the orientation is genuinely three-dimensional.
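
The SCARA throughput arithmetic in this section is easy to rerun for other dwell times. The sketch below is a rough illustration, not a vendor sizing tool: the 0.35 s move time and 0.15 s grip-and-place dwell come from the worked example, while the longer dwell values and the function name `parts_per_hour` are hypothetical.

```python
def parts_per_hour(t_move_s, t_dwell_s):
    """Throughput of one arm, given move time and process dwell in seconds."""
    cycle = t_move_s + t_dwell_s
    if cycle <= 0:
        raise ValueError("cycle time must be positive")
    return 3600.0 / cycle


T_MOVE = 0.35  # standard 25-305-25 mm round trip, typical value from the text

scenarios = [
    ("grip + place", 0.15),        # dwell from the worked example
    ("with vision check", 0.40),   # hypothetical dwell
    ("with screwdriving", 1.20),   # hypothetical dwell
]

for label, dwell in scenarios:
    rate = parts_per_hour(T_MOVE, dwell)
    print(f"{label:20s} {rate:7.0f} parts/hour  {rate / 60:6.1f} parts/min")
```

The point the sweep makes is that once the process dwell exceeds the move time, shaving milliseconds off the robot's motion buys very little; the dwell is what to attack.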
## Delta & parallel robots deep-dive

The delta robot is a **parallel** mechanism: instead of a serial chain where each motor carries all the motors downstream of it, three arms reach down from a fixed overhead base to a common moving platform (the "traveling plate"). Each arm is a motor-driven upper link plus a pair of light rods (a parallelogram) that constrain the platform to stay parallel to the base. A fourth, central, telescoping shaft often adds a rotation. ABB's **FlexPicker (IRB 360)** is the archetype; FANUC, Codian, and others make their own.

### Why parallel kinematics is so fast

The magic is **where the mass lives**. In a serial 6-axis arm, the J1 motor must accelerate J2's motor, which must accelerate J3's, and so on — the actuators are part of the moving mass. In a delta, all three motors are bolted to the fixed base and *never move*. The only things that accelerate are three thin carbon-fiber rods and a tiny platform. Moving mass is minimal, so accelerations are enormous: **100–150 m/s²** (roughly 10–15 g) is normal, and that's what produces the eye-watering pick rates.

```
Delta pick rate (idealized)
---------------------------
Classic "Adept cycle": 25 mm up, 305 mm across, 25 mm down, return
Top deltas:  t ≈ 0.25–0.30 s per pick
Rate = 60 / t = 60 / 0.30 = 200 picks/min (theoretical)
Sustained with vision + conveyor tracking: ~150–180 picks/min typical
```

Parallel kinematics also stack errors *favorably*: in a serial arm, error at J1 propagates through every downstream link; in a parallel arm the three legs average out, and the structure is stiff for its mass.

### The trade-offs: payload and envelope

You pay for all that speed in two currencies. **Payload is small** — typically **0.1–3 kg**, with heavy-duty deltas reaching 6–8 kg — because the light rods that make it fast can't carry much.
And **the work envelope is a shallow dome**, a flat cylinder maybe **800–1600 mm** in diameter and only **200–500 mm** tall, because the parallelogram geometry constrains where the platform can reach. The delta also struggles with arbitrary orientation; you get rotation about the vertical axis and that's usually it.

The result is a robot that is unbeatable at exactly one job: **picking many small, light objects very fast from a moving belt and placing them** — packaging chocolates, sorting pharmaceuticals, loading blister packs, assembling small electronics, primary food handling. Pair it with line-tracking vision and it picks parts off a conveyor without stopping the belt. For anything heavy, large, or requiring tilted approaches, look elsewhere.

> **Rule of thumb:** Delta is a specialist, not a generalist. If your part is under ~1 kg, the rate target is above ~80–100 picks/min, and the work is flat-belt pick-place, delta is the answer. Outside that box, a SCARA or 6-axis will serve you better.

## The specs that actually matter

Datasheets list dozens of numbers; a handful decide whether the arm does the job. Here are the ones to nail, and the traps in each.

### Payload — and why the catalog number lies

The rated payload is the **mass the arm can carry at the flange, including the end effector**, under specified conditions. Two traps:

1. **The gripper counts.** A 10 kg-rated arm carrying a 3 kg gripper has 7 kg left for the part — not 10. Budget the EOAT first. (See [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/).)
2. **Inertia, not just mass, is the real limit.** The arm's wrist motors are torque-limited. A compact load close to the flange is easy; the same mass on a long offset arm or eccentric tool may exceed the allowable **moment of inertia** about J4/J5/J6, even though the mass is "within payload." Every vendor publishes a payload diagram (allowable mass vs. center-of-gravity offset) — use it.

And dynamics: the load the joints feel is weight plus mass times acceleration, not just weight. In the worst case, accelerating straight up against gravity, the two add directly:

```
Effective wrist load with dynamics
----------------------------------
Part + gripper mass:        m = 8 kg
Gravity:                    g = 9.81 m/s²
Peak path acceleration:     a = 20 m/s²  (~2 g, aggressive but real)

Static force:   F_static = m·g       = 8 × 9.81  = 78.5 N
Dynamic force:  F_dyn    = m·(g + a) = 8 × 29.81 = 238.5 N

The joint sees ~3× the static load at peak accel.
Size the arm against F_dyn, then keep ~25% margin.
```

> **Rule of thumb:** Pick an arm whose rated payload is at least 1.3–1.5× your (part + gripper) mass, and confirm the load falls inside the published payload/inertia diagram at your actual tool offset. "It's under the rated payload" is necessary, not sufficient.

### Reach and work envelope

**Reach** is usually quoted as the maximum horizontal distance from the J1 axis to the wrist center (or to the flange) — e.g., a FANUC M-20iD/35 reaches ~1831 mm. But the *usable* envelope is smaller and oddly shaped: you can't reach close to the base (the arm folds into itself), you can't reach the full radius at all heights, and singularities carve out regions. Always confirm reach **to the furthest point you must service, with the tool's offset included, in a valid (non-singular) pose**. A robot that "reaches 1.8 m" may not reach your furthest fixture with the gripper pointing the way you need.

### Repeatability, accuracy, and speed

Covered in depth in the next section. On the datasheet: **repeatability** (e.g., ±0.03 mm) is the headline; **accuracy** is rarely published and is far worse. **Maximum tool speed** (e.g., 2000 mm/s) and **per-axis speeds** (deg/s) are peak values you'll almost never sustain — cycle time is what matters.

### Axis ranges and mounting

Each axis has a **motion range** in degrees (e.g., J1 ±170°, J5 ±120°). These define what poses are reachable and where you'll hit travel limits mid-path. **Mounting** matters too: floor, inverted (ceiling), wall, or angle.
Many arms support inverted mounting (great for delta-style overhead picking with a 6-axis) but with reduced payload or restricted axis ranges — check the spec.

### Protection rating and environment

**IP rating** (IEC 60529) tells you what the arm survives. Standard arms are around **IP54** (dust-protected, splash-resistant); the wrist is often rated higher (IP65/67) because it's in the spray. Variants exist for:

- **Foundry / harsh** — IP67/IP69K, sealed and pressurized, for die-cast and machining splash.
- **Washdown / food** — stainless, smooth, food-grade grease, NSF-compliant.
- **Cleanroom** — ISO Class 3–5 rated, low particle emission (Stäubli's strength).
- **Paint / explosive atmospheres** — ATEX/explosion-proof for spray booths.
- **Cold / harsh ambient** — specified operating temperature ranges, typically 0–45 °C standard.

Picking the wrong protection class is a common and expensive mistake — an IP54 arm in a washdown food line will die.

### Payload-and-reach selection bands

A quick orientation: where common payload/reach combinations land and what arm class they imply. Match your (part + EOAT, with dynamics and margin) load and your furthest serviced point to a band, then shortlist within it.

| Payload band | Reach band | Arm class | Representative members | Typical jobs |
|---|---|---|---|---|
| 1–7 kg | 400–900 mm | Small 6-axis / SCARA | FANUC LR Mate, Stäubli TX2-60, Epson G6 | Bench assembly, small-part tending, packaging |
| 6–20 kg | 900–1800 mm | Mid 6-axis | KUKA KR 16, Yaskawa GP25, FANUC M-20iD | Arc welding, general handling, CNC tending |
| 20–70 kg | 1700–2700 mm | Large 6-axis | FANUC M-710iC, ABB IRB 4600 | Heavy handling, spot weld, palletizing |
| 100–300 kg | 2600–3200 mm | Heavy 6-axis | ABB IRB 6700, KUKA KR 210/300 | Automotive BIW, large-part handling |
| 500–1300 kg | 3000–3600 mm | Super-heavy 6-axis | KUKA KR 1000 titan, FANUC M-2000iA | Foundry castings, engine blocks, structures |
| 0.1–8 kg | Ø1100–1600 mm | Delta | ABB IRB 360, FANUC M-3iA | High-rate picking, sorting, packaging |

## Repeatability vs accuracy

This is the distinction that separates engineers from spec-sheet readers, and getting it wrong wrecks offline programming projects.

- **Repeatability** is how closely the robot returns to the *same* commanded point, over and over. It's a measure of *precision* — the tightness of the cluster. Industrial arms are superb here: **±0.02–0.05 mm** for a mid-size 6-axis, **±0.01 mm** for a SCARA.
- **Accuracy** (technically *pose accuracy* per ISO 9283) is how close the robot gets to the *commanded* position in real-world coordinates — the distance between where you told it to go and where it actually went. Out of the box, an uncalibrated arm may be accurate only to **±0.5–1.0 mm**, sometimes worse on a large arm.

The classic dartboard picture: repeatability is all the darts landing in a tight cluster; accuracy is whether that cluster is centered on the bullseye. A robot can be highly repeatable and badly inaccurate — every dart in the same wrong spot.

### Why the gap exists

The robot's controller computes where the flange *should* be from a kinematic model: the nominal link lengths, joint offsets, and zero positions.
Reality differs — links are machined to tolerance, gears have compliance, the arm sags under load, joints have small offsets, and thermal expansion shifts everything as the arm warms up. The controller doesn't know about these errors, so its computed pose drifts from the true pose. But because the *same* errors recur every time, the robot still returns to the same *physical* point reliably — hence great repeatability, poor accuracy.

### Why it matters: teach vs. offline programming

If you **teach** points by jogging the arm to each location and pressing "record," accuracy is irrelevant — you're commanding physical positions directly and the robot's repeatability brings it back. This is why traditional cells work fine despite poor absolute accuracy.

The moment you do **offline programming** — generating the path in CAD/sim (RoboDK, vendor software) from the part's geometry and downloading it — accuracy becomes critical. Now you're commanding coordinates the robot has never physically visited, and its model error shows up as the tool missing the work by a millimeter.

The fix is **calibration**: measuring the arm's true kinematics (with a laser tracker or a calibration artifact) and loading the corrected parameters so the model matches reality. A well-calibrated arm can reach **±0.1–0.2 mm absolute accuracy**, which makes offline programming viable. Vendors sell this as "absolute accuracy" options (ABB Absolute Accuracy, FANUC, etc.).

> **Rule of thumb:** Teach-and-repeat? Repeatability is your spec. Offline programming, multi-robot interchangeability, or CAD-driven paths? You need calibrated absolute accuracy — budget for it explicitly.

## Controllers & programming

The arm is the muscle; the **controller** is the brain, and it's where the vendors really differentiate.
The controller is a cabinet containing the servo drives (one per axis), the motion CPU that solves kinematics and plans trajectories in real time, safety hardware, and the I/O that ties the robot to the rest of the cell. The quality of the trajectory generation, the smoothness of blending, the singularity handling, and the integration tooling all live here — not in the steel.

### The teach pendant

Every industrial arm ships with a **teach pendant**: a handheld unit with a screen, a jog control, an enabling switch (the three-position deadman you must hold at half-press to move the arm in manual mode), and an emergency stop. You use it to jog the robot, teach points, edit and run programs, and diagnose faults. Modern pendants are tablets (KUKA smartPAD, ABB FlexPendant, FANUC iPendant); the interaction model is universal even if the UI differs.

### Vendor programming languages

Each major vendor has its own robot language, and they're more alike than different — point-to-point and linear move commands, I/O, loops, conditionals, frames, and tool/workobject definitions:

- **KUKA — KRL** (KUKA Robot Language): Pascal-flavored, with `PTP`, `LIN`, `CIRC` motion commands.
- **ABB — RAPID**: structured, readable, with `MoveJ`, `MoveL`, `MoveC`; well-regarded by programmers.
- **FANUC — TP (Teach Pendant) + KAREL**: TP is the menu-driven pendant language for most work; KAREL is the lower-level, C/Pascal-like language for complex logic.
- **Yaskawa — INFORM**: job-based, with `MOVJ`, `MOVL`.

You don't really "choose" a language — you choose a vendor and inherit its language. The languages are simple enough that a competent automation engineer is productive in any of them within days.

### Offline programming and simulation

For complex paths, multiple robots, or minimizing line downtime, you program **offline**: build the cell in software, generate and verify paths in simulation, then download.
Options:

- **Vendor sims** — ABB RobotStudio, KUKA.Sim, FANUC ROBOGUIDE, Yaskawa MotoSim. Highest fidelity for that vendor; the digital twin matches the controller behavior.
- **Vendor-neutral — RoboDK** — supports nearly all brands, great for mixed fleets, simpler than the vendor suites, popular with integrators and for machining/additive paths.

Offline programming only pays off if your arm is accurately calibrated (see previous section) — otherwise the beautiful simulated path misses the real work by a millimeter.

The cell controller above the robot — the PLC, the fieldbus, the SCADA — is its own discipline; see [industrial automation: PLC, SCADA & fieldbus](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/). And the hard real-time motion control under the hood is covered in [real-time control systems](/posts/real-time-control-systems-ultimate-guide/).

## End-of-arm tooling & integration

The arm does nothing useful until you bolt a tool to the flange. EOAT is where motion becomes work, and it's the most under-engineered part of most cells.

### The flange and ISO 9409

The tool flange is standardized: **ISO 9409-1** defines the bolt-circle diameters, pilot diameter, and pin location for mechanical interfaces, so a gripper from one vendor bolts to an arm from another. A common designation looks like `50-4-M6` (50 mm pitch circle, 4 holes, M6 thread). Standardizing this is one reason the EOAT ecosystem is so interchangeable. Confirm your arm's flange designation and order tooling (or an adapter plate) to match.

### Payload budgeting at the flange

This is the recurring theme: the flange carries the *whole* tool — gripper, fingers/cups, sensors, brackets, cabling, any compliance device or tool changer — and the part. Budget all of it, with the center of gravity offset, against the payload/inertia diagram.
```
EOAT payload budget
-------------------
Gripper body:            2.0 kg
Fingers + adapter:       0.6 kg
Vacuum/sensor + cable:   0.4 kg
Mounting plate:          0.3 kg
--------------------------------
EOAT subtotal:           3.3 kg
Heaviest part:           5.0 kg
--------------------------------
Total at flange:         8.3 kg

Choose arm rated ≥ 1.3 × 8.3 = ~11 kg  → spec a 12–20 kg arm
```

### Dress packs and cabling

The wires, hoses, and cables feeding the tool — power, signal, air, weld gas, dispense — are the **dress pack**, and they are a leading cause of cell downtime. As the arm moves, the dress pack flexes, twists, and rubs; poorly managed, it snags, kinks, or fatigues and fails. Good practice: route through the arm's hollow wrist where available, use proper energy chains and retraction systems, and simulate the dress-pack motion (vendor sims model this) to catch collisions and over-twist before commissioning. On a welding or spot-weld arm the dress pack is half the engineering. Don't treat it as an afterthought.

For the business end itself, see [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/).

## Motion: trajectory, singularities, TCP

How the arm gets from A to B is the controller's job, but the programmer makes choices that decide cycle time and path quality.

### The TCP

The **Tool Center Point** is the point on the tool that you actually care about — the tip of a welding torch, the center of a gripper's grasp, the nozzle of a dispenser. You define the TCP relative to the flange (position and orientation), and all motion commands then refer to *that* point, not the flange. Get the TCP definition wrong and every taught point is off. Accurate TCP calibration (the four-point or multi-point touch-up method) is a basic but critical setup step.

### Joint moves vs. linear moves

Two fundamental move types, and the difference matters:

- **Joint move (`PTP` / `MoveJ` / `MOVJ`)** — the controller drives all joints from start to end angles simultaneously, each taking the most direct angular path. The TCP follows a *curved* path through space that is hard to predict by eye, but it's the **fastest** way to get from A to B. Use it for free-space repositioning where path shape doesn't matter.
- **Linear move (`LIN` / `MoveL` / `MOVL`)** — the controller coordinates all joints so the TCP travels in a **straight line** at a controlled speed. Essential for process moves (welding a seam, dispensing a bead, inserting a part) where the path *is* the point. Slower, and more likely to hit singularities or joint limits because the straight Cartesian path may demand awkward joint configurations.

> **Rule of thumb:** Use joint moves for getting *to* the work (fast, cheap) and linear/circular moves for *doing* the work (controlled path). Mixing them well is most of the cycle-time art.

### Blending

If the arm stopped dead at every taught point, cycle times would balloon. **Blending** (also "zone," "CNT," "fly-by," "approximate positioning") lets the arm round the corner near a waypoint without stopping — trading exact point-passing for speed. You set the blend radius (e.g., `fine` for exact stop, `z10` for a 10 mm zone in RAPID). Bigger blend zones are faster but cut corners; tune them per move. Aggressive blending on free-space moves and tight/exact positioning on process moves is the usual recipe.

### Singularities in motion

As covered in the anatomy section, linear moves are where singularities bite — a straight Cartesian path can drive the wrist through J5=0 and demand infinite joint speed. Mitigations: avoid programming through known singular regions, use the controller's singularity-avoidance modes, reorient the workpiece, or switch a problematic segment to a joint move.
The deeper treatment — Jacobians, manipulability, redundancy resolution — is in [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/).

## Safety & guarding

A full-speed industrial arm is a hazard that will kill a person without noticing. A mid-size 6-axis arm slews its tool at 2 m/s carrying tens of kilograms; it has no awareness of a human in its path. Safety is therefore not optional and not improvised — it's standards-driven engineering.

### The standards

The governing standard for industrial robot safety is **ISO 10218** (parts 1 and 2: the robot, and the robot system/integration), harmonized with the EU Machinery Directive and, in the US, mirrored by **ANSI/RIA R15.06**. Risk assessment per **ISO 12100** drives the design; safety functions are rated to **ISO 13849** (performance levels, PL d/e) or IEC 62061 (SIL). The practical upshot: you do a documented risk assessment, then implement safeguards whose reliability matches the risk.

### Traditional guarding

The default for a fast, heavy industrial arm is **physical separation** — keep people out of the robot's reach while it runs:

- **Fences / hard guarding** — perimeter fencing around the cell, with interlocked access gates that trigger a safe stop when opened.
- **Light curtains and area scanners** — opto-electronic barriers and safety laser scanners that detect entry and stop or slow the robot (ABB SafeMove, FANUC DCS, KUKA.SafeOperation zones).
- **Safety-rated controllers and stops** — Category 0/1 stops, safe-rated monitored stop, safe speed limits, and software-defined safe zones that the safety PLC enforces independently of the main program.

The robot's own safety options (ABB SafeMove, FANUC Dual Check Safety, KUKA.SafeOperation) let you define no-go zones and speed limits in software, monitored by redundant safety hardware — so you can sometimes shrink or eliminate physical fencing while keeping the safety rating.

### vs. cobots

**Collaborative robots** take a fundamentally different approach: they're designed (per **ISO/TS 15066**) to operate safely *alongside* people without fences, using force/torque limiting, rounded geometry, and speed-and-separation monitoring so that contact with a human stays below injury thresholds. The trade is steep: cobots are **slow** (often capped well below industrial speeds for safety) and **light** (typically 3–35 kg payload). They're the right tool when human-robot collaboration or rapid redeployment matters more than throughput — and the wrong tool when you need a fenced arm slinging 100 kg at full speed. The full comparison is in [collaborative robots (cobots)](/posts/collaborative-robots-cobots-ultimate-guide/).

> **Rule of thumb:** A traditional fenced industrial arm in safe-rated guarding is faster, cheaper per unit of throughput, and higher-payload than any cobot. Choose a cobot for collaboration and flexibility, not because fencing feels like a hassle — and always start the cell design from a risk assessment, not from the robot.

## Selecting & deploying an arm

Tie it together with a process. Selection is mostly arithmetic and discipline; the mistakes are almost always skipped steps.

### The selection sequence

1. **Characterize the task.** Process (handling, welding, assembly, dispensing, palletizing), part mass and geometry, presentation, required orientations, throughput target, environment (clean, foundry, washdown).
2. **Choose the configuration.** Flat move-and-press → SCARA. Tiny/light/fast belt picking → delta. Arbitrary orientation, reach, or payload → 6-axis. Long-stroke/heavy → cartesian/gantry.
3. **Size payload with dynamics + EOAT.** Total flange load, with CoG offset, against the payload/inertia diagram, plus 1.3–1.5× margin.
4. **Confirm reach** to the furthest serviced point, tool offset included, in a valid pose.
5. **Set repeatability/accuracy needs.** Teach-and-repeat → repeatability spec.
Offline/CAD-driven → calibrated accuracy option.
6. **Pick protection class** for the environment.
7. **Estimate cycle time** in the vendor sim — not on a napkin.
8. **Validate ROI/payback** before the PO.

### Cycle-time estimation

Headline speeds are useless; estimate the *actual* cycle, broken into moves and process dwells.

```
Cycle-time estimate (6-axis machine-tending example)
----------------------------------------------------
Approach to part (joint move):        0.8 s
Grip (close + confirm):               0.5 s
Move to machine (joint + linear):     1.5 s
Insert + release (linear):            1.2 s
Retract clear:                        0.6 s
Return to pick (joint move):          1.0 s
----------------------------------------------------
Robot cycle:                          5.6 s

Machine process time (parallel):      30.0 s  → robot is NOT the bottleneck
Effective station cycle:              30.0 s  → 120 parts/hour

If 4 machines tended by 1 robot:
  Robot busy: 4 × 5.6 = 22.4 s < 30 s machine time → feasible
  Throughput: 4 × 120 = 480 parts/hour
```

The lesson buried in that math: in machine tending the robot is usually *waiting*, so one robot can tend several machines — that's where the ROI comes from, not from robot speed.

### ROI / payback

The standard cut: a single industrial 6-axis arm runs roughly **\$30k–\$80k** for the robot itself; a *complete integrated cell* (tooling, guarding, vision, integration, programming) typically runs **2–4× the robot cost** — call it **\$100k–\$300k+** depending on complexity. Payback comes from displaced labor, higher uptime, consistent quality, and (in tending) one robot doing several machines' worth of loading.

```
Simple payback
--------------
Installed cell cost:                 $180,000
Labor displaced (2 shifts × 1 op):   2 × $55,000/yr = $110,000/yr
Quality/scrap savings:               $15,000/yr
Maintenance + energy:                −$12,000/yr
Net annual benefit:                  $113,000/yr

Payback = 180,000 / 113,000 ≈ 1.6 years
```

Most justified industrial cells target a payback under ~2–3 years. If your model says 5+, re-examine the throughput assumptions or the scope.
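
The tending and payback arithmetic above can be captured in a few lines so it is easy to rerun with your own figures. This is a sketch under simplifying assumptions: the tending check ignores queuing and machine synchronization, the payback is undiscounted, and the function names `max_machines_tended` and `simple_payback_years` are hypothetical helpers. The input numbers are the ones from the worked examples.

```python
import math


def max_machines_tended(robot_cycle_s, machine_time_s):
    """Largest n with n * robot_cycle_s <= machine_time_s (idealized, no queuing)."""
    return math.floor(machine_time_s / robot_cycle_s)


def simple_payback_years(cell_cost, annual_benefits, annual_costs):
    """Undiscounted payback: installed cost divided by net annual benefit."""
    net = sum(annual_benefits) - sum(annual_costs)
    if net <= 0:
        return math.inf  # never pays back on these assumptions
    return cell_cost / net


# Move and dwell breakdown from the machine-tending example, in seconds.
moves_s = [0.8, 0.5, 1.5, 1.2, 0.6, 1.0]
robot_cycle = sum(moves_s)
n_max = max_machines_tended(robot_cycle, machine_time_s=30.0)

# Cost and benefit figures from the simple payback example, in dollars per year.
payback = simple_payback_years(
    cell_cost=180_000,
    annual_benefits=[2 * 55_000, 15_000],  # displaced labor, quality/scrap savings
    annual_costs=[12_000],                 # maintenance + energy
)

print(f"robot cycle {robot_cycle:.1f} s, up to {n_max} machines, "
      f"payback {payback:.2f} years")
```

On these numbers the arithmetic allows a fifth machine, but only with about 2 s of slack per 30 s machine cycle, so the four-machine figure in the example is the more comfortable design point.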
### Real-product spec comparison A snapshot of representative arms across configurations and classes. Treat these as defensible mid-2020s figures for *typical* members of each series; exact variants differ. | Robot | Type | Payload | Reach | Repeatability | Typical use | |---|---|---|---|---|---| | **FANUC LR Mate 200iD/7L** | 6-axis (small) | 7 kg | 911 mm | ±0.01 mm | Bench assembly, tending, packaging | | **FANUC M-20iD/35** | 6-axis (mid) | 35 kg | 1831 mm | ±0.02 mm | Handling, welding, tending | | **ABB IRB 6700** | 6-axis (heavy) | 150–300 kg | 2600–3200 mm | ±0.05 mm | Automotive BIW, spot weld, heavy handling | | **KUKA KR 16 R2010** | 6-axis (mid) | 16 kg | 2010 mm | ±0.04 mm | Welding, handling, machine tending | | **KUKA KR 1000 titan** | 6-axis (super-heavy) | 1000 kg | 3202 mm | ±0.1 mm | Foundry, heavy castings, large parts | | **Yaskawa Motoman GP25** | 6-axis (mid) | 25 kg | 1730 mm | ±0.02 mm | General handling, arc welding | | **Stäubli TX2-60** | 6-axis (precision) | 4.5 kg | 670 mm | ±0.02 mm | Precision assembly, cleanroom, medical | | **Epson G6 SCARA** | SCARA | 6 kg | 600 mm | ±0.015 mm | High-speed assembly, pick-place | | **Yamaha YK500XG** | SCARA | 10 kg | 500 mm | ±0.01 mm | Electronics assembly, screwdriving | | **ABB IRB 360 FlexPicker** | Delta | 1–8 kg | Ø1130–1600 mm | ±0.1 mm | High-speed packaging, sorting | | **FANUC M-3iA/6S** | Delta | 6 kg | Ø1350 mm | ±0.1 mm | Picking, packing, assembly | Use the table to bracket your choice, then go to the vendor's actual datasheet and payload diagram for the specific variant. And remember the running theme: the arm is the easy part. Spend the engineering on tooling, presentation, cycle-time validation, and safety — that's where cells succeed or fail. ## Frequently asked questions **How many axes does an industrial robot arm need?** Six is the standard for a general-purpose articulated arm, because six degrees of freedom is the minimum to reach any position *and* orientation in 3D space. 
Four (SCARA) is enough for flat-world move-and-press tasks. Seven (redundant) arms — common on cobots — add an extra joint to dodge obstacles and singularities by reaching the same pose multiple ways. Fewer than six and some poses become unreachable. **What's the difference between repeatability and accuracy?** Repeatability is how tightly the robot returns to the *same* commanded point every time (typically ±0.02–0.05 mm — excellent). Accuracy is how close the robot gets to a point specified in real-world coordinates it has never physically visited (often ±0.5–1 mm uncalibrated — much worse). Teach-and-repeat needs only repeatability; offline/CAD-driven programming needs calibrated absolute accuracy. **Should I choose a SCARA or a 6-axis arm?** If your parts arrive and leave on roughly horizontal surfaces and the task is move-and-press (assembly, screwdriving, vertical insertion, planar pick-place), a SCARA is faster, cheaper, and stiffer — it carries only the four axes the task needs. Choose a 6-axis when you need tilted approaches, arbitrary tool orientation, or reach and payload beyond what a SCARA offers. **When does a delta robot make sense?** When you're picking many small, light objects (typically under ~1–3 kg) very fast (often 80–200 picks/min) from a flat or conveyor surface — packaging, sorting, primary food handling. Deltas are unbeatable at that one job because their motors stay on the fixed base, minimizing moving mass. They're poor at heavy loads, large envelopes, and tilted orientations. **What's the real payload I can carry?** Less than the rated number. The rating includes the end-effector, so subtract the gripper, fingers, sensors, and cabling first. Then check the *inertia*: the same mass on a long or offset tool may exceed the allowable moment about the wrist axes even though it's "under payload." And size against dynamics — at 2 g acceleration the joints feel ~3× the static load. 
Aim for 1.3–1.5× margin and confirm against the vendor's payload/inertia diagram. **Why do robots have singularities and how do I avoid them?** A singularity is a configuration where two joint axes align and the arm loses a degree of freedom, requiring impossible (infinite) joint speeds to maintain a Cartesian path. The wrist singularity (J4/J6 collinear when J5≈0) is the common one. Avoid them by not programming linear paths through known singular regions, keeping the wrist center off the J1 axis, avoiding full arm extension, using the controller's singularity-avoidance modes, or switching the problem segment to a joint move. **What programming language will I use?** Whatever your vendor uses — you choose the brand and inherit the language. KUKA uses KRL, ABB uses RAPID, FANUC uses TP plus KAREL, Yaskawa uses INFORM. They're all simple structured languages with point-to-point, linear, and circular moves, I/O, and frames; a competent engineer is productive in any of them within days. For complex or multi-robot work, program offline in the vendor sim or a neutral tool like RoboDK. **Do I need a fence, or can I use a cobot?** A traditional industrial arm needs guarding — fences, interlocked gates, light curtains, or safe-rated software zones — under ISO 10218, driven by a risk assessment. Cobots (ISO/TS 15066) can run fenceless via force/speed limiting, but they pay for it in speed and payload. Choose a cobot when human collaboration or fast redeployment genuinely matters; otherwise a fenced industrial arm gives far more throughput per dollar. **What does a complete robot cell cost?** The arm itself is roughly \$30k–\$80k for a typical 6-axis. The *integrated cell* — tooling, guarding, vision, controls, integration, and programming — usually runs 2–4× the robot cost, so \$100k–\$300k+ depending on complexity. Most justified cells target payback under 2–3 years, often driven by one robot tending several machines while they run. 
**What IP rating do I need?** Depends on the environment. Standard arms are around IP54 (dust-protected, splash-resistant), with the wrist often higher (IP65/67). For die-cast/machining splash use a foundry-spec IP67/IP69K arm; for food lines a washdown stainless variant; for spray booths an ATEX/explosion-proof variant; for semiconductor/medical a cleanroom-rated arm (Stäubli is the specialist). Matching protection to environment is a common, expensive thing to get wrong. **What's the difference between a joint move and a linear move?** A joint move (PTP/MoveJ) drives all axes to their target angles simultaneously — fastest through free space, but the tool follows an unpredictable curved path. A linear move (LIN/MoveL) coordinates the joints so the tool tip travels in a straight line at controlled speed — essential for process paths like welding or dispensing, but slower and more prone to singularities. Use joint moves to get *to* the work and linear moves to *do* the work. **Which vendor should I buy?** For most articulated-arm work, FANUC, ABB, KUKA, and Yaskawa are all excellent and the choice often comes down to existing fleet standardization, local integrator support, and price. Go to a specialist when the task is specialized: Stäubli for precision/cleanroom, Epson/Yamaha/Mitsubishi/Omron for SCARA, ABB/FANUC/Codian for delta. There are no bad big-vendor arms — only mismatches between arm and task. ## Changelog - **2026-05-26** — Initial publication. 
--- # Collaborative Robots (Cobots): The Ultimate Guide URL: https://blog.robo2u.com/posts/collaborative-robots-cobots-ultimate-guide/ Published: 2026-05-23 Updated: 2026-06-20 Tags: cobots, collaborative-robots, universal-robots, iso-ts-15066, power-force-limiting, human-robot-collaboration, force-control, automation, guide Reading time: 38 min > An engineer's deep dive into collaborative robots — the four ISO/TS 15066 collaboration modes, power & force limiting biomechanics, joint torque sensing, force control, programming, risk assessment, and a real-product selection table for 2026. There is a stubborn myth in this industry that a "cobot" is a category of robot — a small, rounded, friendly arm you can buy and then, by virtue of having bought it, work safely beside a human. That myth has sold a lot of hardware and produced a lot of badly deployed cells. The reality is narrower and more useful: collaboration is a property of the *application*, established by a *risk assessment*, achieved through one of four *safety-rated modes*. The robot is just the enabling hardware. This guide is the long version, written for the people who actually have to make the decision and sign off on the cell: the integrators, the controls engineers, the manufacturing engineers, and the safety engineers who own the risk assessment. We'll go through what "collaborative" really means under ISO 10218 and ISO/TS 15066, take power & force limiting (PFL) apart down to the biomechanical force tables, look at the joint hardware that makes contact sensing possible, cover force control and programming, and then get honest about deployment, applications, ROI, and product selection. Real numbers with units. Real products. Opinions with the reasons attached. 
**The take**: A cobot is not inherently safe — it is a robot *capable of being run in a collaborative mode*, and whether your specific cell is safe depends entirely on the risk assessment of the robot, the end effector, the workpiece, the process, and the speed you actually run. The single biggest lie in cobot marketing is "no fencing required." Sometimes true, often not, and never something you get to assume. Buy the safety functions and the sensing, then earn the fence-free deployment with a CE-marked risk assessment — or accept that most "cobots" in production today run fenced, at full speed, as cheap, easy-to-program light industrial arms. Both outcomes are fine. Pretending they're the same thing is not. Companion reading: [industrial robot arms](/posts/industrial-robot-arms-ultimate-guide/), [robot actuators](/posts/robot-actuators-ultimate-guide/), [robot sensors](/posts/robot-sensors-ultimate-guide/), and [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What makes a robot "collaborative"](#what-collaborative) 3. [Cobot vs traditional industrial arm](#cobot-vs-industrial) 4. [The four collaboration modes (ISO 10218 / ISO/TS 15066)](#four-modes) 5. [Power & force limiting deep-dive](#pfl-deep-dive) 6. [How cobots sense contact](#sensing-contact) 7. [Cobot joint hardware](#joint-hardware) 8. [Force control & compliance](#force-control) 9. [Programming cobots](#programming) 10. [Risk assessment & deployment](#risk-assessment) 11. [Real applications & ROI](#applications) 12. [The 2026 cobot market & landscape](#market) 13. [Selecting a cobot](#selecting) 14. [Frequently asked questions](#faq) ## Key takeaways - **"Collaborative" describes an application, not a robot.** It is established by a risk assessment (ISO 12100) and achieved through one of four collaboration modes defined in ISO 10218-2 and detailed in ISO/TS 15066. A robot only earns the label in context. 
- The four modes are **safety-rated monitored stop (SRMS)**, **hand guiding (HG)**, **speed and separation monitoring (SSM)**, and **power and force limiting (PFL)**. Only PFL permits intended or incidental contact between a person and a freely moving robot. Hand guiding allows contact only through the safety-rated guiding device, and the other two keep human and moving robot apart in space or time.
- **Cobots trade speed and payload for safety and redeployability.** Typical PFL cobots run TCP speeds of 250–1,000 mm/s in collaborative mode versus 2,000+ mm/s for a fenced industrial arm of the same class, with repeatability around ±0.03–0.10 mm versus ±0.02 mm.
- **PFL is governed by biomechanics, not robot specs.** ISO/TS 15066 publishes force and pressure limits for 29 body regions, split into **quasi-static** (clamping/trapping) and **transient** (free impact) thresholds. The head is the most restrictive area: ~130 N quasi-static for the skull/forehead and ~65 N for the face. Validation is done physically with a calibrated force gauge and pressure-indicating film.
- **Contact sensing is the enabling technology.** Joint torque sensors (Franka, KUKA iiwa, FANUC CRX, Doosan, some Techman) give clean, low-latency external-force estimates; motor-current estimation (early Universal Robots, many lower-cost cobots) is cheaper but noisier and worse at low speed.
- **The cobot joint is a modular actuator**: a frameless BLDC motor + a strain-wave (harmonic) gearbox + dual encoders (motor-side and output-side) + often a torque sensor + brake, all in one cartridge. See [robot actuators](/posts/robot-actuators-ultimate-guide/) and [gearboxes](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/).
- **Backdrivability and torque sensing enable hand guiding and compliance.** Impedance/admittance control lets you push the arm around for teaching and lets the arm yield to contact instead of fighting it.
- **Programming is the real cobot revolution.** Graphical teach pendants (URScript under the hood, FANUC's tablet TP, Techman's flow UI), teach-by-demonstration, and ecosystems like UR+ collapsed deployment time from weeks to days. That, more than safety, is why cobots sold. - **The end effector and workpiece are part of the safety case.** A "collaborative" robot holding a knife, a hot part, or a sharp sheet-metal blank is not a collaborative application. ISO/TS 15066 force limits assume blunt, non-hazardous contact. - **Most cobots in production run fenced at full speed.** Collaborative-rated does not mean collaboratively *operated*. Plenty of cells use a cobot purely for its easy programming and redeployability, then guard it like any other robot and run it fast. - **Higher-payload cobots arrived.** The UR20 (20 kg) and UR30 (30 kg), FANUC CRX-25iA (25 kg), and Doosan H-series (up to 25 kg) pushed PFL into palletizing and heavier machine tending, where cobots now genuinely compete. - **The humanoid wave shares the cobot's DNA.** Torque-controlled, backdrivable, force-limited joints are exactly the cobot actuator scaled and re-arranged. See [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/). - **Select on the safety case first, then payload/reach.** A clean PFL deployment depends as much on your gripper, part, and acceptable speed as on the arm. Pick the arm that meets payload/reach with margin, then prove the collaborative mode. ## What makes a robot "collaborative" Start with the definition, because almost everything that goes wrong downstream traces back to getting this wrong. A **collaborative operation** is a state in which a purpose-designed robot system works in direct cooperation with a human within a defined collaborative workspace. That's the language of ISO 10218. 
Note what it does *not* say: it does not say "a robot under 10 kg payload," or "a robot with rounded edges," or "a robot you bought from a company that uses the word cobot in its marketing." Collaboration is a *mode of operation* in a *defined workspace*, validated by a *risk assessment*. > **Safety rule:** The robot is never certified "collaborative" on its own. The *application* — robot + end effector + workpiece + process + layout + speed — is what a risk assessment can declare collaborative. A cobot arm shipped from the factory is collaborative-*capable*, nothing more. ### The UR origin story The modern cobot starts in Odense, Denmark, in the mid-2000s. Three researchers — Esben Østergaard, Kasper Støy, and Kristian Kassow — founded Universal Robots in 2005 on a thesis that the robotics industry had it backwards: robots were powerful, fast, expensive, dangerous, and miserable to program, so they sat behind fences serving high-volume lines, and the vast middle of manufacturing — the small and medium shops doing short runs — couldn't justify them. UR's bet was to invert every one of those properties. Make the arm light (the UR5 launched in 2008 at ~18 kg arm mass for a 5 kg payload). Make it slow enough to be safe. Make it programmable by a shop-floor operator with a 3D-graphical pendant instead of a robotics PhD. And — the killer feature — make it monitor its own forces so it could stop on contact, which under the right risk assessment meant it could run without a fence. That last property is the one everyone fixated on, and it's the one most misunderstood. UR didn't invent a "safe robot." They built a robot with *safety functions* — force/torque monitoring, safety-rated speed and position limits — that an integrator could use to *construct* a safe application. The robot enabled collaboration. It did not guarantee it. ### The myth that cobots are inherently safe Here is the dangerous mental shortcut: "It's a cobot, so I can stand next to it." No. 
A cobot running PFL at low speed with a smooth, rounded, lightweight payload and no pinch points is genuinely safe to touch — that's the design intent and it works. The *same* cobot moving 1,000 mm/s with a 5 kg steel fixture, or holding a deburring spindle, or carrying a sheet-metal blank with a 0.2 mm edge, is a hazard like any other robot. The arm hasn't changed. The application has. > **Safety rule:** Speed, payload, end-effector geometry, and workpiece hazards all leave the "collaborative" envelope independently. Any one of them can turn a collaborative-rated arm into a machine that needs guarding. Re-run the risk assessment whenever any of them changes. The practical consequence: a huge fraction of installed cobots run *guarded* — behind light curtains, area scanners, or physical fencing — at or near full speed, used purely as cheap, fast-to-deploy, redeployable light industrial robots. That is a completely legitimate use. It is just not "collaborative operation," and calling it that muddies the risk assessment. ## Cobot vs traditional industrial arm The honest framing is a set of tradeoffs, not a winner. For the full treatment of conventional six-axis arms, see the [industrial robot arms guide](/posts/industrial-robot-arms-ultimate-guide/). Here's how the two classes actually differ. | Attribute | PFL cobot (e.g. UR10e, FANUC CRX-10iA) | Industrial arm (e.g. 
FANUC M-10iD, ABB IRB 1300) | |---|---|---| | Payload (typical class) | 3–30 kg | 5–1,300 kg | | TCP speed, collaborative mode | 250–1,000 mm/s | n/a | | TCP speed, full / guarded | 1,000–2,000 mm/s (cobot guarded) | 2,000–8,000+ mm/s | | Repeatability | ±0.03–0.10 mm | ±0.02–0.05 mm | | Arm mass : payload ratio | ~3:1 to 5:1 | ~10:1 to 30:1 | | Fencing | Often none (PFL) or reduced | Hard guarding / interlocked enclosure | | Force/torque sensing | Built in (every joint or wrist) | Optional, add-on F/T sensor | | Programming | Graphical, operator-level | Vendor language + trained programmer | | Redeployability | High — wheel it to the next job | Low — bolted, fenced, re-engineered | | Mounting | Floor, wall, ceiling, table, cart | Typically heavy floor pedestal | | Cost (arm + controller) | €20k–€55k | €25k–€120k+ (then add guarding) | | Cell integration cost | Low — guarding often minimal | High — guarding, safety PLC, layout | | Duty cycle / lifetime | Good; gearing sized lighter | Excellent; built for 24/7 at speed | A few things worth stating plainly: **The cobot's payload-to-mass ratio is its core compromise.** To be safe on contact, the arm must be light and the joints relatively low-inertia, which means smaller motors and lighter gearing for a given payload. That's why a 10 kg cobot weighs ~33 kg while a 10 kg industrial arm might weigh 130 kg — the industrial arm's mass buys it stiffness, speed, and brutal duty cycle that the cobot deliberately trades away. **Repeatability is close but not equal.** A UR10e is ±0.05 mm; a comparable FANUC industrial arm is ±0.02–0.03 mm. For most assembly and tending that gap is irrelevant. For precision insertion or laser work it can matter, and you'd compensate with force control or vision rather than raw repeatability. **The economics flip on guarding and engineering, not the arm.** A cobot arm isn't dramatically cheaper than a small industrial arm. 
The savings are in the *cell*: less guarding, less safety-PLC integration, less layout engineering, and dramatically less programming time. On a short-run or frequently-changed job, redeployability is the whole value proposition. > **Safety rule:** If you intend to run a cobot guarded at full speed, you've bought an industrial arm with a nice pendant. Size it, fence it, and validate it like one — don't let "it's a cobot" shortcut the guarding decision. ## The four collaboration modes (ISO 10218 / ISO/TS 15066) This is the conceptual core of the entire field, and it is where most confusion lives. There are exactly **four** collaborative methods. They are defined in ISO 10218-2 (the system/integration standard) and elaborated in **ISO/TS 15066:2016**, the technical specification that put real numbers behind collaborative operation. The 2025 revision of ISO 10218-1/-2 folded much of TS 15066's content into the normative standards, but the four modes are unchanged. A cell can use one mode, or several in combination (e.g. SSM during transit, PFL at the workstation). They are not a ranking — each suits different applications. 
| Mode | What it controls | Human–robot contact | Typical hardware | Best for |
|---|---|---|---|---|
| **Safety-rated monitored stop (SRMS)** | Robot is stationary (Cat 2 stop, power on) when human is in workspace | Robot must be stopped before human is present | Safety scanner / light curtain + safe-stop function | Manual load/unload of a station; robot resumes when human leaves |
| **Hand guiding (HG)** | Operator physically moves the robot via a safety-rated guiding device | Yes — operator holds an enabling/guiding handle | Hand-guide device, enabling switch, safe speed monitoring | Teaching, heavy-part assist, lift-assist devices |
| **Speed & separation monitoring (SSM)** | Robot speed scaled to distance from human; full stop if too close | No — separation maintained at all times | Safety laser scanners / 3D vision zones + safe speed | Shared workspace, sequential collaboration, transit at speed |
| **Power & force limiting (PFL)** | Contact forces/pressures kept below biomechanical limits | Yes — intended or incidental contact permitted | Joint torque / current sensing, safe force monitoring | True side-by-side work, light assembly, tending |

### Safety-rated monitored stop (SRMS)

The simplest and most common in practice. The robot does its work autonomously; when a human needs to enter — to load a part, clear a jam, inspect — a safety device (scanner, light curtain) triggers a **safety-rated stop with power maintained** (effectively a Stop Category 2 per IEC 60204-1). The robot holds position, motors energized, monitored. When the human clears the zone, it resumes without a re-home.

This is collaboration by *time-sharing the space*: human and robot are never both moving in the workspace simultaneously. Cheap, robust, easy to validate. It's how the majority of "collaborative" machine-tending cells actually work.

### Hand guiding (HG)

The operator grasps a safety-rated guiding device and physically moves the robot.
The robot is in a safe-speed-monitored state; let go (or release the enabling switch) and it stops. This is the basis of **lift-assist** and **direct teaching**, and it's what makes "grab the arm and show it the path" possible. It depends utterly on backdrivability and torque sensing — see [force control](#force-control) below.

### Speed and separation monitoring (SSM)

The robot and human share the workspace, but a safety system continuously measures the **separation distance** and scales robot speed accordingly: full speed when far, slower as the human approaches, full stop below a protective separation distance. Implemented with safety laser scanners (SICK microScan3, Omron OS32C) defining warning and protective fields, or increasingly 3D safety vision (e.g. Veo Robotics-style systems, now part of broader offerings).

The math behind the protective separation distance \(S_p\) comes straight from ISO/TS 15066:

```text
Protective separation distance (ISO/TS 15066 §5.5.4):

  S_p(t0) = S_h + S_r + S_s + C + Z_d + Z_r

  S_h = human movement contribution during robot stopping
      = ∫ v_h dt   (use 1600 mm/s if directed speed unknown)
  S_r = robot movement during reaction time T_r
  S_s = robot stopping distance during T_s (braking)
  C   = intrusion distance (per ISO 13855; e.g. 1200 mm for hands)
  Z_d = position uncertainty of the human (sensor)
  Z_r = position uncertainty of the robot

Worked example (hand approach, modest robot):
  v_h = 1600 mm/s, T_r = 0.10 s, T_s = 0.25 s, robot speed v_r = 500 mm/s

  S_h = 1600 * (0.10 + 0.25) = 560 mm
  S_r = 500 * 0.10           = 50 mm
  S_s = 0.5 * 500 * 0.25     = 63 mm   (linear decel approx)
  C   = 1200 mm                        (hand intrusion, sensor resolution dependent)
  Z_d + Z_r ≈ 100 mm

  S_p ≈ 560 + 50 + 63 + 1200 + 100 = 1973 mm (~2.0 m)
```

That ~2 m number is why SSM cells need real floor space, and why people are often surprised that "collaborative" can mean "keep two meters apart." The dominant term is the human's own approach speed and the standardized intrusion distance.
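The separation formula above also runs in reverse: given a measured human distance, it yields the fastest robot speed that still respects \(S_p\). Here is a minimal Python sketch of both directions, assuming the same linear-deceleration approximation as the worked example and a stopping time that does not change with speed (a simplification; real stopping times come from the robot's safety data). The function names are illustrative, and `z` lumps \(Z_d + Z_r\) into one term.

```python
def protective_separation_mm(v_h, v_r, t_r, t_s, c, z):
    """S_p = S_h + S_r + S_s + C + (Z_d + Z_r).

    Speeds in mm/s, times in s, distances in mm.
    """
    s_h = v_h * (t_r + t_s)  # human keeps approaching during reaction + braking
    s_r = v_r * t_r          # robot still at commanded speed during reaction time
    s_s = 0.5 * v_r * t_s    # linear deceleration to rest over t_s (approximation)
    return s_h + s_r + s_s + c + z


def max_robot_speed_mm_s(separation, v_h, t_r, t_s, c, z):
    """Largest robot speed whose S_p does not exceed the measured separation.

    Assumes t_s is independent of speed. Returns 0.0 when the human is
    already inside the distance needed with the robot at rest.
    """
    budget = separation - v_h * (t_r + t_s) - c - z
    return max(0.0, budget / (t_r + 0.5 * t_s))


if __name__ == "__main__":
    # Parameters from the worked example above.
    params = dict(v_h=1600.0, t_r=0.10, t_s=0.25, c=1200.0, z=100.0)
    s_p = protective_separation_mm(v_r=500.0, **params)
    print(f"S_p at 500 mm/s: {s_p:.1f} mm")
    for distance in (3000.0, 2500.0, 2000.0, 1800.0):
        v_max = max_robot_speed_mm_s(distance, **params)
        print(f"human at {distance:.0f} mm -> robot limited to {v_max:.0f} mm/s")
```

With these parameters the robot must already be stopped when the human is about 1.86 m away, because the human-motion, intrusion, and uncertainty terms alone consume that much distance.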
### Power and force limiting (PFL) The only mode where a *moving* robot is permitted to *contact* a human, intentionally or by accident, because the system guarantees that any contact stays below biomechanical injury thresholds. This is the mode people mean when they say "cobot." It's also the hardest to validate, and it gets its own section. ## Power & force limiting deep-dive PFL is where the engineering gets genuinely interesting, because the limits aren't set by the robot — they're set by *human pain and injury physiology*, codified in **ISO/TS 15066:2016 Annex A**. ### Quasi-static vs transient contact The standard splits contact into two physically distinct cases, and they matter enormously: - **Quasi-static (clamping/crushing) contact:** the body part is trapped between the robot and a fixed surface, so the force can be sustained. This is the dangerous case — there's no escape, and force builds. Limits are *lower*. - **Transient (dynamic/free-impact) contact:** the robot hits a body part that is free to recoil or move away. The contact is brief (typically modeled at ≤0.5 s). The body absorbs energy and moves; injury threshold is *higher* — roughly **2×** the quasi-static force limit for most regions. > **Safety rule:** Design out the clamping case first. A pinch point between the robot and a fixed table, wall, or fixture is governed by the *quasi-static* limits — the strict ones — and no amount of force monitoring undoes a geometric trap. Eliminate fixed surfaces near the path before you tune forces. ### The biomechanical limit tables ISO/TS 15066 Annex A specifies, for **29 specific body regions**, a maximum permissible **pressure** (N/cm²) and **force** (N). Pressure governs local tissue/contusion injury; force governs the whole-body push. Both must be satisfied. 
Representative quasi-static values:

| Body region | Quasi-static force limit (N) | Quasi-static pressure limit (N/cm²) |
|---|---|---|
| Skull / forehead | 130 | 130 |
| Face | 65 | 110 |
| Neck (sides/muscle) | 150 | 140 |
| Back / shoulders | 210 | 160 |
| Chest (sternum) | 140 | 120 |
| Abdomen | 110 | 110 |
| Hand / fingers (non-dominant) | 140 | 190–280 |
| Upper arm / elbow joint | 150 | 190 |
| Forearm / wrist joint | 160 | 180–190 |
| Thigh / kneecap | 220 | 220 |
| Lower leg (shin) | 130 | 220 |

Two engineering takeaways. First, the **skull and face are the binding constraints** for most overhead or eye-level work: 130 N quasi-static for the skull, and just 65 N for the face, is not much. Second, **pressure is often the real limiter, not force.** A 130 N contact through a sharp edge or small radius concentrates pressure far above the limit even though the total force is fine. This is why PFL applications mandate rounded, blunt, large-radius contact surfaces on the arm *and* the end effector.

Transient limits are roughly double, but you don't get to assume transient. If the body part can be trapped, it's quasi-static — full stop.

### Building a PFL force budget

You work the problem backward from the limit. The robot's effective contact force depends on its speed and its *effective mass* at the contact point (a function of robot inertia, payload, and configuration):

```text
PFL contact-energy / force budget (simplified, transient case)

Effective mass at TCP:
  m_eff = M / 2 + m_L
  (M = lumped moving robot mass, m_L = payload + end-effector mass)

Transient contact treated as a spring collision:
  F_max = v_rel * sqrt(k * m_eff)
    v_rel = relative speed at contact (m/s)
    k     = effective contact stiffness (N/m), body-region dependent
            (ISO/TS 15066 Annex A tabulates spring constants, e.g.
             ~35 N/mm for the back and shoulders, ~150 N/mm for the skull region)

Worked example — limit chest force to 280 N transient:
  k_chest ≈ 25 N/mm = 25,000 N/m
  m_eff   ≈ 4 kg (small cobot + light tool)
  F_max   = 280 N target
  => v_rel = F_max / sqrt(k * m_eff)
           = 280 / sqrt(25000 * 4)
           = 280 / 316
           ≈ 0.89 m/s (≈ 885 mm/s)

So below ~0.9 m/s, a chest impact stays under the transient limit for
this effective mass. Halve m_eff or k and the safe speed rises; add
payload mass and it falls. This is why heavier payloads force lower
collaborative speeds.
```

This is the crux of why **higher payload forces lower collaborative speed**: at a given speed \(F_{max}\) scales with \(\sqrt{m_{eff}}\), so the speed that keeps you under a fixed force limit scales with \(1/\sqrt{m_{eff}}\), and doubling effective mass cuts your safe speed by ~30%. A 20 kg-payload cobot carrying a real load simply cannot move fast and stay collaborative — which is why even the UR20/UR30 typically run PFL work at reduced speed and reserve full speed for guarded operation.

### Validation: you measure it, you don't calculate it

Calculation gets you a design target. **Certification requires physical measurement**, and this is non-negotiable in a real CE/risk-assessment process.

The instrument is a **biofidelic force/pressure measurement device**: a spring-and-load-cell apparatus with a calibrated spring constant matching the relevant body region (commercial units include GTE Industrieelektronik's CoboSafe CBSF-75 and the Pilz PRMS). You command the robot to drive into the device at the worst-case point in the trajectory, and read peak force.

Pressure is measured separately with **pressure-indicating film** (Fujifilm Prescale) placed over the contact patch: the film changes color in proportion to local pressure, and you scan it to read the distribution. This catches the sharp-edge / small-radius problem that a single-axis force gauge misses entirely.

> **Safety rule:** Measure at the *worst* point in the trajectory and the *worst* configuration, not a convenient one.
Effective mass and speed vary across the workspace; the binding case is usually full extension at the highest-speed segment near a pinch geometry. One green measurement at the home position proves nothing. ## How cobots sense contact PFL and hand guiding both depend on the robot *knowing the external force* applied to it, continuously and fast. There are two fundamentally different ways to get that, and the choice ripples through cost, performance, and which applications are viable. For the broader sensor landscape see the [robot sensors guide](/posts/robot-sensors-ultimate-guide/). ### Motor-current estimation (the cheap way) If you know the current in each joint motor, you know the motor torque (torque ≈ \(k_t \times I\)). Subtract the torque you *expected* for the commanded motion — from a dynamic model of the arm (inertia, gravity, friction, Coriolis) — and the residual is the **external torque**. Map joint torques through the Jacobian and you get the external force at the TCP. No extra sensors; it's "free." The catch is everything that corrupts the estimate: **gearbox friction** (especially the stiction and hysteresis of strain-wave gears), unmodeled payload inertia, temperature drift, and the fact that current sensing is upstream of the gearbox so it can't see what the gearbox eats. At low speed, friction dominates and the external-force estimate gets noisy — exactly the regime where gentle contact happens. Early Universal Robots (CB-series) used this approach. It works, but its force resolution and low-speed sensitivity are modest, which forces conservative limits. See [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/) for how the current loop and torque estimation actually work. ### Joint torque sensors (the good way) Put a dedicated **torque-sensing element on the output side of each joint** — typically a strain-gauged or optical flexure — and you measure the actual joint torque *after* the gearbox, directly. 
Subtract the model-predicted torque and the residual external torque is far cleaner: gearbox friction is now inside the measurement, not corrupting it. This is the architecture of the **KUKA LBR iiwa** (the pioneer — torque sensors in all 7 joints), **Franka Emika** (link-side torque sensors, exceptional sensitivity), **FANUC CRX** (torque sensing enabling its smooth contact behavior), **Doosan** (torque sensors in all six joints), and **Techman** on some models. The payoff: fine force control, reliable low-speed contact detection, true impedance control, and the ability to do delicate force-controlled assembly (insertion, polishing) that current-estimation cobots struggle with. The cost: torque sensors add money, complexity, and a calibration burden to every joint. That's the central cost/performance fork in cobot design. ### Wrist force/torque sensors A third option: a single six-axis **F/T sensor at the wrist** (ATI, OnRobot HEX, Bota Systems). This measures force at the tool precisely — great for assembly and polishing — but it only sees forces *through the flange*. A contact on the *elbow* or *forearm* link is invisible to a wrist sensor. So wrist F/T is excellent for process force control but cannot, by itself, provide whole-arm PFL safety. Many cells use joint sensing for safety *and* a wrist sensor for fine process control. ## Cobot joint hardware Open up almost any modern cobot and you find the same elegant idea repeated six (or seven) times: a self-contained **modular actuator cartridge**. Understanding it explains nearly every spec on the datasheet. The deep treatments live in [robot actuators](/posts/robot-actuators-ultimate-guide/), [gearboxes](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), and [encoders](/posts/encoders-ultimate-guide/); here's how they combine in a cobot joint. ### The five ingredients 1. 
**Frameless BLDC motor.** A pancake permanent-magnet synchronous motor, rotor bonded to the joint shaft, stator into the housing — no separate motor housing or coupling, saving mass and length. Driven by field-oriented control. High pole count for smooth low-speed torque. 2. **Strain-wave (harmonic) gearbox.** The defining cobot reduction: 50:1 to 160:1 in a single thin, coaxial, near-zero-backlash stage. Zero backlash matters because backlash destroys both repeatability and force-sensing fidelity. The downside — strain-wave gears have meaningful friction and some flexibility — is exactly what torque sensing and good models exist to handle. (Cycloidal drives show up in the heavier base joints of larger cobots for higher torque density.) 3. **Dual encoders.** A **motor-side** encoder (high resolution, for the fast commutation/velocity loop) *and* an **output-side** encoder (after the gearbox, for true joint angle). The output encoder is what lets the controller measure actual joint position despite gearbox flex and lash — essential for repeatability and for clean torque estimation. See [encoders](/posts/encoders-ultimate-guide/). 4. **Joint torque sensor** (on torque-sensing cobots). The flexure + strain gauges or optical element discussed above, integrated into the joint output. 5. **Safety brake.** A spring-applied, electrically-released holding brake so the arm holds position (and a payload) on power loss — and so a safe-stop can hold against gravity. Wrap that in a hollow-shaft design for cable routing and you have one joint. Stack six, scale the sizes down toward the wrist, and you have the arm. ### Why this architecture won It's *manufacturable and serviceable*. Identical joint modules in a few sizes mean fewer part numbers, easier repair (swap a joint cartridge, not the arm), and a clean mapping from joint size to torque rating. 
It also makes the control problem tractable: each joint is a well-characterized torque source with known dynamics, which is what makes whole-arm dynamic modeling — and therefore force estimation — feasible. The tradeoff returns us to the central compromise: strain-wave gears and pancake motors are *light* but not as stiff or as overload-tolerant as the big spur/bevel trains in industrial arms. That's the hardware reason cobots run slower and carry less. ## Force control & compliance Sensing contact is half the story. *Responding* to it — yielding, pushing with controlled force, being shoved into place by an operator — is force control, and it's what separates a robot that merely *detects* a collision from one that *collaborates*. The actuator-level foundations are in the [robot actuators guide](/posts/robot-actuators-ultimate-guide/). ### Impedance vs admittance control Two ways to make an arm behave springy and compliant: - **Impedance control:** measure *position/velocity* deviation, command *force/torque*. The arm behaves like a programmable spring-damper: push it and it yields with a stiffness you set. Needs good torque control (and ideally torque sensing) at every joint. This is the KUKA iiwa / Franka native mode — you can set a 50 N/m "soft" wrist that floats, or a stiff one that resists. Naturally stable in contact, excellent for delicate insertion. - **Admittance control:** measure *force* (e.g. wrist F/T sensor), command *position*. The arm reads the force you apply and moves accordingly. Works on a position-controlled robot with one F/T sensor — cheaper to retrofit — but can go unstable against stiff environments and feels less natural at low force. Most current-estimation cobots do a pragmatic admittance-flavored compliance; torque-sensing cobots do real joint-level impedance. 
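The two control laws above can be contrasted in a few lines. This is a minimal one-axis sketch under stated assumptions, not any vendor's controller: the function names, the virtual mass and damping values, and the 1 kHz loop rate are all hypothetical, and a real arm runs the equivalent per joint or in Cartesian space with a full dynamic model. The 50 N/m stiffness reuses the "soft wrist" figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class ImpedanceGains:
    stiffness: float  # N/m
    damping: float    # N*s/m


def impedance_force(x, v, x_des, v_des, gains):
    """Impedance: measure position/velocity deviation, command a force."""
    return gains.stiffness * (x_des - x) + gains.damping * (v_des - v)


def admittance_step(x_cmd, v_cmd, f_ext, virtual_mass, virtual_damping, dt):
    """Admittance: measure external force, command a position.

    Integrates the virtual dynamics M*a + D*v = f_ext with one
    semi-implicit Euler step and returns the new commanded (x, v).
    """
    a = (f_ext - virtual_damping * v_cmd) / virtual_mass
    v_new = v_cmd + a * dt
    x_new = x_cmd + v_new * dt
    return x_new, v_new


if __name__ == "__main__":
    # Soft virtual spring: displaced 0.1 m from the setpoint, at rest.
    soft = ImpedanceGains(stiffness=50.0, damping=5.0)
    print(impedance_force(x=0.1, v=0.0, x_des=0.0, v_des=0.0, gains=soft))

    # Operator pushes with a steady 10 N for one second (1 kHz loop).
    x, v = 0.0, 0.0
    for _ in range(1000):
        x, v = admittance_step(x, v, f_ext=10.0,
                               virtual_mass=2.0, virtual_damping=20.0,
                               dt=0.001)
    print(x, v)
```

In the admittance case the commanded velocity settles toward `f_ext / virtual_damping`, so the "feel" is set entirely by the virtual parameters and the quality of the force measurement. In the impedance case the output is a force, which is why it needs joints that can actually track a torque command.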
The difference is palpable when hand-guiding: a Franka or iiwa floats like it's weightless; a position-controlled arm with admittance feels like pushing through molasses by comparison.

### Why backdrivability matters

**Backdrivability** is the ability to move the joint by pushing on the output — i.e. the gearbox doesn't lock you out. It's set mostly by gear ratio and friction: low-ratio, low-friction drives backdrive easily; high-ratio worm or highly-loaded strain-wave gears resist. Backdrivability matters for two reasons:

1. **Hand guiding feels natural** when the arm offers little resistance and the controller cancels gravity and friction.
2. **Contact is gentler** — a backdrivable joint can physically give way during the milliseconds before the controller even reacts, providing a layer of *intrinsic* (mechanical) compliance on top of the *active* (controlled) compliance.

Franka Emika built much of its reputation on exceptional backdrivability and torque control; that's why it became the research-and-fine-manipulation darling. Strain-wave gears aren't naturally very backdrivable, so torque sensing + active compensation does the heavy lifting.

### Lead-through (teach-by-demonstration)

Hand guiding's everyday payoff: free-drive mode. Press the button, the arm goes compliant and gravity-compensated, you physically drag the TCP through the path, recording waypoints as you go. Thirty seconds to teach a pick pose that would take minutes of jogging. It's a direct consequence of torque sensing + compliance + a released brake, and it's one of the genuine usability leaps cobots delivered.

## Programming cobots

If safety is the headline, **programming is the actual reason cobots conquered the SME market.** A traditional industrial robot needs a trained programmer and days of work; a cobot can be deployed by a process engineer in an afternoon. That's the disruption.
### Graphical / no-code teaching

Universal Robots' **PolyScope** pendant pioneered the model: a 3D-graphical, flowchart-style interface where you build a program by adding nodes (Move, Set, Wait, If, gripper actions) and teach waypoints by free-driving or jogging. No text. FANUC's **CRX tablet TP** uses drag-and-drop icon programming; **Techman's TMflow** is a visual node-graph; **Doosan's DART** and **ABB Wizard** (block-based, Scratch-like) follow the same philosophy. An operator who can use a smartphone can build a useful pick-and-place in an hour.

### Scripting underneath

The graphical layer sits on a real language. UR's is **URScript** — a Python-like scripting language you can write directly for anything the GUI can't express (custom math, socket comms, complex flow). Example of the readable, approachable style:

```python
# URScript: force-controlled insertion until 30 N reached, then settle
def insert_part():
  # move to approach pose above the hole
  movej(p[0.40, -0.20, 0.30, 0, 3.14, 0], a=1.0, v=0.25)

  # enable force mode: push down (Z) with up to 30 N,
  # stay compliant in X/Y so the part self-aligns
  force_mode(p[0, 0, 0, 0, 0, 0],                   # task frame = base (assumes floor mount, so -Z is down)
             [1, 1, 1, 0, 0, 0],                    # compliant axes: X,Y,Z
             [0, 0, -30, 0, 0, 0],                  # 30 N downward in Z
             2,                                     # type: simple force
             [0.05, 0.05, 0.15, 0.17, 0.17, 0.17])  # limits

  # wait until the measured TCP force magnitude reaches 30 N
  while force() < 30:
    sync()
  end
  end_force_mode()
  set_digital_out(0, True)  # signal "seated"
end
```

That snippet — zero-force compliance in X/Y, a commanded push in Z — is a textbook PFL-era assembly trick: let the part find the hole instead of demanding perfect position. It's only practical *because* of force sensing.

### The ecosystem: UR+ and friends

UR's second masterstroke was **UR+**, a certified-hardware-and-software marketplace: grippers, vision, screwdrivers, sensors, and "URCaps" plugins that drop into PolyScope as native nodes. Plug in a Robotiq gripper and a "Grip" node appears in your program — no driver wrangling.
FANUC, Techman, and Doosan all built analogous partner ecosystems. This ecosystem effect is a real moat: it's why UR's market share outlasted its technical lead.

### Offline, simulation, and ROS

For complex cells there's offline programming and digital twins (URSim, RoboDK, vendor sims) and, increasingly, **ROS / ROS 2 drivers** for research and advanced integration. Most production cobot work, though, still happens on the pendant — and that's a feature, not a limitation.

## Risk assessment & deployment

This is where good intentions meet legal and physical reality. Deploying a cobot collaboratively is an *engineering process with a paper trail*, not a purchase decision.

### The application is collaborative, not the robot (again)

Worth repeating because it's the whole game. The CE mark / conformity you produce is for the **robot system / cell**, under the Machinery Directive (2006/42/EC, replaced by Machinery Regulation (EU) 2023/1230 from January 2027) in Europe, or the relevant OSHA/ANSI/RIA framework in the US (**ANSI/RIA R15.06**, harmonized with ISO 10218; **RIA TR R15.606** mirrors ISO/TS 15066). The integrator owns this.

### The process: ISO 12100

The backbone is **ISO 12100** (risk assessment and risk reduction): identify hazards, estimate risk (severity × probability × exposure × avoidance), reduce by the hierarchy of controls (inherently safe design → safeguarding → information for use), then re-assess. For a cobot cell you enumerate every hazard — the moving arm, the end effector, the workpiece, the process, electrical, the surrounding equipment — and decide, per hazard, which collaboration mode or guard addresses it.

> **Safety rule:** The hierarchy of controls is ordered for a reason. *Eliminate* the hazard (round the corners, remove the pinch point) before you *guard* it (scanner, fence) before you *warn* about it (signage, training). PFL is an inherently-safer-design control for the *arm*; it does nothing for a hazardous *tool*.
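The enumerate-estimate-reduce loop described above maps naturally onto a small hazard register. The sketch below is illustrative only: the 1 to 4 ordinal scales, the class and field names, and the example hazards are assumptions made for this example, and ISO 12100 does not prescribe this particular scoring. The product form simply follows the severity × probability × exposure × avoidance wording used here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Control(IntEnum):
    """Hierarchy of controls; a lower value is the preferred measure."""
    INHERENT_DESIGN = 1      # e.g. round the corners, remove the pinch point
    SAFEGUARDING = 2         # e.g. scanner, fence
    INFORMATION_FOR_USE = 3  # e.g. signage, training


@dataclass
class Hazard:
    name: str
    severity: int     # 1 (minor) .. 4 (severe), assumed scale
    probability: int  # 1 (unlikely) .. 4 (likely), assumed scale
    exposure: int     # 1 (rare) .. 4 (continuous), assumed scale
    avoidance: int    # 1 (easy to avoid) .. 4 (hard to avoid), assumed scale
    control: Control  # strongest measure currently applied

    def score(self) -> int:
        # Simple product of the four ordinal factors.
        return self.severity * self.probability * self.exposure * self.avoidance


def rank(hazards):
    """Highest score first; ties go to the hazard with the weaker control."""
    return sorted(hazards, key=lambda h: (h.score(), h.control), reverse=True)


if __name__ == "__main__":
    register = [
        Hazard("arm transient contact", 2, 2, 3, 2, Control.INHERENT_DESIGN),
        Hazard("gripper finger pinch", 3, 2, 3, 3, Control.SAFEGUARDING),
        Hazard("sharp sheet-metal edge", 3, 3, 3, 3, Control.INFORMATION_FOR_USE),
    ]
    for h in rank(register):
        print(f"{h.score():4d}  {h.control.name:20s} {h.name}")
```

A register like this is only a bookkeeping aid. Its useful property is that a high-scoring hazard still relying on signage stands out immediately, which is the case the hierarchy of controls is meant to catch.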
### Speed throttling and zones A common, robust pattern: the arm runs **fast in a guarded transit zone** (SSM or fenced) and **slow in the collaborative workstation** (PFL), switching modes via safe-rated zone monitoring. You get throughput where no human is *and* collaboration where they are. Safety-rated speed and position limits (configured in the safety controller, validated, and locked) enforce the switch. ### The end effector and workpiece are hazards too This kills more "collaborative" deployments than anything else. The arm can be perfectly PFL-compliant while the application is not: - **The gripper:** pinch points between fingers, or a part-present sensor that doesn't stop closing on a finger. See [end effectors & grippers](/posts/end-effectors-grippers-ultimate-guide/) — collaborative grippers (Robotiq, OnRobot, Schunk Co-act) are explicitly designed with rounded jaws and force limits for exactly this reason. - **The workpiece:** sharp sheet-metal edges, hot parts, glass, anything with a small contact radius. ISO/TS 15066 limits assume blunt contact; a sharp edge blows the pressure limit at trivial force. - **The process:** a deburring spindle, a welding torch, a laser, a fluid jet — none of these are collaborative regardless of how the arm moves. ### Why most "cobots" run fenced at full speed Given all of the above, many integrators reach the rational conclusion: it's cheaper and faster to *guard* the cell (a small light curtain or scanner is inexpensive) and run the cobot fast than to do the full PFL validation, derate the speed, and re-validate every time the part changes. So they buy the cobot for the *programming and redeployability*, fence it, and run it at 1,500 mm/s. That's not a failure — it's often the correct engineering tradeoff. Just call it what it is. ## Real applications & ROI Where cobots actually earn their keep, with the honest economics. ### Machine tending The number-one cobot application. 
A CNC mill, lathe, injection molder, or press needs parts loaded and unloaded — dull, repetitive, sometimes ergonomically nasty work. A cobot on a cart rolls up to the machine, an operator teaches the load/unload in an hour, and it runs lights-out or frees an operator to tend three machines instead of one. Often deployed SRMS (robot stops when operator enters) or lightly guarded. **ROI is typically 6–18 months**, driven by labor reallocation and machine uptime, not headcount elimination. ### Palletizing The killer app for the new high-payload cobots (UR20/UR30, FANUC CRX-25iA, Doosan H-series). End-of-line palletizing is heavy, repetitive, injury-prone (lower-back claims are expensive). A 20–30 kg cobot with a vacuum or clamp gripper on a lift column stacks boxes all shift. Cobot palletizers from vendors like Robotiq, Premier Tech, and Columbia/Okura's cobot lines productized this. **ROI often under 12 months** where you're displacing a manual palletizing station with real injury risk. ### Assembly Screwdriving, press-fits, snap assembly, small-part insertion. This is where **force control earns its money** — compliant insertion, torque-verified screwdriving (UR+ screwdriving tools log torque per fastening for traceability). Genuinely collaborative side-by-side work shows up here: human does the dexterous bit, cobot does the repetitive fastening. ### Inspection & quality Cobot + camera or laser profilometer running a fixed inspection path: dimensional checks, surface inspection, reading gauges, taking measurements at stations a human can't reach repeatably. The cobot's modest repeatability (±0.05 mm) is plenty for most vision-based inspection, and free-drive makes path teaching trivial. ### Lab automation & life sciences A fast-growing segment: pipetting, plate handling, sample sorting in labs where the bench is shared with humans and space is tight. 
Cleanroom-rated cobots and the smaller arms (UR3e, Franka, ABB YuMi for dual-arm dexterous tasks) fit because they're compact, quiet, precise enough, and safe around lab staff. Throughput is modest but the value is 24/7 unattended runs and freed scientist time. ### The ROI honesty check Cobots rarely win on raw speed or cost-per-part against a dedicated fixed automation cell at high volume — a hard-tooled machine will out-throughput them every time. They win on **flexibility, fast deployment, low integration cost, and redeployability** at *low-to-medium volume and high mix*. If you run one product a million times, build fixed automation. If you run fifty products a few thousand times each and the mix changes quarterly, the cobot's redeployability is the entire business case. ## The 2026 cobot market & landscape The field in 2026 is mature, crowded, and segmenting. ### The vendors - **Universal Robots** (Teradyne-owned) remains the volume and ecosystem leader — the e-Series (UR3e/5e/10e/16e), plus the higher-payload **UR20 (20 kg)** and **UR30 (30 kg)**. Strength: ecosystem (UR+), maturity, resale, training base. Sensing: motor-current-based with refined estimation. - **FANUC** brings industrial pedigree to the **CRX** line (CRX-5iA/10iA/20iA/25iA), torque-sensing, famously smooth contact behavior, the green-and-white styling, and FANUC's legendary reliability and service network. The CRX-25iA pushed FANUC cobots into palletizing. - **Techman Robot** (Quanta-affiliated, Taiwan) differentiates with a **built-in vision system** and TMflow's flow-based programming — vision-native cobots for inspection and pick-and-place. - **Doosan Robotics** (Korea) offers one of the broadest ranges (A/M/H/E/P series), torque sensors in all six joints, and the heavy **H-series (up to 25 kg)**; aggressive on payload and price. 
- **KUKA LBR iiwa** is the 7-axis, torque-sensing-in-every-joint pioneer — the gold standard for sensitive, redundant-kinematics collaborative work, priced accordingly. The newer **LBR iisy** targets easier deployment. - **ABB** offers **GoFa** (CRB 15000, up to 12 kg, torque sensing) for single-arm collaborative work and the iconic dual-arm **YuMi** (IRB 14000) for small-parts dexterous assembly. - **Franka Emika** (now Franka Robotics) is the research/fine-manipulation favorite — exceptional torque sensing and backdrivability, link-side sensors, the natural platform for force-rich and learning-based manipulation. - Plus a long tail: Fanuc-adjacent and Chinese entrants (AUBO, JAKA, Han's, Elite, Dobot), Hanwha, Kassow (high-payload 7-axis), Rethink's legacy (Baxter/Sawyer, defunct but influential). ### Trends: higher payload, vision-native, easier still Three vectors define 2026: **payload climbing** (20–30 kg cobots are now normal, opening palletizing and heavy tending), **vision baked in** (Techman-style integrated vision, AI pick), and **deployment getting even easier** (AI-assisted programming, natural-language task setup creeping in). ### The humanoid overlap The most interesting 2026 dynamic: the **humanoid wave runs on cobot DNA.** A torque-controlled, backdrivable, force-limited joint is exactly the cobot actuator — humanoids just use more of them, arranged as legs and dual arms, with whole-body force control instead of a single arm's. The sensing (joint torque), control (impedance), and safety philosophy (force limiting around humans) are continuous from cobot to humanoid. Several cobot vendors and their suppliers are now also humanoid-actuator suppliers. If you understand cobot joints, you're 80% of the way to understanding a humanoid limb — see the [humanoid robot hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/). ## Selecting a cobot A disciplined selection sequence, then a real spec table. 
### Step 1: define the application and the safety case first

Before payload and reach, answer: *Will this run collaboratively (PFL/SSM/SRMS/HG), or guarded?* If guarded, you're choosing on speed/payload/price like an industrial arm and the "collaborative" features are just nice-to-haves. If truly collaborative, the gripper, workpiece, and acceptable speed constrain everything — settle those before the arm.

### Step 2: payload and reach with margin

**Payload** must include the end effector *and* the workpiece, with the gripper mass often eating 1–3 kg before you pick anything. Size with ~20–30% margin, and remember collaborative-mode speed *drops* as payload rises (the \(\sqrt{m_{eff}}\) effect). **Reach** must cover the worst-case point in the work envelope plus tool length — and check that the *useful* envelope (where the arm has good dexterity, not folded against a singularity) covers it.

### Step 3: sensing and force-control needs

Fine force-controlled assembly or research → torque-sensing cobot (Franka, iiwa, FANUC CRX, Doosan). Simple pick/place/tend → current-estimation is fine and cheaper (UR e-Series). Vision-heavy → Techman or add a vision system.

### Real-product comparison table

| Model | Payload | Reach | Repeatability | Sensing | Collab. TCP speed | Weight | Notes |
|---|---|---|---|---|---|---|---|
| **UR3e** | 3 kg | 500 mm | ±0.03 mm | Motor current | ~1 m/s | 11 kg | Tabletop, light assembly, lab |
| **UR5e** | 5 kg | 850 mm | ±0.03 mm | Motor current | ~1 m/s | 20.6 kg | The workhorse SME cobot |
| **UR10e** | 12.5 kg | 1300 mm | ±0.05 mm | Motor current | ~1 m/s | 33.5 kg | Tending, packaging, longer reach |
| **UR20** | 20 kg | 1750 mm | ±0.05 mm | Motor current | ~1 m/s (derated) | 64 kg | Palletizing, heavy tending |
| **UR30** | 30 kg | 1300 mm | ±0.05 mm | Motor current | (derated) | 63.5 kg | High payload, compact reach |
| **FANUC CRX-10iA** | 10 kg | 1249 mm | ±0.04 mm | Joint torque | ~1 m/s | 39 kg | Smooth contact, FANUC reliability |
| **FANUC CRX-25iA** | 25 kg | 1889 mm | ±0.04 mm | Joint torque | (derated) | ~95 kg | Palletizing-class cobot |
| **Techman TM12 / TM14** | 12 / 14 kg | 1300 / 1100 mm | ±0.1 mm | Joint torque | ~1.3 m/s | ~33 kg | Built-in vision system |
| **Doosan H2515** | 25 kg | 1500 mm | ±0.1 mm | 6× joint torque | ~1 m/s | 76 kg | Heavy-payload, torque in all joints |
| **Doosan M1013** | 10 kg | 1300 mm | ±0.05 mm | 6× joint torque | ~1 m/s | 33 kg | Versatile mid-range |
| **KUKA LBR iiwa 14** | 14 kg | 820 mm | ±0.10 mm | 7× joint torque | varies | 29.9 kg | 7-axis, sensitive assembly |
| **ABB GoFa CRB 15000** | 5–12 kg | 950–1620 mm | ±0.02–0.05 mm | Joint torque | ~1 m/s | 27–63 kg | Single-arm collaborative |
| **ABB YuMi IRB 14000** | 0.5 kg/arm | 559 mm | ±0.02 mm | Current + design | ~1.5 m/s | 38 kg | Dual-arm small-parts assembly |
| **Franka Research 3** | 3 kg | 855 mm | ±0.1 mm | 7× link torque | varies | 18 kg | Research, fine force manipulation |

(Numbers are nominal manufacturer figures for orientation; verify the exact variant against current datasheets — payloads, reaches, and especially collaborative speeds vary by model revision and safety configuration.)
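The Step 2 sizing rule and the \(\sqrt{m_{eff}}\) derating can be sketched together. This is a rough planning aid under stated assumptions, not a substitute for measurement: the function names are hypothetical, the 25% default margin is just the midpoint of the 20–30% range above, and the speed estimate reuses the simple relation v = F_max / sqrt(k · m_eff) from the guide's chest-contact worked example (280 N transient limit, 25 N/mm stiffness). Effective mass depends on arm configuration as well as payload, so the values fed in are placeholders.

```python
import math


def required_payload_kg(gripper_kg, workpiece_kg, margin=0.25):
    """Rated payload to shop for: tool plus part, plus a sizing margin."""
    return (gripper_kg + workpiece_kg) * (1.0 + margin)


def pfl_speed_limit_m_s(f_max_n, k_n_per_m, m_eff_kg):
    """Contact-speed estimate v = F_max / sqrt(k * m_eff)."""
    return f_max_n / math.sqrt(k_n_per_m * m_eff_kg)


if __name__ == "__main__":
    # 2 kg gripper + 6 kg part with the default 25% margin.
    print(f"shop for >= {required_payload_kg(2.0, 6.0):.1f} kg rated payload")

    # Chest contact: 280 N transient limit, 25 N/mm = 25,000 N/m.
    for m_eff in (4.0, 8.0, 16.0):
        v = pfl_speed_limit_m_s(280.0, 25_000.0, m_eff)
        print(f"m_eff = {m_eff:4.1f} kg -> about {v:.2f} m/s")
```

Each doubling of effective mass multiplies the estimated speed by 1/√2, which is the roughly 30% cut quoted earlier. Treat the output as a design target for choosing a payload class; the validated limit still comes from the force and pressure measurement.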
### Step 4: integration checklist - **Mounting:** floor, wall, ceiling, table, or cart. Confirm the arm supports the orientation and that the safety config accounts for gravity direction. - **Flange & EOAT:** ISO 9409-1 flange; tool I/O (digital, IO-Link, fieldbus); cable routing through the wrist if available. - **Controller & fieldbus:** the cell PLC integration (PROFINET/PROFISAFE, EtherCAT/FSoE, Ethernet/IP CIP Safety) for safe signals and process I/O. - **Safety devices:** scanners/curtains for SRMS/SSM; the measurement plan for PFL validation. - **Ecosystem:** is the gripper/vision/tool a certified plug-in (UR+, FANUC partner, etc.) or a custom integration? > **Safety rule:** Lock and document the safety configuration (speed/force/position limits) and treat any change as a re-validation trigger. The single most common audit failure is a cell whose installed safety limits no longer match the validated risk assessment because someone "just bumped the speed." ## Frequently asked questions **Is a cobot inherently safe to work next to?** No. A cobot is *capable* of safe collaborative operation under the right risk assessment, mode, speed, payload, and end effector. The arm out of the box is collaborative-*capable*, not safe-by-default. Speed, a hazardous tool, a sharp workpiece, or a pinch point against a fixture can each make a cobot cell unsafe. Safety is a property of the validated application, not the robot. **What's the difference between ISO 10218 and ISO/TS 15066?** ISO 10218 (parts 1 and 2) is the normative safety standard for industrial robots and robot systems, including collaborative operation — it defines the four collaboration modes. ISO/TS 15066 is a *technical specification* that supplements it with the detailed guidance and, crucially, the **biomechanical force/pressure limit tables** for power & force limiting. The 2025 revision of ISO 10218 absorbed much of TS 15066's content into the main standards. 
In the US, ANSI/RIA R15.06 and RIA TR R15.606 are the harmonized equivalents.

**What are the four collaboration modes again?** Safety-rated monitored stop (robot stops when human enters), hand guiding (operator moves the arm via a safe guiding device), speed and separation monitoring (robot speed scaled to distance, full stop below a protective separation distance), and power & force limiting (contact forces kept below injury thresholds so a moving robot may touch a human). Only PFL permits contact with a moving robot.

**Do cobots really need no fencing?** Sometimes. A properly risk-assessed PFL application — low speed, blunt geometry, no pinch points, safe tool and workpiece — can run fence-free. But many cobot cells need *some* safeguarding (a scanner for SSM/SRMS, a guard around a hazardous tool), and many integrators deliberately fence and run fast. "No fencing" is an outcome you earn with a risk assessment, not a guarantee you buy.

**How fast can a cobot move in collaborative mode?** In PFL, typically 250–1,000 mm/s TCP, derated as payload and effective mass rise, because contact force scales with speed and the square root of effective mass. Run guarded (not in contact-permitted mode), the same arm can hit 1,000–2,000 mm/s. Heavier payloads force lower collaborative speeds — that's physics, not a marketing limitation.

**Joint torque sensors vs motor-current sensing — which should I care about?** For simple pick-and-place and machine tending, motor-current estimation (e.g. UR e-Series) is fine and cheaper. For delicate force-controlled assembly, polishing, or research, joint torque sensors (Franka, KUKA iiwa, FANUC CRX, Doosan) give far cleaner low-speed contact detection and true impedance control. Torque sensing costs more but unlocks applications current-estimation cobots struggle with.
**What's the difference between transient and quasi-static contact limits?** Quasi-static (clamping/trapping, force sustained against a fixed surface) limits are the strict ones — e.g. ~130 N at the skull. Transient (free impact, body free to recoil, brief contact) limits are roughly double. If a body part can be trapped, you must use the quasi-static limit. Designing out pinch points lets more of your trajectory qualify as transient and run faster. **How do I actually validate a PFL application?** Physically measure it. Drive the robot into a calibrated biofidelic force/pressure measurement device (a load cell on a body-region-matched spring) at the worst-case point and configuration, read peak force, and verify it's under the ISO/TS 15066 limit. Separately, use pressure-indicating film (Fujifilm Prescale) to check local pressure over the contact patch — sharp edges blow the pressure limit even when total force is fine. Calculation is a design target; measurement is the proof. **Can I put any gripper on a cobot and stay collaborative?** No. The end effector is part of the safety case. Pinch points between fingers, sharp jaws, or a gripper that doesn't force-limit its closing can each violate PFL even if the arm is compliant. Use collaborative-rated grippers (Robotiq, OnRobot, Schunk Co-act) with rounded geometry and force limits, and include the workpiece in the assessment — see the [grippers guide](/posts/end-effectors-grippers-ultimate-guide/). **Are higher-payload cobots (20–30 kg) real, or marketing?** Real and useful — the UR20/UR30, FANUC CRX-25iA, and Doosan H-series genuinely opened palletizing and heavier machine tending to cobots. The honest caveat: at high payload they run collaborative work *slowly* (the effective-mass speed limit) and often run guarded at full speed for throughput. The value is still flexibility and easy deployment, not collaborative speed. **How are cobots related to humanoid robots?** Closely. 
The humanoid joint is the cobot actuator — frameless BLDC + strain-wave (or planetary/cycloidal) gearbox + torque sensing + impedance control — just used in larger numbers and arranged for legs and dual arms with whole-body force control. The sensing and safety philosophy carry straight over. Understanding cobot joints is most of the way to understanding humanoid limbs; see the [humanoid hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/). **What's the realistic ROI and payback on a cobot?** For machine tending and palletizing displacing manual, injury-prone work, payback is commonly 6–18 months, driven by labor reallocation, machine uptime, and reduced injury claims — not headcount elimination. Cobots lose to fixed automation at high volume/low mix and win on flexibility, low integration cost, and redeployability at low-to-medium volume and high mix. Match the tool to the volume-mix profile. ## Changelog - **2026-05-23** — Initial publication. --- # Humanoid Robot Hardware: The Ultimate Guide URL: https://blog.robo2u.com/posts/humanoid-robot-hardware-ultimate-guide/ Published: 2026-05-21 Updated: 2026-06-20 Tags: humanoid-robots, tesla-optimus, figure, unitree, actuators, degrees-of-freedom, bipedal-locomotion, embodied-ai, robotics-hardware, guide Reading time: 38 min > An engineer's teardown of 2026 humanoid robot hardware — actuators, hands, legs, sensing, power, compute — with real DoF, mass, torque, and cost numbers, plus an honest read on teleop demos. A humanoid robot is the hardest commodity in robotics: a bipedal, two-armed, dexterous machine that has to balance, walk, manipulate, perceive, and think — all inside a power and mass budget roughly the size of a person. Every subsystem fights every other one. Make the actuators stronger and you add mass, which needs stronger actuators. Add battery for runtime and you add mass, which cuts runtime. The whole discipline is an exercise in not losing that fight too badly. 
This guide is the long version, subsystem by subsystem: the 2026 roster and what's actually shipping, degrees of freedom and how they're spent, the actuator problem (which is *the* problem), hands, legs, sensing, power, compute, and the uncomfortable truth about teleoperation. Real numbers with units, real robots, and opinions with reasons. The goal is that you finish able to look at a humanoid spec sheet — or a glossy launch video — and know what's real, what's marketing, and what's quietly being left out. **The take**: In 2026, humanoid *hardware* is far ahead of humanoid *autonomy*. The bodies can walk, balance, and grasp; the actuators are good enough; the bill of materials is on a credible path to under $50k. What is not solved is letting the robot decide what to do on its own in an unstructured environment. A large fraction of the impressive "autonomous" manipulation demos you have seen are teleoperated, or are narrow policies trained on exactly that scene. Read every demo with that prior. The bottleneck is not motors anymore — it's the software stack and the data to train it. Companion reading: [robot actuators](/posts/robot-actuators-ultimate-guide/), [brushless DC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [gearboxes (harmonic & cycloidal)](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), and [legged & quadruped robot hardware](/posts/legged-quadruped-robot-hardware-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why humanoids now](#why-now) 3. [The 2026 humanoid roster](#roster) 4. [Degrees of freedom & kinematics](#dof) 5. [The actuator problem](#actuators) 6. [Hands & manipulation hardware](#hands) 7. [Bipedal locomotion hardware](#legs) 8. [The sensing suite](#sensing) 9. [Power & thermal](#power) 10. [Onboard compute](#compute) 11. [The teleoperation reality](#teleop) 12. [Manufacturing & cost](#cost) 13. [The 2026→2027 outlook](#outlook) 14. 
[Frequently asked questions](#faq) ## Key takeaways - A 2026 humanoid is typically **1.5–1.8 m tall, 35–80 kg, with 28–60 actuated degrees of freedom**, a 1–5 hour battery, and a payload of 5–25 kg. The spread is wide because the field hasn't converged on a design point. - The **form-factor argument** is the whole thesis: the world is built for human bodies — stairs, door handles, shelves, vehicles — so a human-shaped robot is a general-purpose adapter to existing infrastructure. That's the bet. It is not obviously correct for any single task, only for *generality*. - The recent unlock is software, not hardware: **LLMs and vision-language-action (VLA) models** gave a plausible path to general behavior. The body has been buildable for years; the brain wasn't. - **The actuator is the central hardware problem.** Torque density, efficiency, backdrivability, and thermal limits set what the robot can do. The live debate is rotary quasi-direct-drive (QDD) vs. linear ball-screw actuators; Optimus famously uses a deliberate *mix* of both. - **Hands are the hardest sub-problem and the worst ROI per dollar.** A genuinely dexterous hand can carry 11–20+ DoF, tendon or linkage drives, and tactile sensing, and can cost as much as the rest of the arm. Most shipping humanoids run simplified hands. - **Walking is "solved"; robust walking is not.** Flat-floor bipedal locomotion is a demo. Walking on debris, slopes, and stairs while carrying a load and being shoved is still hard and still where robots fall. - **Runtime is a real constraint.** Most humanoids draw a few hundred watts standing and 1–3 kW under load, giving 1–5 hours from a ~1–2.3 kWh pack. Continuous 24/7 operation means battery swaps or tethers, not magic. - **Onboard compute is split** between a real-time control layer (kHz joint loops on an MCU/SoC) and an AI inference layer (a GPU/SoC like Jetson Thor or custom silicon running VLA models at much lower rates). 
- **Teleoperation is everywhere** — both as the honest way to collect manipulation training data and as the dishonest way to fake autonomy in a launch video. Learn to tell them apart (see [The teleoperation reality](#teleop)). - The **path to <$50k** runs through actuators and hands, which dominate the bill of materials. Volume, vertical integration, and design-for-manufacture (DfM) are the levers; exotic materials are not. - 2026 reality: **bodies are good, brains are immature, data is the moat.** Expect strong progress in structured commercial settings (warehouses, fixed manufacturing cells) and slow progress in the open-ended home. ## Why humanoids now The question is not "can we build a human-shaped robot" — we have been able to for decades, going back to Honda's P2 in 1996 and ASIMO in 2000. The question is "why is everyone building them *now*, with serious money." Three things changed. ### The form-factor argument The world is full of infrastructure designed for a 1.7 m bipedal primate with two five-fingered hands: 0.7–0.9 m countertops, 0.8 m door openings, stair risers around 0.18 m, steering wheels and pedals, tools with handles sized for a human grip. A wheeled arm can't climb the stairs; a fixed cell can't move to the work. A humanoid is a general-purpose physical adapter to all of that without re-engineering the environment. > **Rule of thumb:** The humanoid form is rarely the *optimal* shape for any single task. A wheeled base beats legs on a flat warehouse floor; a fixed gantry beats an arm for repetitive pick-place. The humanoid bet is that one body that can do *everything passably* beats ten special-purpose machines — because deployment, retraining, and capital flexibility dominate at scale. That's a real argument and also a convenient one for raising capital. Be honest about which half is talking. ### The software unlock The body was buildable in 2010. What wasn't buildable was a controller that could *decide what to do*. 
Classical robotics scripted every motion; that doesn't generalize to "tidy this room." Two developments cracked the ceiling: - **Large language / multimodal models** that can take a goal in natural language and produce a plan, and can ground that plan in what a camera sees. - **Vision-language-action (VLA) models** — policies that map pixels and a language goal directly to motor commands, trained on large demonstration datasets. This is the architecture behind most 2026 manipulation work (Figure's Helix, Physical Intelligence's π-series, Google's RT-2 lineage, NVIDIA's GR00T). Suddenly a humanoid had a plausible path to general behavior. That's why the money showed up. ### The honest state Here's the part the videos don't say out loud. The hardware is *capable* — a 2026 humanoid can physically perform almost any single human task you'd show in a demo. The autonomy is *immature* — letting it choose and chain those tasks reliably in an environment it hasn't been trained on is unsolved. > **The honest take:** We have working bodies and toddler brains. Progress in 2026–2027 is gated by data and learning algorithms, not by torque density or DoF. Anyone selling you "the hardware is the hard part, and we've cracked it" is half right and using it to skip the half they haven't. This guide is about the hardware. Just don't confuse a great body for a finished product. ## The 2026 humanoid roster The field is crowded. Below is the serious tier as of mid-2026. Numbers are best-available public figures; vendors disclose selectively and "spec" often means "target" or "best demo unit," so treat anything to two significant figures as approximate and anything about price as aspirational. 
| Robot | Height | Mass | DoF (approx) | Payload | Runtime | Price target | Actuation notable | |---|---|---|---|---|---|---|---| | **Tesla Optimus (Gen 2/3)** | ~1.73 m | ~57–73 kg | ~28 body + ~11–22/hand | ~9 kg (claimed ~20 kg) | ~2–5 hr | <$20–30k (target) | Mixed rotary + linear; in-house actuators | | **Figure 02 / 03** | ~1.68 m | ~60–70 kg | ~30+ body | ~20 kg | ~4–5 hr | undisclosed | In-house actuators; Helix VLA | | **1X Neo** | ~1.65 m | ~30 kg | ~30+ | small | ~2–4 hr | ~$20k / subscription | Tendon-driven, deliberately low-force/soft | | **Boston Dynamics Atlas (electric)** | ~1.75–1.9 m | ~90 kg | ~56 (incl. hands) | ~30 kg sustained | ~4 hr | not for sale | All-electric custom actuators; extreme range of motion (360° hip/waist/neck joints) | | **Unitree H1** | ~1.8 m | ~47 kg | ~19 (no hands) | ~30 kg rated | ~2 hr | ~$90k+ | QDD joint motors; fast walker/runner | | **Unitree G1** | ~1.27 m | ~35 kg | ~23–43 | small | ~2 hr | ~$16k+ | QDD; aggressively cheap | | **Apptronik Apollo** | ~1.73 m | ~73 kg | ~28+ | ~25 kg | ~4 hr (swap pack) | ~$50k (target) | Linear actuators, modular, hot-swap battery | | **Agility Digit** | ~1.75 m | ~65 kg | ~16–20 | ~16 kg | ~2–4 hr | lease/RaaS | Bird-like legs (rearward knee), warehouse-tuned | | **Sanctuary Phoenix** | ~1.7 m | ~70 kg | ~20+ (rich hands) | ~25 kg | undisclosed | undisclosed | Hydraulic-ish high-DoF hands, teleop data focus | A few honest observations: - **DoF counts are slippery.** Some vendors count hand joints, some don't; some count coupled tendon joints as one DoF, some as several. A "43-DoF G1" and a "19-DoF H1" are not as far apart as they sound once you normalize for hands. - **Mass spans ~30–90 kg.** 1X Neo at ~30 kg made a deliberate choice to be light and weak (safer around people, tendon-driven, lower torque); Atlas electric at ~90 kg made the opposite choice (force and range of motion for spectacular dynamics). Both are defensible; they're solving different problems. 
- **Price targets are mostly fiction until volume.** Unitree G1's ~$16k is real and shipping (it's a research/education platform, not a labor robot). Optimus's "<$20–30k at scale" is a manufacturing thesis, not a 2026 price. - **Agility Digit** is the outlier worth respecting: it deliberately *isn't* anthropomorphic in the legs (reversed knees, like an ostrich) and is the furthest along in real paid warehouse deployments precisely because it picked a narrow, structured job. > **The honest take:** The most commercially advanced humanoid in 2026 is the least "general." Digit makes money moving totes in warehouses because the task is bounded. The robots with the flashiest home demos make the least money. That ordering tells you where the technology actually is. ## Degrees of freedom & kinematics Degrees of freedom (DoF) are the independently actuated joints — the count that sets how many ways the robot can move. A human has roughly 230 DoF if you count everything including the spine and each finger joint; a humanoid robot dramatically simplifies that. For motion planning across all those joints, see the [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/). ### Typical DoF budget A capable 2026 humanoid lands around **28–60 actuated DoF**. Here's a representative split for a ~30-DoF body (hands counted separately, which is the honest way to do it): ```text DoF accounting — representative ~30-DoF humanoid (excl. 
hands) Each leg: 6 DoF ×2 = 12 (hip 3, knee 1, ankle 2) Each arm: 7 DoF ×2 = 14 (shoulder 3, elbow 1, wrist 3) Torso/waist: 1–3 DoF (yaw, sometimes pitch/roll) Neck/head: 2–3 DoF (pan, tilt, sometimes roll) ---------- Body total: ~28–32 DoF Hands (optional): 6–22 DoF each — often DOUBLES the whole count ``` The structure is near-universal because it mirrors human kinematics: - **6 DoF per leg** is the minimum for placing the foot at an arbitrary position *and* orientation in space — 3 at the hip, 1 at the knee, 2 at the ankle (pitch + roll). Drop the ankle roll and you lose the ability to keep the foot flat on uneven ground. - **7 DoF per arm** gives a redundant arm: 6 DoF reach any pose, the 7th lets the elbow swing without moving the hand (reconfiguration around obstacles). Cheaper humanoids use 6 DoF arms and accept the loss. - **Torso yaw** matters more than people expect — it dramatically extends reach and lets the robot twist to place a load without stepping. ### Why not more DoF? Every DoF is an actuator: a motor, a gearbox, a driver, an encoder, wiring, mass, cost, and a failure point. The marginal DoF has to earn its place. This is why hands are contentious — going from a 6-DoF gripper-hand to a 22-DoF anthropomorphic hand can add more actuators than the entire rest of the arm, for capability you can't yet reliably control. > **Rule of thumb:** Count DoF *excluding hands* when comparing locomotion-and-reach capability, and count hands separately. A vendor quoting "40+ DoF" is almost always front-loading finger joints to inflate the headline. ## The actuator problem If you remember one thing from this guide: **the actuator is the hardware problem.** Not sensors, not compute — those ride Moore's-law-adjacent curves and are largely commoditized. The actuator is where physics pushes back hardest, and it's the single biggest cost, mass, and capability driver in the machine. 
Start with the [robot actuators guide](/posts/robot-actuators-ultimate-guide/), the [BLDC motors guide](/posts/brushless-dc-motors-bldc-ultimate-guide/), and the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/) for the fundamentals; here's how they specialize for humanoids. ### What a humanoid actuator must do A humanoid joint actuator has a brutal spec: high peak torque (to lift, to catch a fall), high torque density (because mass at the joint is mass the robot must also carry and accelerate), backdrivability and force control (for safe contact and balance), high bandwidth (to react to disturbances in milliseconds), and decent efficiency (so the battery lasts). No single technology nails all of these, which is why the field is split. ```text Torque density — the figure of merit τ/m = joint torque per actuator mass [N·m / kg] A good 2026 humanoid hip/knee actuator: peak torque ~150–360 N·m, mass ~1.5–4 kg → ~60–120 N·m/kg peak, ~20–50 N·m/kg continuous Thermal, not torque, is usually the real ceiling: continuous τ is limited by I²R heating in the windings, peak τ is limited by demagnetization and structure. You can hit peak for ~seconds; continuous is what you live on. ``` ### The rotary QDD camp **Quasi-direct-drive (QDD)** uses a high-torque BLDC motor with a *low* single-stage gear ratio (typically 6:1 to 10:1). The low ratio means low reflected inertia and friction, which gives you backdrivability and clean proprioceptive force estimation from motor current — no force sensor needed. This is the MIT Cheetah lineage and is what makes Unitree's quadrupeds and humanoids so dynamic. - **Pros:** transparent, backdrivable, great for impacts and balance, force control "for free," mechanically simple, robust. - **Cons:** low ratio means you need a *big* motor for high torque, which is heavy and draws a lot of current to hold a static load (no mechanical advantage to lean on). Holding a heavy arm extended is thermally expensive. 
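
To put rough numbers on that last con, here is a minimal sketch of the static-hold arithmetic for a QDD joint. Every parameter (payload, lever arm, torque constant, winding resistance) is a hypothetical placeholder, not a figure from any robot in this guide. The model is deliberately crude: it assumes an ideal gearbox, ignores the arm's own mass and friction, and lumps the motor into a single torque constant and resistance.

```python
G = 9.81  # gravitational acceleration, m/s^2


def static_hold(payload_kg, lever_arm_m, gear_ratio, kt_nm_per_a, r_ohm):
    """Return (joint_torque, motor_current, copper_loss) for holding a load still.

    Simplified model: ideal gearbox, motor torque = Kt * I, heat = I^2 * R.
    """
    joint_torque = payload_kg * G * lever_arm_m   # N·m needed at the joint
    motor_torque = joint_torque / gear_ratio      # N·m needed at the motor shaft
    current = motor_torque / kt_nm_per_a          # A through the windings
    copper_loss = current ** 2 * r_ohm            # W of heat, with zero mechanical work done
    return joint_torque, current, copper_loss


# Hypothetical case: 5 kg held 0.6 m from the joint,
# motor with Kt = 0.1 N·m/A and 0.1 ohm winding resistance.
for ratio in (6, 10):
    tau, amps, watts = static_hold(5.0, 0.6, ratio, 0.1, 0.1)
    print(f"{ratio}:1  joint {tau:.1f} N·m  current {amps:.1f} A  heat {watts:.0f} W")
```

With these made-up values the 6:1 joint dissipates roughly 240 W just to stay still, and the 10:1 joint roughly 87 W. The point is the scaling rather than the numbers: holding current falls as 1/ratio, so holding heat falls as 1/ratio², which is why a low-ratio joint pays so heavily for static loads.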
### The linear ball-screw camp A **linear actuator** — a BLDC motor driving a ball-screw or roller-screw, pushing a rod that levers the joint — trades transparency for efficiency at high static loads. The screw provides huge mechanical advantage, so holding a load draws little current, and the package can be compact and very high-force. - **Pros:** excellent force density, efficient at holding static loads, compact, naturally high stiffness. - **Cons:** poor backdrivability (the screw resists being driven backward), so force control needs a load cell; the screw and its bearings wear; impact loads go straight into the screw nut. ### Optimus's deliberate mix Tesla's Optimus is the cleanest public example of refusing to pick a side. It reportedly uses **both** — rotary actuators where backdrivability and range of motion matter, and **linear actuators** where high static force in a compact envelope matters (notably knees and other high-load joints). Tesla designed its actuators in-house specifically to optimize this mix per-joint, which is a manufacturing and integration bet as much as a control one. | Approach | Torque/force density | Backdrivable | Static-hold efficiency | Force sensing | Best joints | |---|---|---|---|---|---| | **Rotary QDD** (BLDC + 6–10:1) | High (rotary) | Yes (good) | Poor (current-hungry) | From motor current | Hips, shoulders, ankles, dynamic joints | | **Rotary high-ratio** (harmonic) | High, compact | No | Good | Needs torque sensor | Wrists, neck, low-speed precision joints | | **Linear ball/roller-screw** | Very high (force) | No (poor) | Excellent | Needs load cell | Knees, high-load lever joints | | **Series-elastic (SEA)** | Moderate | Yes | Moderate | From spring deflection | Legs/ankles where impact tolerance matters | > **The honest take:** There is no universal winner. 
The right answer is per-joint: QDD where you need to feel the world and survive impacts, screws where you need to hold a heavy static load efficiently, harmonic drives where you need compact precision at low speed. A vendor that uses one technology everywhere has optimized for manufacturing simplicity, not performance. ### The thermal trap The most common field failure mode isn't a torque limit — it's heat. Continuous torque is capped by I²R losses heating the windings; exceed it and you cook the motor or trip thermal derating. A humanoid holding a 5 kg object at arm's length can be drawing near-continuous-limit current with the arm *not moving at all*. This is why static poses, not dynamic motion, often dominate the thermal budget, and why screw drives (which hold cheaply) are attractive for load-bearing joints. ## Hands & manipulation hardware The hand is where humanoids go to die. It is simultaneously the highest-value subsystem (manipulation is the point) and the hardest, most expensive, least mature one. See the [end-effectors & grippers guide](/posts/end-effectors-grippers-ultimate-guide/) and the [robot sensors guide](/posts/robot-sensors-ultimate-guide/) for the broader landscape; here's the humanoid-specific picture. ### Why hands are so hard A human hand has ~27 DoF, dozens of muscles, thousands of mechanoreceptors, and a control system tuned over a lifetime. It does fine force control, in-hand manipulation, and tactile inference simultaneously. Replicating even a fraction of that inside a ~0.5 kg package the size of a real hand, while routing actuation and sensing, is genuinely at the frontier. The tradeoffs stack against you: more fingers and joints mean more actuators (and you can't fit motors in the fingers — they're too small), so you move actuation to the forearm and transmit it down. Both transmission methods have costs. ### Tendon vs. 
linkage drives - **Tendon-driven** (cables routed over pulleys, motors in the forearm) — this is how human hands work and how most high-DoF robot hands work (Shadow Hand, many research hands, 1X Neo). Pros: compact fingers, biomimetic, can be lightweight and compliant. Cons: cables stretch, fray, and need tensioning; friction and routing make precise force control hard; maintenance is a recurring cost. - **Linkage-driven** (rigid four-bar and gear linkages) — motors drive mechanical linkages directly. Pros: stiff, precise, durable, no cable maintenance. Cons: bulkier, fewer independent DoF for the volume, less compliant. Most production humanoid hands deliberately keep the DoF count low — a **6-DoF hand** (one actuator per finger plus a thumb opposition) covers a huge fraction of grasps at a fraction of the cost and control burden of a 16–22-DoF hand. The capability-per-dollar curve is brutally diminishing past simple grasping. ### Tactile sensing Vision alone cannot tell you grip force, slip, or contact location when the hand occludes the object. Tactile sensing is essential for dexterous manipulation and is itself immature: - **Force/torque at the wrist** — cheap, coarse, common. - **Fingertip force sensors** — strain gauges or barometric/MEMS sensors per fingertip. - **High-resolution optical tactile** (GelSight-style, where a camera images a deformable gel) — rich contact geometry and slip detection, but bulky and adds a camera per fingertip. 
### Cost reality | Subsystem | Rough share of a humanoid BoM | Why | |---|---|---| | **Two dexterous hands** | 15–30% | High DoF, tiny precision actuators, tactile sensing, low-volume | | Leg actuators (×2 legs) | 20–30% | High-torque motors + gearboxes/screws, the most mass | | Arm actuators (×2 arms) | 10–20% | 7 DoF each, moderate torque | | Battery pack | 5–10% | Cells + BMS + thermal | | Compute | 5–10% | AI SoC/GPU + RT controller | | Sensors (cameras/IMU/F-T) | 5–10% | Mostly commoditized | | Structure/skin/wiring/assembly | 15–25% | Frame, covers, harness, labor | > **The honest take:** A pair of genuinely dexterous hands can cost as much as both legs. That's why almost every shipping humanoid runs simplified hands and saves the 20-DoF marvel for the demo reel. If a robot is doing real work in 2026, look at its hands — they're probably grippers wearing finger-shaped covers. ## Bipedal locomotion hardware Bipedal walking is the canonical humanoid party trick, and it is both more solved and less solved than it looks. For the broader legged landscape and where quadrupeds win, see the [legged & quadruped robot hardware guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/). ### The leg A humanoid leg is typically **6 DoF**: 3 at the hip (yaw, roll, pitch), 1 at the knee (pitch), 2 at the ankle (pitch, roll). The hip and knee carry the highest torque demands — a knee actuator on a 70 kg robot may need **150–360 N·m peak** to stand up from a squat or absorb a landing. This is exactly where linear screw actuators earn their place: high static-hold force, efficiently. The **ankle** is special. Two DoF (pitch + roll) let the foot stay flat on uneven ground and let the robot shift its center of pressure within the foot — the primary fine balance authority. Some designs put the ankle actuators up near the knee and use linkages to keep distal mass (and thus leg inertia) low, which improves swing dynamics. 
Distal mass is the enemy: every kg at the ankle is a kg the hip must accelerate every step. ### Why "solved" walking isn't robust walking Flat-floor walking with known geometry is a controls exercise that's been demonstrated for years. **Robust** walking — over debris, slopes, stairs, soft ground, while carrying a variable load and being shoved by a person — is where humanoids still fall. The hardware needs: - **Fast, backdrivable joints** to react to disturbances within milliseconds (QDD or SEA help here). - **Good foot force sensing** to know when and how hard each foot contacts. - **Whole-body control (WBC)** running at high rate to coordinate all ~28 joints to keep the center of mass over a viable support region. ### ZMP, WBC, and what the hardware must enable Classical bipeds used the **Zero Moment Point (ZMP)** criterion — keep the point where ground-reaction forces produce no horizontal moment inside the support polygon (the foot, or the convex hull of both feet). ZMP gives the flat-footed, knees-bent, slightly robotic gait of older humanoids. It's reliable and conservative. Modern dynamic humanoids use **whole-body control** and **model-predictive control (MPC)**, treating the whole robot as a coupled dynamic system and planning ground-reaction forces over a short horizon. This allows toe-off, heel-strike, running, and recovery from large pushes — but it demands hardware that classical methods didn't: torque-controllable joints (not just position), fast force sensing, and the real-time compute to solve the optimization at 100–1000 Hz. See the [real-time control systems guide](/posts/real-time-control-systems-ultimate-guide/) for why that timing budget is unforgiving. > **Rule of thumb:** If a humanoid walks flat-footed with permanently bent knees, it's running a conservative ZMP-style controller. If it heel-strikes, toes-off, and recovers from a shove, it's running torque-level WBC/MPC — and its joints can do force control. 
The gait tells you the control stack. ## The sensing suite A humanoid's sensing needs split into two jobs: **proprioception** (knowing its own body state, for balance and control) and **exteroception** (perceiving the world, for navigation and manipulation). For the full taxonomy see the [robot sensors guide](/posts/robot-sensors-ultimate-guide/) and, for the cameras specifically, the [LiDAR & depth cameras guide](/posts/lidar-depth-cameras-ultimate-guide/). ### Proprioception (the fast, essential layer) - **Joint position encoders** — one per joint, usually magnetic absolute encoders, feeding the kHz control loop. Non-negotiable. - **Joint torque sensing** — either dedicated torque sensors (harmonic-drive joints) or estimated from motor current (QDD joints). This is what enables force control and compliance. - **IMU(s)** — a 6- or 9-axis inertial measurement unit (often in the torso/pelvis) gives body orientation and angular rate, the backbone of balance. High-end designs run multiple IMUs for redundancy and to estimate limb states. - **Foot force / contact sensors** — load cells or pressure arrays in the soles to detect contact timing and force distribution. Critical for walking; surprisingly often skimped on. ### Exteroception (the slow, AI-facing layer) - **RGB cameras** — multiple, for the VLA model's eyes. Figure and Tesla lean heavily on cameras over LiDAR (the Tesla "vision-first" philosophy carried over). - **Depth** — stereo cameras or structured-light/ToF depth in the head and sometimes chest, for obstacle and object geometry. Some humanoids add a head LiDAR for mapping; many skip it to save mass and cost. - **Hand/wrist cameras** — close-range cameras for manipulation, since the head camera is occluded by the robot's own arms during a grasp. 
```text Sensing rate budget (representative) Joint encoders / IMU: 1–10 kHz → real-time control loop Foot force / joint torque: 1 kHz → balance / WBC Depth cameras: 30–90 Hz → perception / mapping RGB to VLA model: 1–30 Hz → high-level policy The control loop is ~1000× faster than the "thinking" loop. That split is the whole architecture of the machine. ``` > **The honest take:** Proprioception is mature and cheap; you can buy excellent encoders and IMUs. The hard, expensive, immature sensing is *tactile* (covered with hands) and the *fusion* of vision into reliable action. Adding more cameras is easy; making the robot reliably understand what it sees is not. ## Power & thermal Runtime is the constraint that the launch videos quietly omit. A humanoid is a power-hungry machine carrying its own battery, and the physics is unforgiving. See the [robot power & batteries guide](/posts/robot-power-batteries-ultimate-guide/) for the cell-level detail. ### The numbers ```text Power budget — representative 60–70 kg humanoid Standing / idle (holding pose): ~150–500 W Walking (no load): ~500–1500 W Manipulation under load / lifting: ~1–3 kW peak Compute (AI SoC + controllers): ~100–500 W (constant!) Battery pack: ~1.0–2.3 kWh → Runtime: ~1–5 hr depending on duty cycle Energetics check: 2 kWh pack / 600 W average draw ≈ 3.3 hr 2 kWh pack / 1500 W heavy work ≈ 1.3 hr ``` Two things stand out. First, **compute is a constant tax** — a few hundred watts that never stops, even standing still, which is why an idle humanoid still drains. Second, **standing is not free**: holding a pose draws real current in QDD joints (the thermal trap again), so even "doing nothing" costs watts. Atlas-class robots doing dynamic motion can spike to several kW. ### Why runtime is hard to fix You can't just add battery — every kWh of lithium-ion is ~5–7 kg of mass the robot must then carry and accelerate, which raises every actuator's load, which raises power draw. 
There's a point of diminishing returns around 2–2.5 kWh for a human-sized robot. The practical answers are: - **Hot-swappable packs** (Apptronik Apollo's approach) — a human or a dock swaps a fresh pack in under a minute, so the robot's *duty cycle* approaches 24/7 even if a single charge is ~4 hr. - **Opportunity charging / docking** — the robot returns to a charger between tasks. - **Tethering** — viable for fixed industrial cells, useless for mobile work. ### Thermal management Beyond batteries, the actuators and compute generate heat that must go somewhere. Most 2026 humanoids use a mix of passive conduction through the structure, forced-air fans, and (increasingly) liquid cooling loops for the highest-power leg actuators and the AI compute. Thermal derating — the controller throttling torque to protect a hot motor — is a real and under-discussed limit on sustained work. > **The honest take:** "It walked for the whole demo" usually means ~1–4 hours of mixed activity, not a shift. Anyone promising all-day continuous operation from a single charge in a human-sized package is fighting energy density, and energy density isn't improving fast enough to win that fight in 2026. The realistic model is swap-and-charge, not run-forever. ## Onboard compute A humanoid runs two fundamentally different computers, often physically separate, because their requirements conflict. See the [real-time control systems guide](/posts/real-time-control-systems-ultimate-guide/) for why you cannot run both jobs on one stack. ### The split - **Real-time control layer** — runs the joint loops, balance, and whole-body control at **1–10 kHz** with hard deadlines. A missed deadline can mean a fall. This runs on microcontrollers (per-joint) and a central real-time SoC or RTOS host, deterministically. It does *not* run a general-purpose OS for the critical path. - **AI inference layer** — runs the VLA model, perception, and planning at **1–30 Hz**, soft real-time, on a GPU/AI SoC. 
Latency matters but a hiccup degrades behavior rather than dropping the robot. This is the classic "fast reflexes, slow deliberation" architecture, and it mirrors the sensing-rate split from earlier: the control loop is ~1000× faster than the thinking loop. ### The silicon The AI layer in 2026 commonly runs on **NVIDIA Jetson Thor** class hardware (high TOPS, automotive/robotics-grade, ~tens to low-hundreds of watts) or custom in-house silicon (Tesla, for instance, leverages its own inference accelerators). The numbers vendors care about: - **TOPS / FLOPS** for VLA inference throughput. - **Memory bandwidth and capacity** — modern VLA models are large; getting them on-device and fast is a real constraint. - **Power and thermal** — every watt of compute is a watt off the battery and heat to reject (see the power section). The real-time layer is unglamorous by comparison — ARM Cortex-R/M class microcontrollers and a deterministic bus (EtherCAT, CAN-FD, or a custom high-rate link) tying the joints together. > **Rule of thumb:** If a humanoid's AI compute is on-board (not streamed to a server), it's spending 100–500 W continuously and rejecting that as heat. Cloud-offloading the AI saves power and heat but adds latency and a connectivity dependency that's unacceptable for balance-critical loops — which is why the *control* layer is always local, no matter what. ## The teleoperation reality This is the section the rest of the industry would prefer you skip. Teleoperation — a human remotely driving the robot, often via a VR headset and hand-tracking gloves or a motion-capture rig — is pervasive in humanoid robotics, and it plays two very different roles. ### The legitimate role: data collection VLA models need demonstrations — thousands of hours of a robot doing the task, with the exact sensor inputs and motor outputs. The cleanest way to generate that data is to have a human *teleoperate the actual robot* through the task many times. 
The robot's body experiences the real physics; the human provides the intelligence; the recordings train the policy. This is honest, necessary, and how most current manipulation policies are bootstrapped. Sanctuary, 1X, Figure, and Tesla all run large teleop data operations. ### The dishonest role: faking autonomy The same teleop rig, pointed at a camera, produces a video of a robot "autonomously" folding laundry or fetching a drink — when in fact a person in the next room is driving every motion. Sometimes it's disclosed in fine print; often it isn't. Other times the demo is genuinely autonomous but is a narrow policy that *only* works on that exact scene, lighting, and object set, and would fail if you moved a cup 10 cm. ### How to read a humanoid demo critically > **The honest take — the teleop tell-sheet:** > - **Smooth, confident, human-paced manipulation** with no hesitation? Likely teleoperated. Autonomous policies in 2026 are jerky, slow, and pause to "think." > - **A single uncut take of a long task chain?** Strong autonomy signal — or strong teleop signal. Look closer. > - **No mention of autonomy in the caption?** Assume teleop. Companies that achieve autonomy say so loudly and specifically. > - **The robot recovers from an unexpected perturbation** (someone moves an object mid-task)? That's hard to fake and a real autonomy signal. > - **Cuts between every action?** Each segment may be a separate take, retried until it worked. > - **"X% autonomous" or "speed 1.0x" captions?** Companies started adding these because the credibility problem got bad enough to address. Reward the disclosure; don't assume its absence means autonomy. > - **Same scene, same objects, same lighting every time?** Probably a scene-specific policy, not generalization. None of this means teleop is bad — it's a vital tool. It means you should never infer *autonomy* from a *demo* without explicit, specific disclosure. 
The gap between "the robot can physically do this" and "the robot decided to do this by itself" is the entire unsolved problem, and demos are designed to blur it. ## Manufacturing & cost The thesis that makes humanoids an investable category is **cost at volume**: that a useful humanoid can be built for under $50k, and eventually under $20k, putting it below the multi-year cost of the human labor it might augment. Whether that's true is a manufacturing question, and manufacturing is where Tesla and the automakers think they have an edge. ### Where the money goes From the BoM table earlier, **actuators and hands dominate** — together commonly 50–70% of hardware cost. This is the opposite of consumer electronics, where silicon dominates. A humanoid is an *electromechanical* product, so its cost curve is set by motors, gearboxes, screws, bearings, and precision assembly — not by chips, which are comparatively cheap and commoditized. ### The levers to <$50k - **Vertical integration of actuators.** Buying off-the-shelf harmonic drives and servo motors is expensive at low volume. Designing your own actuators (Tesla, Figure, Boston Dynamics) lets you optimize per-joint, remove margin stacking, and design for high-volume production. This is the single biggest cost lever. - **Design for manufacture (DfM).** Reducing part count, using castings/stampings over machined parts, standardizing actuators across joints (one or two actuator "sizes" reused everywhere), and minimizing fasteners and wiring. - **Volume.** Most of the <$20k story is amortization — tooling, automation, and supply-chain scale that only pay off at tens of thousands of units per year. At hundreds of units, every humanoid is effectively hand-built and costs 5–10× the target. - **Simplify the hard parts.** The fastest way to cut the BoM is to ship simpler hands and fewer DoF. Much of the price spread between robots is a hand-complexity decision. 
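
The volume lever is easiest to see as plain amortization arithmetic. The sketch below is a toy model under loud assumptions: the per-robot variable cost and the yearly fixed cost are hypothetical values chosen only to illustrate the shape of the curve, and the variable cost is held constant even though real supply-chain scale would lower it too.

```python
def unit_cost(variable_cost, fixed_cost, units_per_year):
    """Per-robot cost when yearly fixed costs are spread over yearly volume."""
    if units_per_year <= 0:
        raise ValueError("units_per_year must be positive")
    return variable_cost + fixed_cost / units_per_year


VARIABLE = 15_000      # hypothetical parts + assembly labor per robot, USD
FIXED = 40_000_000     # hypothetical tooling, automation, engineering per year, USD

for volume in (300, 3_000, 30_000):
    cost = unit_cost(VARIABLE, FIXED, volume)
    print(f"{volume:>6} units/yr -> ${cost:,.0f} per robot")
```

With these placeholder inputs, the per-robot cost comes out near $148k at 300 units per year, $28k at 3,000, and $16k at 30,000. Against a $20k target, the low-volume case lands inside the 5–10× range described above, and the same design only approaches the target at tens of thousands of units.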
### What does *not* drive cost down Exotic materials and clever lightweighting are mostly a distraction at this stage — carbon fiber and titanium add cost, not remove it. The robots winning on cost (Unitree) win through aggressive supply-chain leverage and accepting lower-end performance, not materials science. > **The honest take:** The <$20k humanoid is a *volume* claim, not a *technology* claim. The technology to build a $20k humanoid exists today; the volume to make it cost $20k does not. Until someone is shipping tens of thousands per year, treat sub-$30k price tags as roadmap, not reality. Unitree's ~$16k G1 is real, but it's a lightweight research platform, not a 25 kg-payload labor robot — different product, different cost basis. ## The 2026→2027 outlook Putting the subsystems together, here's a defensible read on where this goes near-term. ### What's real - **The hardware works.** Walking, balancing, two-arm coordination, basic grasping, and dynamic recovery are demonstrated and reproducible across multiple vendors. The body is no longer the blocker. - **Structured commercial deployment.** Warehouses, fixed manufacturing cells, and other bounded environments will see real, paid humanoid (and humanoid-adjacent) work expand. Agility Digit is the template: pick a narrow job, nail it, scale it. - **Teleop-driven data flywheels.** The companies collecting the most real-robot demonstration data are building a genuine moat, because that data trains the policies that close the autonomy gap. ### What's hype - **The general home robot.** A humanoid that autonomously handles arbitrary household tasks reliably is *not* a 2026–2027 product. The unstructured home is the hardest environment and the furthest from being solved. - **Sub-$20k price tags at useful capability.** Roadmap, not reality, until volume manufacturing exists. - **Most "autonomous" manipulation reels.** See the teleop section. Discount accordingly. 
### Where the bottlenecks are The bottleneck has moved off the actuator and onto **software and data**: 1. **Generalization** — policies that work outside their training distribution. This is the big one. 2. **Manipulation reliability** — dexterous, robust grasping of arbitrary objects, which needs better hands *and* better tactile-informed policies. 3. **Data** — enough high-quality real-robot demonstrations to train general policies, which is why teleop data ops are a strategic asset. 4. **Cost-at-volume** — a manufacturing and capital problem, downstream of demand that depends on (1)–(3). > **The honest take for 2026→2027:** Expect impressive, narrowing-scope commercial deployments and continued spectacular demos. Expect the autonomy gap to close *gradually*, not in a single breakthrough. The companies that win will be the ones quietly grinding on data and reliability in boring structured environments — not the ones with the best laundry-folding video. The hardware race is largely over; the data-and-software race is just getting started. ## Frequently asked questions **How many degrees of freedom does a typical humanoid robot have?** Most capable 2026 humanoids have **28–60 actuated DoF**. The body (legs, arms, torso, neck) is usually ~28–32 DoF; hands can add anywhere from 12 (two simple 6-DoF hands) to 40+ (two anthropomorphic hands), which is why total counts vary so widely. When comparing robots, separate body DoF from hand DoF — vendors inflate headline numbers with finger joints. **What is the hardest part of building a humanoid robot?** The hardware answer is **actuators** (torque density, efficiency, backdrivability, thermal limits) and **hands** (dexterity in a tiny, expensive package). The system answer is **autonomy** — letting the robot reliably decide and execute tasks in unstructured environments. In 2026 the body is largely solved; the brain and the data to train it are the bottleneck. 
**Are humanoid robot demos real or teleoperated?** Many are teleoperated, either openly (as legitimate data collection) or misleadingly (faking autonomy). Smooth, fast, confident manipulation with no hesitation is a teleop tell; jerky, slow, pausing behavior and recovery from unexpected perturbations are autonomy signals. Never infer autonomy without explicit, specific disclosure. **Why rotary vs. linear actuators in humanoids?** Rotary quasi-direct-drive (QDD) actuators are backdrivable and give force control "for free" from motor current — great for dynamic, contact-rich joints (hips, ankles, shoulders). Linear ball-screw actuators give very high force density and hold static loads efficiently — great for high-load joints like knees. Tesla's Optimus deliberately uses both, choosing per-joint. There's no single winner. **How long can a humanoid robot run on one charge?** Typically **1–5 hours**, depending on duty cycle, from a ~1–2.3 kWh battery. Standing draws a few hundred watts (including constant compute), walking ~0.5–1.5 kW, and heavy manipulation can spike to several kW. Continuous all-day operation realistically requires hot-swappable battery packs or docking, not a single charge. **How much do humanoid robots cost in 2026?** Research platforms like Unitree G1 start around **$16k**; capable labor-oriented humanoids are far more (Unitree H1 ~$90k+; others undisclosed). Targets of <$50k and eventually <$20k are *volume manufacturing* claims that depend on producing tens of thousands of units per year — they are roadmap, not 2026 pricing for a high-payload robot. **What sensors does a humanoid robot use?** Two layers. Proprioception (fast, essential): joint encoders, joint torque sensing or motor-current estimation, one or more IMUs, and foot force/contact sensors. Exteroception (for AI): multiple RGB cameras, depth (stereo/ToF, sometimes head LiDAR), and wrist/hand cameras for manipulation. 
Proprioception is mature and cheap; tactile sensing and vision-to-action fusion are the hard, immature parts. **Why are robot hands so difficult and expensive?** You can't fit motors in human-sized fingers, so actuation moves to the forearm and transmits via tendons (compact but maintenance-heavy) or linkages (durable but bulky). Add tactile sensing, high DoF, and low production volume, and a pair of dexterous hands can cost as much as both legs. Most shipping humanoids use simplified hands precisely because the cost-and-control burden of full dexterity isn't yet worth it. **Is bipedal walking a solved problem?** Flat-floor walking is essentially solved and has been for years. **Robust** walking — over debris, slopes, and stairs, while carrying a load and resisting pushes — is not. It requires torque-controllable joints, fast foot-force sensing, and whole-body/model-predictive control running at high rate. If a robot heel-strikes and recovers from shoves, it's running modern torque-level control; if it walks flat-footed with bent knees, it's running a conservative ZMP-style controller. **What compute does a humanoid need?** Two computers. A real-time control layer (1–10 kHz, hard deadlines, on MCUs/RTOS) for balance and joint control, and an AI inference layer (1–30 Hz, soft real-time, on a GPU/SoC like NVIDIA Jetson Thor or custom silicon) for the VLA model and planning. The control loop runs ~1000× faster than the thinking loop, and the AI layer draws 100–500 W continuously. **Which humanoid robot is the most advanced?** "Advanced" depends on the axis. Boston Dynamics Atlas (electric) leads on dynamic athleticism and range of motion; Tesla Optimus and Figure lead on the manufacturing-and-AI integration thesis; Unitree leads on cost and accessibility. Commercially, **Agility Digit** is furthest along in paid real-world deployment precisely because it targets a narrow, structured warehouse job rather than general capability. 
**Will humanoids replace human workers in 2026–2027?** Not broadly. Expect them in bounded, structured commercial settings (warehouses, fixed manufacturing cells) where the task is well-defined, and slow progress in open-ended environments like homes. The bottleneck is autonomy and reliability, not bodies. Treat near-term deployment as task-specific augmentation, not general labor replacement. ## Changelog - **2026-05-21** — Initial publication. --- # Legged & Quadruped Robot Hardware: The Ultimate Guide URL: https://blog.robo2u.com/posts/legged-quadruped-robot-hardware-ultimate-guide/ Published: 2026-05-19 Updated: 2026-06-20 Tags: quadruped-robots, legged-robots, spot, unitree, anymal, quasi-direct-drive, locomotion, mit-cheetah, robotics-hardware, guide Reading time: 36 min > An engineer's deep dive into legged and quadruped robot hardware — QDD actuators, leg kinematics, gaits, sensing, power, and the 2026 roster (Spot, Unitree, ANYmal) with real numbers and selection guidance. A wheel is a beautiful solution to a flat-world problem. The moment the world stops being flat — stairs, rubble, mud, a 200 mm curb, a catwalk in a substation — the wheel's elegance becomes a liability and you start wishing you had feet. Legged robots exist to put a foot exactly where they choose, ignore everything in between, and keep a payload level while the ground beneath does whatever it wants. This is the long version of how that hardware actually works. We'll go through why you'd pick legs at all, the 2026 quadruped roster you can actually buy, leg kinematics and the standard 3-DoF leg, the quasi-direct-drive (QDD) actuator revolution that made dynamic legged robots practical, the gaits and control rates that drive the hardware spec, sensing and state estimation, power and runtime, why four legs is genuinely easier than two, the honest applications, and how to choose or build one. Real numbers with units, real products, opinions with reasons attached. 
**The take**: Legged robots are not better than wheels — they are more expensive, less efficient, and less reliable per meter traveled, and they win only when the terrain denies wheels entirely. What changed between 2015 and 2026 is not that legs got cheaper to *run* but that they got cheaper to *build*: the MIT Cheetah insight — a low-ratio brushless motor running field-oriented control is a backdrivable, force-controllable, impact-tolerant actuator — collapsed the cost and complexity of a usable leg by an order of magnitude, and Unitree turned that into a sub-$3,000 quadruped. The actuator is the whole story; everything else is plumbing around it. Companion reading: [robot actuators](/posts/robot-actuators-ultimate-guide/), [quasi-direct-drive & BLDC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), and [humanoid robot hardware](/posts/humanoid-robot-hardware-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [Why legs at all](#why-legs) 3. [The 2026 quadruped roster](#roster) 4. [Leg design & kinematics](#kinematics) 5. [The QDD actuator revolution](#qdd) 6. [Why QDD beat geared-plus-sensor legs](#qdd-vs-geared) 7. [Gaits & dynamics: what the hardware must do](#gaits) 8. [Sensing for locomotion](#sensing) 9. [Balance & control: MPC, WBC, and RL](#control) 10. [Power & runtime](#power) 11. [Bipeds vs quadrupeds](#bipeds) 12. [Applications & honest ROI](#applications) 13. [Building or selecting a legged robot](#building) 14. [Frequently asked questions](#faq) ## Key takeaways - Legs win only when terrain denies wheels. On flat ground a wheeled robot beats a legged one on efficiency, speed, payload, reliability, and cost by wide margins. Pick legs for stairs, rubble, gaps, and unstructured outdoor terrain — not because they look impressive. - The **cost of transport (CoT)** is the honest scoreboard. 
A car sits around 0.1–0.3, a walking human ~0.2, Boston Dynamics Spot roughly 0.5–0.7, and the original hydraulic-era legged robots were far worse. Legs pay an energy tax for the privilege of choosing footholds. - The **standard quadruped leg has 3 actuated degrees of freedom**: hip abduction/adduction (roll), hip flexion/extension (pitch), and knee flexion. Twelve actuators total. That is the minimum to place a foot in 3D and control body pose. - The **QDD actuator** — a high-pole-count BLDC motor, a single-stage 6:1–10:1 planetary gear, and field-oriented current control — is the enabling technology. It is backdrivable, lets you estimate joint torque from motor current without a torque sensor, survives impacts, and runs control loops at 1+ kHz. - QDD beat the old approach (high gear ratio + dedicated torque/force sensors) on transparency, impact tolerance, bandwidth, and cost. The gearbox-ratio sweet spot for legs is **roughly 6:1 to 10:1**. - Dynamic gaits (trot, bound, flying trot) need **fast torque loops — 1 kHz at the joint, hundreds of Hz for the body controller** — because the robot is statically unstable and recovers by accelerating the legs. - The state estimate is mostly **proprioceptive**: IMU + joint encoders + a leg kinematic/contact model fused in an EKF give body velocity and orientation. Exteroception (depth cameras, LiDAR) is for terrain ahead, not for staying upright. - The control stack in 2026 is a layered mix: **model predictive control (MPC)** or **whole-body control (WBC)** for model-based platforms, increasingly displaced or augmented by **reinforcement-learning policies trained in simulation** and transferred sim-to-real. - Runtime is **1–4 hours** for most commercial quadrupeds; legs are energy-hungry and battery is heavy. Hot-swappable packs and dock-charging are how fleets stay useful. 
- **Four legs is genuinely easier than two**: a quadruped can keep three feet down (a stable tripod) during slow gaits and never has to balance on a single contact. Bipeds are always one bad step from falling. Quadrupeds are the proving ground for the actuators and control that humanoids inherit.
- The real applications are **inspection, security patrol, mapping, and research** — not households. ROI is real in industrial inspection where the alternative is sending a person into a hazardous or remote site repeatedly.
- **Unitree broke the price floor.** A research-grade quadruped went from ~$75,000 (Spot-class) to ~$1,600 (Unitree Go2 base) between 2019 and 2024, reshaping who can do legged-robot research.

## Why legs at all

Start with the uncomfortable truth: for almost every job a mobile robot does, wheels are the right answer. They're efficient, simple, cheap, and reliable. If you're moving boxes across a warehouse floor, building a legged robot to do it is engineering malpractice. See the [mobile robots (AMR/AGV) guide](/posts/mobile-robots-amr-agv-ultimate-guide/) for the world where wheels rightly dominate.

Legs earn their place on exactly one axis: **terrain that wheels and tracks cannot negotiate.** Discrete footholds. A robot with legs touches the ground only where it chooses to, and ignores everything in between. A wheel must roll over (or fail to roll over) every point along its path; a leg steps across the bad parts. That is the entire value proposition, and it is a real one for stairs, rubble fields, gaps, steep loose slopes, and the cluttered interiors of industrial plants designed for humans.

### The cost-of-transport tax

The price of that capability is energy. The standard dimensionless metric is the **cost of transport (CoT)**, also called specific resistance:

```
CoT = E / (m · g · d)

  E = energy used to travel distance d  [J]
  m = total mass                        [kg]
  g = 9.81 m/s^2
  d = distance traveled                 [m]

Lower is better. CoT is dimensionless.

Reference points:
  Freight train          ~0.02
  Bicycle (human)        ~0.05
  Automobile             ~0.1 - 0.3
  Walking human          ~0.2
  Wheeled mobile robot   ~0.1 - 0.3
  Boston Dynamics Spot   ~0.5 - 0.7   (electric, modern)
  Early legged robots    >1.0 - 3.0   (hydraulic era)
```

> Rule of thumb: a modern electric quadruped costs roughly **2–5× more energy per meter** than a wheeled robot of similar mass on flat ground. You are buying terrain access with battery.

The hydraulic-era machines (early Atlas, BigDog) were far worse — CoT often above 1.0 — because hydraulic power units dump enormous energy as heat. The shift to electric QDD actuators is the single biggest reason CoT dropped into the 0.5 range, which is what made battery-powered legged robots useful for more than a demo.

### When legs actually win

Be honest with yourself about the use case. Legs win when **all** of these are true: the terrain is genuinely non-wheelable, the mission tolerates 1–4 hour runtimes, and the value of the data or task at the far end justifies a $30k–$150k machine. That describes substation and oil-and-gas inspection, underground mining, disaster response, construction site monitoring, and research. It does not describe warehouse logistics, last-mile delivery on sidewalks (wheels plus a small step-climb mechanism usually win), or your living room floor.

There's also a hybrid answer worth respecting: **wheeled legs** (wheels on the end of articulated legs, like ANYbotics' and Swiss-Mile's research platforms, or the DEEP Robotics wheeled variants). These roll efficiently on flat ground and walk only when they must, clawing back much of the CoT gap. If your environment is 90% flat with occasional steps, that's often the smart hardware choice.

## The 2026 quadruped roster

Here is the landscape you can actually procure in 2026, from premium industrial to disruptive consumer-research. Numbers are manufacturer-published or well-established field figures; treat price especially as approximate and configuration-dependent.

| Robot | Mass | Payload | Top speed | Runtime | DoF | Indicative price |
|---|---|---|---|---|---|---|
| Boston Dynamics **Spot** | ~32–34 kg | ~14 kg | ~1.6 m/s | ~90 min | 12 | ~$75,000+ |
| Unitree **Go2** (Air/Pro/EDU) | ~15 kg | ~8 kg | up to ~3.5–5 m/s | ~1–2 h | 12 | ~$1,600–$16,000 |
| Unitree **B2** | ~60 kg | ~40 kg (up to ~120 kg static) | ~6 m/s | ~2–4 h | 12 | ~$100,000 |
| Unitree **A1** (legacy) | ~12 kg | ~5 kg | ~3.3 m/s | ~1–2.5 h | 12 | ~$10,000 (discontinued) |
| ANYbotics **ANYmal** (D/X) | ~50 kg | ~10–15 kg | ~1.3 m/s | ~2–4 h | 12 | ~$150,000+ |
| Ghost Robotics **Vision 60** | ~51 kg | ~10–14 kg | ~3 m/s | ~3 h | 12 | ~$100,000+ |
| DEEP Robotics **X30** | ~56 kg | ~20 kg | ~4 m/s | ~2.5–4 h | 12 | ~$50,000+ |
| MIT **Mini Cheetah** (research) | ~9 kg | small | ~2.5+ m/s | ~lab | 12 | research platform |

A few editorial notes on this table:

**Spot** is the reference design for industrial inspection: rugged, IP54, a mature SDK, a real payload ecosystem (the Spot CAM, the arm, third-party sensor packages), and the only one with a serious commercial deployment story across dozens of industries. You pay for the ecosystem and the reliability, not the raw specs.

**Unitree** is the disruptor. The Go2 at consumer prices put a capable QDD quadruped in every robotics lab's budget, and the B2 is a serious industrial machine at a fraction of Western pricing. The catch is the export, support, and data-governance questions that make some Western industrial and defense buyers nervous.

**ANYmal** (a spinout from ETH Zurich) is the research-pedigree industrial platform — exceptional terrain capability, strong autonomy stack, IP67-class sealing for harsh industrial environments, and the deepest published academic record (it's the platform behind much of the leading RL-locomotion research).

**Ghost Robotics Vision 60** leans into defense and security: rugged, all-weather, and notable for designs that tolerate operating inverted and self-righting.

**DEEP Robotics** (X30, Lite3, Lynx wheeled-leg) is the other strong Chinese player, with a focus on industrial inspection and an impressive stair/terrain record. ## Leg design & kinematics ### The standard 3-DoF leg Almost every modern quadruped uses the same leg topology: **three actuated joints per leg**, twelve total. 1. **Hip abduction/adduction (HAA)** — roll axis, swings the whole leg outward and inward from the body. This is what lets the robot widen its stance for stability and shift weight laterally. 2. **Hip flexion/extension (HFE)** — pitch axis, swings the upper leg (thigh) forward and back. The main propulsion joint. 3. **Knee flexion/extension (KFE)** — pitch axis at the knee, folds the lower leg (shank). Sets foot height and, with the hip, foot reach. Three DoF is the minimum to place the foot anywhere in a 3D workspace and still have enough control authority over body roll, pitch, and height. You *can* build 2-DoF legs (cheaper, planar-only, fine for a toy or a treadmill experiment), but you give up lateral balance and the ability to recover from sideways pushes. Nobody serious ships 2-DoF. ### Serial vs parallel, and where the motors live Two big architectural choices shape the leg: **Where you put the actuators.** The dynamics-friendly trick — pioneered hard by MIT Cheetah and adopted widely — is to **co-locate the heavy motors near the hip/body and drive the knee through a linkage or belt**, so the lower leg is light. Leg swing dynamics are dominated by the inertia of the distal links; a light shank means the leg can be accelerated fast (essential for dynamic gaits) and means less energy lost on every step. Spot, Unitree, and ANYmal all cluster mass proximally. **Serial vs parallel linkage.** A serial leg stacks joint-on-joint (motor at hip, motor at knee mounted on the thigh). A parallel/coaxial design mounts both pitch motors at the hip and drives the knee through a four-bar or a pushrod, keeping the shank a near-massless strut. 
Parallel mechanisms reduce distal inertia at the cost of kinematic complexity and a workspace that's harder to reason about. Most high-performance quadrupeds use some parallel element for the knee.

### The leg Jacobian: turning torque into foot force

The reason QDD legs can do force control without a force sensor lives in the **leg Jacobian**, which maps joint velocities to foot velocity and (by the transpose) joint torques to foot force:

```
Foot velocity:                 v_foot = J(q) · q_dot
Foot force <-> joint torque:   tau    = J(q)^T · F_foot

  q      = joint angles [rad]   (e.g. [HAA, HFE, KFE])
  J(q)   = leg Jacobian (3x3 for a 3-DoF leg)
  v_foot = foot Cartesian velocity [m/s]
  tau    = joint torques [N·m]
  F_foot = Cartesian foot force [N]

Because a QDD joint lets you estimate tau from motor current,
you can read foot force F_foot = J^-T · tau and command it
back through tau = J^T · F_foot_desired — no load cell at the foot.
```

> Key insight: with backdrivable, torque-transparent joints, the *whole leg becomes a programmable spring/damper.* You command a desired foot force as a function of foot position and velocity (an impedance), and the robot lands soft, absorbs impacts, and conforms to terrain — all in the actuator, no fancy feet required.

This is also why motion planning for legged robots is its own discipline: you're not just placing a foot, you're choosing footholds, swing trajectories, and contact forces simultaneously. See the [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/) for the trajectory and inverse-kinematics machinery underneath.

## The QDD actuator revolution

If you remember one thing from this guide, remember this section. The quasi-direct-drive actuator is *the* reason legged robots went from million-dollar lab curiosities to $3,000 commodities.

### The MIT Cheetah insight The conventional robotics actuator is a small, fast motor behind a high-ratio gearbox (50:1, 100:1, even 160:1 harmonic drives — see the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/)). That gives you enormous torque from a tiny motor, beautiful position accuracy, and a joint that holds position with the power off. It is the right answer for an industrial arm. It is the *wrong* answer for a leg, and the MIT Biomimetic Robotics Lab (Sangbae Kim's group) made the argument concrete around 2013–2018. A leg has to do three things a high-ratio gearbox is terrible at: 1. **Survive impacts.** Every footfall is a collision. A high-ratio gearbox reflects the motor's inertia to the output multiplied by the ratio *squared* — the joint feels enormously heavy and brittle on impact, and the gear teeth take the shock. 2. **Be backdrivable.** A leg must yield to the ground, not fight it. High-ratio gears (especially harmonic and worm) are barely backdrivable; the leg behaves like a rigid stick. 3. **Control force fast and cleanly.** Force control through a stiff high-ratio gearbox means bolting on a torque sensor and closing a loop around its noise and the gearbox's friction/backlash. The QDD answer: **use a big, high-torque, low-KV brushless motor and a single-stage planetary gearbox with a low ratio — roughly 6:1 to 10:1.** Run it with [field-oriented control (FOC)](/posts/motor-controllers-foc-ultimate-guide/), which lets you command motor *torque* directly (torque is proportional to quadrature-axis current). Now the gear ratio is low enough that: - The motor is **backdrivable** through the gearbox by hand. - Joint torque is **proportional to motor current**, which you already measure for FOC. **You get a torque sensor for free** — proprioceptive torque sensing. - Reflected inertia is small, so the joint **tolerates impacts** and the control loop sees a clean, near-linear plant. 
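
The "torque sensor for free" point can be made concrete with a small sketch. The Python below is illustrative only. It assumes a simplified planar two-link leg (hip pitch and knee, not the full 3-DoF leg), torque that is linear in q-axis current, and gearbox losses lumped into a single `efficiency` factor. The function names, link lengths, and the torque constant `kt` are hypothetical and not taken from any real actuator. It converts measured currents to joint torques, then maps them to a foot force through the Jacobian transpose.

```python
import numpy as np


def joint_torque_from_current(i_q, kt, gear_ratio, efficiency=1.0):
    """Estimate joint output torque [N·m] from q-axis current [A].

    Assumes torque is linear in i_q (no saturation) and lumps all
    gearbox losses into one hypothetical efficiency factor.
    """
    return i_q * kt * gear_ratio * efficiency


def planar_leg_jacobian(q_hip, q_knee, l_thigh, l_shank):
    """Jacobian of a 2-link planar leg: foot (x, z) relative to the hip.

    Angles are measured from straight down; q_knee is relative to the thigh.
      x =  l_thigh*sin(q_hip) + l_shank*sin(q_hip + q_knee)
      z = -l_thigh*cos(q_hip) - l_shank*cos(q_hip + q_knee)
    """
    s1, c1 = np.sin(q_hip), np.cos(q_hip)
    s12, c12 = np.sin(q_hip + q_knee), np.cos(q_hip + q_knee)
    return np.array([
        [l_thigh * c1 + l_shank * c12, l_shank * c12],
        [l_thigh * s1 + l_shank * s12, l_shank * s12],
    ])


def joint_torques_for_foot_force(f_foot, q, l_thigh, l_shank):
    """Joint torques [N·m] for a desired foot force [N]: tau = J^T · F."""
    J = planar_leg_jacobian(q[0], q[1], l_thigh, l_shank)
    return J.T @ np.asarray(f_foot, dtype=float)


def foot_force_from_currents(i_q, q, kt, gear_ratio, l_thigh, l_shank):
    """Estimate planar foot force [N] from motor currents: F = J^-T · tau."""
    tau = joint_torque_from_current(np.asarray(i_q, dtype=float), kt, gear_ratio)
    J = planar_leg_jacobian(q[0], q[1], l_thigh, l_shank)
    # Solve J^T F = tau instead of forming the inverse explicitly.
    return np.linalg.solve(J.T, tau)


# Usage with made-up numbers: bent knee, foot directly under the hip.
q = (0.5, -1.0)                       # [rad]
tau_cmd = joint_torques_for_foot_force([0.0, 100.0], q, 0.2, 0.2)
i_q_measured = tau_cmd / (0.1 * 8.0)  # pretend the drive reports these currents
print(foot_force_from_currents(i_q_measured, q, 0.1, 8.0, 0.2, 0.2))
```

The solve step fails near a fully straight leg (`q_knee = 0`), where the Jacobian is singular, so a real estimator would need to guard that case. Friction and rotor acceleration are also ignored here, which is why real systems treat current-based torque as an estimate, not a measurement.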

```
Reflected inertia at the joint output:

  J_reflected = N^2 · J_motor + J_gear_output

  N       = gear ratio
  J_motor = motor rotor inertia [kg·m^2]

Because reflected inertia scales with N^2, dropping from a
100:1 harmonic drive to an 8:1 planetary cuts the reflected
rotor inertia by ~(100/8)^2 ≈ 156x. That is the difference
between a leg that shatters on impact and one that bounces.
```

```
Backdrive torque (torque you must apply at the output to
move the motor backward through the drive):

  tau_backdrive ≈ (J_motor · N^2 · alpha_out) / eta_backdrive
                  + friction terms

  alpha_out     = output angular acceleration [rad/s^2]
  eta_backdrive = gearbox efficiency when driven from the output

Low N and high gearbox efficiency (eta) keep this tiny.
For an 8:1 single-stage planetary at ~90% efficiency the
leg backdrives with a few N·m — you can push it with one
hand. A 100:1 harmonic drive may need tens of N·m and a
lot of breakaway friction; effectively non-backdrivable.
```

For more on the motor and drive side of this, see the [BLDC motors guide](/posts/brushless-dc-motors-bldc-ultimate-guide/) (pole count, KV, torque density) and the [FOC motor-controllers guide](/posts/motor-controllers-foc-ultimate-guide/) (how current becomes torque at 20+ kHz).

### What a real QDD module looks like

A modern QDD leg module — MIT Cheetah's actuator, Unitree's GO-M8010, the open-source mjbots qdd100, or T-Motor's AK-series — is a tidy package:

- A **large-diameter, high-pole-count (often 14–21 pole-pair) outrunner BLDC**, optimized for torque density at low speed.
- A **single-stage planetary gearbox, 6:1–10:1**, with low friction and good backdrive efficiency.
- An **integrated FOC drive** on a board inside the housing, talking CAN or EtherCAT.
- **Two encoders** — one on the rotor (commutation + velocity), one on the output (absolute joint angle), so you read both motor and joint position. See the [encoders guide](/posts/encoders-ultimate-guide/).
- Continuous torque on the order of **15–35 N·m** with **peak torque 2–4× that** for impact and dynamic moves, in a package weighing **~0.5–1.0 kg**.

That last point matters: per-actuator torque density (N·m/kg) is the spec that sizes the whole robot. Higher torque density means a lighter leg, which means lower distal inertia, which means faster, more dynamic gaits. It's a virtuous loop the whole industry is climbing.

## Why QDD beat geared-plus-sensor legs

It's worth laying the two philosophies side by side, because the choice isn't obvious until you've felt both fail.

| Property | High-ratio gearbox + torque/force sensor | QDD (low ratio + FOC, proprioceptive) |
|---|---|---|
| Gear ratio | 50:1 – 160:1 (harmonic) | 6:1 – 10:1 (single-stage planetary) |
| Backdrivability | Poor to none | Excellent |
| Torque sensing | Dedicated sensor (load cell / strain gauge) | From motor current — "free" |
| Impact tolerance | Low — gear teeth + sensor take shock | High — low reflected inertia, motor cushions |
| Control bandwidth | Limited by sensor noise + gearbox dynamics | High — clean near-linear plant, 1+ kHz |
| Reflected inertia | High (∝ N²) | Low |
| Position accuracy | Excellent | Good (needs output encoder) |
| Efficiency (steady load) | High at the gearbox; motor small | Lower gear loss; motor runs harder |
| Cost / complexity | High (precision gears + sensors) | Lower (commodity motor + board) |
| Holds position, power off | Yes (self-locking) | No — must hold with current |
| Best for | Precise arms, slow heavy joints | Dynamic legs, contact-rich motion |

The geared-plus-sensor approach isn't wrong — it's exactly right for a precision industrial arm, where you want stiffness, accuracy, and the joint to hold position when de-energized. It's wrong for a *leg*, where the dominant requirements are impact survival, transparency, and torque bandwidth.

> The gearbox-ratio sweet spot for legs is roughly **6:1 to 10:1.** Below ~6:1 you can't get enough torque without a huge, heavy motor. Above ~10:1 you start losing backdrivability and gaining reflected inertia, and you're sliding back toward the geared-arm regime. Most QDD leg modules cluster at 7:1–9:1.

There's a cost to QDD honesty: because the joint is *not* self-locking, the robot burns current just to stand still holding a pose (gravity compensation), and it can't go limp-but-locked when powered off. That standing-power cost is a real chunk of the runtime budget and one reason legged robots crouch and sit when idle.

## Gaits & dynamics: what the hardware must do

The gait you want determines the control rate you need, which determines the actuator bandwidth you must buy. Hardware follows from dynamics.

### Static vs dynamic gaits

A **static gait** keeps the robot's center of mass inside the support polygon (the convex hull of feet on the ground) at all times. A quadruped walking by lifting one leg at a time always has a stable tripod under it. It's slow, safe, and — crucially — doesn't require fast control. A static crawl can be run at modest loop rates and survives clumsy hardware. This is how you climb a ladder-like obstacle carefully.

A **dynamic gait** — trot (diagonal pairs), pace, bound, gallop, pronk — deliberately leaves the robot *statically unstable* for part of the cycle. During a flying trot both diagonal pairs may briefly leave the ground. The robot doesn't fall because it's continuously catching itself: the controller predicts where the body is going and places the next foot to redirect it. This is fast (the 3–6 m/s top speeds in the roster come from dynamic gaits) and it is hard.

### Why you need 1 kHz torque loops

Dynamic balance is a race against gravity. A toppling body accelerates; the longer your control loop's period, the further it's fallen before you react, and the harder the correction. Concretely:

- The **low-level joint torque loop runs at ~1 kHz** (1 ms period). This is the loop that takes a desired joint torque and commands the FOC current controller. (The FOC current loop *underneath* it runs far faster, ~10–40 kHz.)
- The **whole-body / MPC controller runs at ~100–500 Hz**, recomputing desired contact forces and body trajectory. - A **footstep / gait planner runs at ~10–50 Hz**, deciding where feet go. > Rule: if your joints can't accept new torque commands at 1 kHz with low latency, you cannot do robust dynamic locomotion. This is why legged robots use [real-time control systems](/posts/real-time-control-systems-ultimate-guide/) — deterministic timing on CAN/EtherCAT buses and an RTOS or PREEMPT_RT Linux. Jitter is the enemy; a 5 ms hiccup at the wrong moment is a fall. The QDD actuator earns its keep here too: a clean, low-inertia, near-linear joint plant is *controllable* at 1 kHz. A high-ratio geared joint with backlash and sensor lag fights you at those rates. ## Sensing for locomotion A walking robot needs to answer two questions continuously: *where is my body and how is it moving?* (proprioception) and *what does the ground ahead look like?* (exteroception). The first keeps it upright; the second lets it choose footholds. See the [robot sensors guide](/posts/robot-sensors-ultimate-guide/) for the full sensor taxonomy. ### The proprioceptive state estimate This is the heart of staying upright, and it's almost entirely **internal** sensing: - **IMU** (a 6-axis or 9-axis MEMS unit at the body) — gives angular rate and linear acceleration at high rate (hundreds of Hz to kHz). It's the fastest indicator of body orientation and motion, but it drifts when integrated. - **Joint encoders** — one per actuated joint (and ideally a second at the output, as the QDD module provides). These give exact leg geometry, so via forward kinematics you know where each foot is relative to the body. See the [encoders guide](/posts/encoders-ultimate-guide/). - **Foot contact sensing** — whether a foot is loaded. Some robots use explicit contact switches or foot force sensors; many QDD robots infer contact from *joint torque* (the foot pushing back shows up as torque you can read from current). 
Knowing which feet are stance feet is essential for the estimator. These fuse in an **extended Kalman filter (EKF)** (or a factor-graph estimator) that combines IMU integration with leg-kinematic "velocity measurements": when a foot is firmly planted, the kinematics tell you the body's velocity relative to that fixed contact, which corrects the IMU drift. The output is a continuously updated estimate of body position, velocity, orientation, and angular rate at 500 Hz–1 kHz. **No camera required to balance** — and that's by design, because vision is too slow and too failure-prone to depend on for not falling over. ### Exteroception for terrain To choose *where* to step, the robot needs to see the ground ahead: - **Depth cameras** (Intel RealSense-class stereo/active IR) on the body and pointing down-forward, building a local heightmap of the terrain. - **LiDAR** (often a compact spinning or solid-state unit) for longer range, mapping, and SLAM. ANYmal and Spot lean on LiDAR for autonomous navigation and inspection mapping. - Increasingly, **learned terrain perception** that turns raw depth into a traversability/heightmap the foothold planner consumes. See the [LiDAR & depth cameras guide](/posts/lidar-depth-cameras-ultimate-guide/) for the sensing tradeoffs. The important architectural point: exteroception is *advisory*. The robot blends a perceived heightmap with proprioceptive feedback, and a good controller falls back gracefully to "blind" locomotion (feeling the terrain through the legs) when the camera is blinded by dust, glare, or fog. The best 2026 RL policies are explicitly trained to walk blind and use vision only to anticipate. ## Balance & control: MPC, WBC, and RL The control stack is where the field is moving fastest. Two broad lineages, increasingly blended. 
### Model-based: MPC and whole-body control

The classical, model-based approach reasons explicitly about physics:

- **Model predictive control (MPC)** treats the body as an (often simplified) rigid mass and predicts its motion over a short horizon (say 0.5–1 s), solving an optimization at each tick (~100–500 Hz) for the contact forces that keep it on a desired trajectory while respecting friction-cone constraints (feet can push, not pull, and can't slip). A common simplification is the **single rigid body model** with point-foot contacts.
- **Whole-body control (WBC)** takes MPC's desired body wrench and resolves it into joint torques across all 12 actuators, respecting the full robot dynamics and prioritized tasks (keep the body level, track the swing-foot trajectory, don't exceed torque limits).

This stack is interpretable, tunable, and what Boston Dynamics, ANYbotics, and most academic platforms ran for years. Its weakness is that it's only as good as the model, and modeling contact, compliance, and weird terrain is hard.

### Learning-based: RL trained in sim

The dominant trend since roughly 2019–2022, pioneered heavily on ANYmal at ETH Zurich and now ubiquitous: **train a neural-network control policy in massively parallel physics simulation (Isaac Gym / Isaac Lab and friends), then deploy it on the real robot.** The policy maps proprioceptive state (and optionally a terrain heightmap) directly to joint targets, typically at tens to a few hundred Hz, while the joint-level loop that tracks those targets still runs at ~1 kHz. The appeal is robustness: you simulate thousands of robots across randomized terrain, friction, mass, and disturbances, and the policy learns to handle a distribution of conditions no hand-tuned controller could enumerate.

### The sim-to-real story

The catch is the **reality gap**: a policy that's perfect in sim can fail on hardware because the simulator's contact, friction, actuator dynamics, and latency don't match reality.
The techniques that close it: - **Domain randomization** — randomize masses, friction, motor gains, latency, terrain so the policy can't overfit to one physics. - **Actuator-network modeling** — learn a model of the *real* QDD actuator's torque response (including its quirks) and put that in the sim loop. This was a key ANYmal contribution. - **Teacher–student / privileged learning** — train a "teacher" with full sim knowledge, then distill a "student" that uses only the sensors the real robot has. > Why QDD makes RL practical: the policy outputs torques (or joint targets the joint tracks with torque), and a transparent, near-linear QDD joint behaves enough like the simulated one that domain randomization can bridge the rest. The same RL trick is much harder on stiff, backlash-ridden, non-backdrivable joints whose real dynamics are nasty to model. In 2026 the honest state of the art is hybrid: many production systems use RL for the locomotion controller (robust walking over bad terrain) and keep model-based planning for navigation and manipulation. The RL-everywhere vs model-based-everywhere debate is mostly settled in favor of "use both, at the layer each is good at." ## Power & runtime Legs are hungry, and the battery is heavy, and those two facts fight each other. See the [robot power & batteries guide](/posts/robot-power-batteries-ultimate-guide/) for the chemistry and pack-design details; here's what's specific to legs. ### Where the energy goes A walking quadruped spends energy on three things, roughly in this order: 1. **Holding itself up.** Because QDD joints aren't self-locking, standing and slow walking burns current on gravity compensation — the motors hold torque continuously. This is a big, often underappreciated chunk; a quadruped standing still still draws meaningful power (tens to a couple hundred watts depending on size). 2. 
**Moving the legs.** Accelerating leg masses every step (minimized by low distal inertia) and doing the positive work of propulsion.
3. **Everything else** — compute (a perception/autonomy stack can pull 50–150 W), sensors, comms, heaters/coolers.

The result is the **1–4 hour runtimes** you see in the roster. A 15 kg Unitree Go2 might draw a few hundred watts walking; a 50 kg ANYmal or Spot draws considerably more. CoT of ~0.5 means that for every joule of "useful" gravitational-potential equivalent, you're spending several — most of it as heat in the motors and as standing overhead.

```
Crude runtime estimate:

  t_run ≈ (E_battery · DoD) / P_avg

  E_battery = pack energy [Wh]
  DoD       = usable depth of discharge (~0.8 for Li-ion)
  P_avg     = average power draw [W]

Example: a ~600 Wh pack, DoD 0.8, walking at P_avg ≈ 250 W:
  t_run ≈ (600 · 0.8) / 250 ≈ 1.9 h

Standing idle at P_avg ≈ 120 W:
  t_run ≈ (600 · 0.8) / 120 ≈ 4 h
```

### Hot-swap and docking

For any real deployment, runtime alone doesn't decide uptime — *recharge logistics* do. Two answers:

- **Hot-swappable battery packs** (Spot, ANYmal, Unitree B2) — a field operator or a docking arm swaps a depleted pack for a charged one in under a minute, so the robot is down for seconds, not hours.
- **Autonomous docking** — the robot walks to a charging dock between patrols. For a security or inspection robot doing scheduled rounds, a 90-minute patrol followed by a dock charge is a perfectly workable duty cycle and is how most fleet deployments actually run.

The design tension is permanent: a bigger battery means longer runtime but more mass, which raises power draw (you're carrying it), which eats into the gain. There's a sweet spot, and most commercial quadrupeds have settled near the 1–2 hour mark with swap/dock as the real uptime strategy.

## Bipeds vs quadrupeds

People assume two legs is the "advanced" version of four.
Mechanically and control-wise it's the opposite: **four legs is dramatically easier.** ### Why four is easier than two - **A quadruped can always have a stable base.** During slow gaits it keeps three feet down — an instant stable tripod — and never has to balance on a single contact. A biped, mid-stride, is balancing the entire body on *one* foot, an inherently unstable inverted pendulum. - **The fall problem is gentler.** A quadruped that loses balance often just plants a leg and recovers; a biped that loses balance falls from standing height onto expensive hardware. - **Wider support polygon, lower CoM.** Quadrupeds are long and low; their center of mass sits inside a big support polygon. Bipeds are tall with a small base — far less margin. - **Less actuator stress per joint relative to stability.** Four legs share the body weight and the work; redundancy means a quadruped can limp on three. This is why quadrupeds matured years before humanoids. The actuator technology (QDD), the state estimation (IMU + leg kinematics EKF), the dynamic-gait control (MPC/WBC/RL) — all of it was proven on four legs first. ### The bridge to humanoids The quadruped is the humanoid's training ground. Nearly every component of a 2026 humanoid leg is inherited from quadruped work: the QDD or high-torque-density actuators, the proprioceptive torque control, the sim-trained RL locomotion policies, the contact-aware whole-body control. The hard *new* problems for bipeds — balancing on one foot, the much smaller stability margin, the coupling of locomotion with arm/manipulation dynamics — sit on top of a foundation that quadrupeds built. If you want the upright version of this story, see the [humanoid robot hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/). > If you're learning legged robotics, start with quadrupeds. The physics is the same, the failures are cheaper, and almost everything transfers up to two legs. 
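
To make the support-polygon argument in this section concrete, here is a minimal Python sketch of a static stability margin: the distance from the ground projection of the center of mass to the nearest edge of the polygon spanned by the stance feet. The function name `stability_margin` and every coordinate below are illustrative assumptions, not measurements from any robot, and the check is purely static (it ignores the dynamic balance that real gait controllers rely on).

```python
import math


def stability_margin(com_xy, feet_xy):
    """Signed distance (m) from the CoM ground projection to the nearest
    edge of the support polygon. Positive means inside (statically stable).

    feet_xy must list stance-foot positions in order around a convex
    polygon; either winding direction is accepted.
    """
    n = len(feet_xy)
    # Shoelace sum: positive means the points are counter-clockwise (CCW).
    area2 = sum(
        feet_xy[i][0] * feet_xy[(i + 1) % n][1]
        - feet_xy[(i + 1) % n][0] * feet_xy[i][1]
        for i in range(n)
    )
    pts = list(feet_xy) if area2 > 0 else list(reversed(feet_xy))

    margin = math.inf
    for i in range(n):
        ax, ay = pts[i]
        bx, by = pts[(i + 1) % n]
        ex, ey = bx - ax, by - ay
        # 2D cross product of the edge with (CoM - a): positive when the
        # CoM lies to the left of the edge, i.e. inside a CCW polygon.
        cross = ex * (com_xy[1] - ay) - ey * (com_xy[0] - ax)
        margin = min(margin, cross / math.hypot(ex, ey))
    return margin


# Hypothetical geometry (metres), body frame: x forward, y left.
quad_stance = [(0.3, 0.2), (-0.3, 0.2), (-0.3, -0.2), (0.3, -0.2)]
tripod = [(0.3, 0.2), (-0.3, 0.2), (-0.3, -0.2)]  # front-right foot lifted
biped_foot = [(-0.12, -0.05), (0.12, -0.05), (0.12, 0.05), (-0.12, 0.05)]

com = (0.0, 0.0)
print(f"quadruped, four feet down: {stability_margin(com, quad_stance):.2f} m")
print(f"quadruped, tripod:         {stability_margin(com, tripod):.2f} m")
print(f"biped, single support:     {stability_margin(com, biped_foot):.2f} m")
```

With these made-up dimensions the four-foot stance has a 0.20 m margin against 0.05 m for a single biped foot. Lifting one leg while the CoM stays centered puts the projection exactly on the diagonal edge of the tripod (zero margin), which is why a statically walking quadruped shifts its body over the remaining three feet before lifting the fourth.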
## Applications & honest ROI Strip away the viral dancing-robot videos and the real money is in unglamorous, repetitive, hazardous-or-remote inspection. Here's the honest picture. ### Where quadrupeds actually earn their keep - **Industrial inspection** — substations, oil-and-gas facilities, chemical plants, power generation. A quadruped walks a fixed route, reads gauges (visually), images equipment with thermal and RGB, sniffs for gas, and logs acoustic anomalies — autonomously, on a schedule, in environments built for humans (stairs, catwalks, valve handles at human height). This is ANYmal's and Spot's bread and butter, and it's a real ROI story: the alternative is paying a technician to walk a hazardous route every shift. - **Mapping & survey** — construction-site progress scans (a quadruped + LiDAR doing daily reality-capture), underground mine mapping where GPS is gone and the terrain is bad. - **Security & patrol** — perimeter patrol, especially where the route includes stairs or rough ground that wheeled robots can't do. Ghost Robotics and others target this and defense. - **Research** — by unit count, this is huge. Unitree's pricing put a real dynamic-locomotion platform in hundreds of labs, accelerating the whole field. - **Disaster response & nuclear** — sending a $100k robot into a collapsed structure or a contaminated zone instead of a person. ### The honest ROI caveat Be skeptical of the breathless deployment numbers. The ROI works when **all** of these hold: the route genuinely needs legs (otherwise a cheaper wheeled AMR wins), the inspection is repetitive and frequent enough to amortize the robot, and the autonomy stack is mature enough to run without a babysitter. Many early "deployments" were really pilots with an operator standing nearby. The 2026 reality: inspection-route automation in a handful of heavy industries is genuinely paying off; general-purpose "robot dog does useful work around your facility" is still mostly aspirational. 
Households are not a market yet. A consumer Unitree Go2 is a wonderful research/hobby/education platform and a delightful toy. It is not doing chores. The combination of cost, runtime, manipulation limits (a quadruped with no arm can't *do* much), and safety means the home quadruped is years from a real use case. ## Building or selecting a legged robot ### Off-the-shelf vs DIY For almost everyone, **buy, don't build.** The QDD actuator, the FOC drive firmware, the state estimator, and the locomotion controller each represent years of specialized work. Unless your research *is* one of those layers, you'll get further faster on a commercial platform with an SDK. That said, the DIY path is more open than it's ever been, thanks to the open-source ecosystem the MIT Cheetah work seeded: - **MIT Mini Cheetah / Open Dynamic Robot Initiative (ODRI)** — open hardware designs for QDD legs. - **MJBots** (qdd100 actuators, moteus FOC controllers) — buy modules, build your own quadruped. - **Stanford Doggo / Pupper** — educational open-source quadrupeds at the low end. - **T-Motor AK-series / CubeMars** — affordable QDD-style actuator modules for builders. Building your own teaches you the stack like nothing else, and a basic trot is achievable for a determined team. Matching a commercial platform's robustness, autonomy, and terrain capability is a multi-year program — respect that gap. 
### The cost curve and Unitree's disruption

| Tier | Example | Indicative cost | What you get |
|---|---|---|---|
| Hobby / education | Stanford Pupper, Petoi | ~$500–$2,000 | Learn the basics; limited dynamics |
| DIY QDD build | MJBots / ODRI parts | ~$3,000–$10,000 | Real dynamic legs; you write the stack |
| Consumer-research | Unitree Go2 (base→EDU) | ~$1,600–$16,000 | Capable QDD quadruped + SDK |
| Mid industrial | DEEP Robotics X30, Unitree B2 | ~$50,000–$100,000 | Rugged, real payload, autonomy |
| Premium industrial | Spot, ANYmal, Vision 60 | ~$75,000–$150,000+ | Ecosystem, support, IP-rated, deployments |

The single biggest market event of the last few years was **Unitree collapsing the price floor.** A research-grade dynamic quadruped cost ~$75,000 in 2019 (Spot's launch price). By 2024 a Unitree Go2 base unit (the Go2 Air) was ~$1,600 — a >40× drop. That did to legged-robot research what the Raspberry Pi did to embedded computing: it put real hardware in the hands of anyone with a modest budget and accelerated the entire field, while also detonating a competitive and geopolitical scramble over who supplies the world's robot dogs.

### A selection checklist

> Choosing a quadruped, in order of what actually matters:
> 1. **Does the terrain truly require legs?** If not, stop and buy a wheeled AMR.
> 2. **Payload and sensor integration** — can it carry your inspection package, and does it expose a clean power/data interface?
> 3. **SDK and autonomy maturity** — can it run your mission without a human driver? This is where Spot/ANYmal justify their price.
> 4. **Support, sealing (IP rating), and field serviceability** — industrial deployment lives and dies here.
> 5. **Runtime + recharge logistics** — hot-swap or dock, matched to your duty cycle.
> 6. **Data governance & procurement constraints** — for industrial/government buyers, where the robot (and its data pipeline) comes from is sometimes the deciding factor regardless of specs.
## Frequently asked questions **Why do legged robots use brushless motors instead of regular servos or stepper motors?** Because dynamic legs need torque-controllable, backdrivable, high-power-density actuators, and a brushless DC motor run with field-oriented control delivers exactly that — you command torque directly via current, and a low gear ratio keeps the joint backdrivable. Hobby servos are position-only and not backdrivable; steppers are heavy for their torque and run open-loop. See the [BLDC](/posts/brushless-dc-motors-bldc-ultimate-guide/) and [robot actuators](/posts/robot-actuators-ultimate-guide/) guides. **What does "quasi-direct-drive" actually mean?** A true direct drive has no gearbox — the motor drives the joint directly. That gives perfect transparency but needs an enormous motor for useful torque. Quasi-direct-drive adds a *small* gear reduction (about 6:1 to 10:1) to get usable torque while keeping most of the transparency and backdrivability. It's the pragmatic middle ground, and it's what nearly every modern legged robot uses. **Why is the standard quadruped leg 3 degrees of freedom?** Three actuated joints (hip roll, hip pitch, knee pitch) are the minimum needed to place the foot anywhere in a 3D workspace and still control the body's roll, pitch, and height. Two DoF restricts the leg to a plane and gives up lateral balance; more than three adds weight and complexity for little locomotion benefit on a point-foot leg. **Can a quadruped really balance without cameras?** Yes — and it should. Balance is maintained from proprioception: an IMU plus joint encoders plus foot-contact information, fused in a Kalman filter to estimate body velocity and orientation at ~1 kHz. Cameras and LiDAR are for choosing footholds and navigating, not for staying upright. Good controllers walk "blind" and treat vision as anticipation. 
**Why do these robots need 1 kHz control loops?** Dynamic gaits leave the robot statically unstable, so it stays up by continuously catching itself. The longer the control period, the further the body falls before correction, and the harder (or impossible) the recovery. A ~1 kHz joint torque loop with low, deterministic latency is the practical floor for robust dynamic locomotion — which is why these robots run real-time control systems. See the [real-time control guide](/posts/real-time-control-systems-ultimate-guide/). **How long do quadruped robots run on a charge?** Typically 1–4 hours depending on size, gait, and payload. A small Unitree Go2 might get 1–2 hours; a larger ANYmal or Spot is similar despite a bigger battery because it's heavier and draws more power. Real-world uptime comes from hot-swappable batteries or autonomous docking, not from raw runtime. **Is reinforcement learning replacing model-based control for legged robots?** Partly, and as a complement rather than a clean replacement. RL policies trained in massively parallel simulation (with domain randomization and learned actuator models to bridge the sim-to-real gap) now drive locomotion on many platforms because they're robust to terrain and disturbances. Model-based MPC/WBC remains common, and most production stacks in 2026 use RL for walking and model-based methods for higher-level planning and manipulation. **Why is a quadruped easier to control than a humanoid?** A quadruped can keep three feet on the ground for a stable tripod and never has to balance on a single contact, has a wide support polygon and low center of mass, and recovers from disturbances by planting a leg. A biped is a tall inverted pendulum balancing on one foot for half of every stride. Four legs proved the actuators and control that humanoids now inherit — see the [humanoid hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/). 
**What's the cheapest way to get a real dynamic quadruped?** A Unitree Go2 base unit (the Go2 Air, ~$1,600) is the cheapest capable, dynamics-ready platform with an SDK. If you want to build, MJBots qdd100 actuators with moteus controllers, or the open ODRI/Mini Cheetah designs, get you a real QDD quadruped for roughly $3,000–$10,000 in parts — plus the considerable effort of writing the control stack yourself. **Why not just use wheels with suspension instead of legs?** For most terrain, you should — wheeled and wheel-legged hybrids are more efficient and reliable. Legs only win when the terrain has discrete obstacles (stairs, gaps, large steps) that a wheel fundamentally cannot roll over. The smart middle ground is wheeled legs (wheels on articulated legs) that roll on flat ground and walk only when forced to, recovering much of the energy-efficiency gap. See the [mobile robots guide](/posts/mobile-robots-amr-agv-ultimate-guide/). **Do quadrupeds need force sensors in their feet?** Usually not. With QDD actuators you estimate joint torque from motor current, and the leg Jacobian maps that to foot force — so you get foot-force sensing "for free" without a load cell. Some robots add explicit contact switches or foot sensors for robustness, but the proprioceptive estimate is what most dynamic controllers actually use. **What gear ratio should a leg actuator use?** Roughly 6:1 to 10:1, single-stage planetary. Below ~6:1 you need an impractically large motor for the torque; above ~10:1 you start losing backdrivability and gaining reflected inertia (which scales with the square of the ratio), pushing you back toward the stiff geared-arm regime that's wrong for legs. See the [gearboxes guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/). ## Changelog - **2026-05-19** — Initial publication. 
--- # Mobile Robots: AMRs & AGVs — The Ultimate Guide URL: https://blog.robo2u.com/posts/mobile-robots-amr-agv-ultimate-guide/ Published: 2026-05-16 Updated: 2026-06-20 Tags: mobile-robots, amr, agv, slam, autonomous-navigation, warehouse-automation, differential-drive, lidar-navigation, robotics-hardware, guide Reading time: 38 min > An engineer's deep guide to mobile robots: AGV vs AMR, drive and chassis kinematics, navigation sensing, SLAM, path planning, ISO 3691-4 and R15.08 safety, opportunity charging, fleet software, and how to actually select and deploy a fleet. A mobile robot is the only machine in your facility that decides, on its own, where to put a couple of hundred kilograms of moving mass. Get the chassis, the sensing, and the safety stack right and it threads through a working aisle full of people for years. Get them wrong and you have a 0.3 m/s battering ram with a SLAM map, or — more common — a very expensive robot that sits in a corner because nobody could get it commissioned. This guide is about the machines that move loads around a floor without a human steering them: automated guided vehicles (AGVs) and autonomous mobile robots (AMRs). We will pull apart the real distinction between the two (it is not marketing), walk the drive and chassis configurations and their kinematics, go deep on the navigation sensing and the SLAM that turns LiDAR returns into a pose, cover path planning and fleet traffic, and then get serious about safety standards, charging strategy, the software stack, payload modules, and what deployment actually costs once the demo is over. Real hardware throughout: MiR, OTTO Motors, Locus, Fetch/Zebra, Amazon (Kiva), Geek+, AgileX, Clearpath. **The take**: AMRs won the mid-market because they removed infrastructure, not because they navigate better — a guidewire AGV is more deterministic than any free-roaming AMR will ever be. The engineering question is never "AMR or AGV?" 
in the abstract; it is "how deterministic does this path need to be, how often will the layout change, and who shares the floor?" Answer those three and the chassis, the nav method, and the safety class fall out almost mechanically. Companion reading: [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/), [robot sensors](/posts/robot-sensors-ultimate-guide/), [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/), and [ROS 2](/posts/ros2-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [AGV vs AMR — the real distinction](#agv-vs-amr) 3. [Drive & chassis configurations](#drive-chassis) 4. [Locomotion hardware](#locomotion-hw) 5. [Navigation sensing](#nav-sensing) 6. [SLAM & localization](#slam) 7. [Path planning & traffic](#path-planning) 8. [Safety](#safety) 9. [Power & charging](#power-charging) 10. [Compute & software stack](#compute-stack) 11. [Payload handling & top modules](#payloads) 12. [Deployment realities](#deployment) 13. [Selecting an AMR/AGV](#selecting) 14. [Frequently asked questions](#faq) ## Key takeaways - **AGV vs AMR is about how the path is defined, not about brand.** An AGV follows fixed infrastructure (wire, magnetic tape, reflectors, QR grid) and treats an obstacle as a reason to stop. An AMR carries a map, localizes against it, and *replans* around obstacles. Everything else — sensors, safety, drive — follows from that one choice. - **AMRs ate the mid-market because they killed the infrastructure tax.** No floor cutting, no tape to re-lay when the layout changes. But where throughput is high and the route never changes, a guided AGV is cheaper per pick and more deterministic. Both still ship in 2026. - **Differential drive is the default for a reason.** Two independently driven wheels plus casters: cheapest, simplest kinematics, zero-radius turn. It can't strafe — that's the price. MiR and Fetch are differential; Kiva-style shelf-lifts are differential. 
See [motion planning](/posts/motion-planning-kinematics-ultimate-guide/). - **Omni/mecanum buys you holonomic motion at a real cost.** Mecanum wheels strafe and rotate in place but lose ~15–30% of traction to roller slip, hate debris and floor seams, and wear fast. Use them where lateral docking precision beats efficiency. - **The drive motors are almost always BLDC hub or gearmotors.** Direct-drive hub motors are clean but torque-limited; geared BLDC (planetary, typically 10:1–50:1) is the workhorse. See [BLDC motors](/posts/brushless-dc-motors-bldc-ultimate-guide/), [gearboxes](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/), and [FOC controllers](/posts/motor-controllers-foc-ultimate-guide/). - **There are two LiDARs on a serious AMR, and they do different jobs.** A safety-rated scanner (SICK nanoScan3/microScan3, Pilz PSENscan) at ~15 cm height enforces protective stops and is certified to IEC 61496; a separate nav scanner builds the map. Don't conflate them. See [LiDAR & depth cameras](/posts/lidar-depth-cameras-ultimate-guide/). - **Localization is usually 2D LiDAR SLAM + AMCL against a saved map.** Natural-feature nav (no infrastructure) is the AMR default; reflector, magnetic-tape, and QR-grid nav trade flexibility for sub-centimetre repeatability where you need it. - **Navigation is a two-layer planner.** A global planner finds a route on the map; a local planner (DWB, TEB, MPPI in Nav2) reacts to live obstacles at 10–20 Hz. Fleet traffic management sits above both, handing out reservations so two robots don't claim the same intersection. - **Safety is standards-driven and non-negotiable.** Industrial AGVs fall under ISO 3691-4; AMRs in North America under ANSI/RIA R15.08. Both demand safety-rated scanners, speed-dependent protective fields, and a hardware e-stop. Functional safety is rated to PL d / SIL 2 typically. 
- **Opportunity charging beats battery swap for most fleets.** A robot that tops up 5–10 min at every dwell point can run a 20+ hour duty cycle on a battery sized for ~2 hours of motion. Auto-docking contacts plus a fleet manager that schedules charging is the modern pattern. See [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/). - **The software stack is where deployments live or die.** Onboard nav (often ROS 2 / Nav2), a fleet manager for traffic and jobs, and an integration layer to the WMS/MES. The robot is 30% of the project; the integration is the rest. See [ROS 2](/posts/ros2-ultimate-guide/) and [industrial automation](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/). - **Top modules turn one chassis into many robots.** Conveyor decks, lift tables, tuggers, Kiva-style shelf-lifts, and mounted cobot arms (see [cobots](/posts/collaborative-robots-cobots-ultimate-guide/)) all ride the same base. The payload interface is a real design decision. - **ROI is throughput per dollar, and it hinges on uptime and integration cost, not robot price.** Budget the "integration tax": mapping, commissioning, WMS hooks, traffic tuning, and the change-management of a mixed human/robot floor. ## AGV vs AMR — the real distinction The terms get thrown around as if AMR simply means "newer AGV." It doesn't. The distinction is about **how the vehicle knows where to go and what it does when something is in the way.** An **AGV** follows guidance infrastructure. Classically that was a wire buried in the floor carrying a signal the vehicle tracked; later, magnetic tape stuck to the floor, optical lines, retroreflective targets on walls, or a grid of QR/DataMatrix codes. The path is fixed. When an obstacle appears on that path, a pure AGV stops and waits. It does not go around — it has no concept of "around," because it has no map of free space, only a line to follow. 
An **AMR** carries a map of the environment and continuously estimates its own pose within that map (localization). It is given a goal — a coordinate or a named station — and it computes its own route, then *replans* in real time around obstacles a planner didn't know about. Take a box off the floor and drop it in the aisle: the AGV stops; the AMR steers around it and carries on. > **The clean test**: if removing the floor infrastructure breaks navigation, it's an AGV. If you can pick the robot up, set it down in a mapped building, and it just drives, it's an AMR. ### Why AMRs ate AGVs' lunch The historical AGV cost wasn't the vehicle — it was the **infrastructure tax**. Cutting a wire channel into a finished concrete floor, or laying and maintaining magnetic tape that forklifts shred, costs real money and freezes your layout. Change the racking and you re-lay the guidance. For a facility that reconfigures seasonally, that's a recurring cost and a recurring downtime. AMRs (MiR launched 2015, Fetch and Locus around the same window) removed that. You drive the robot around once to build a map, and you're running. Re-arrange the warehouse and you re-map in an afternoon — no floor work. That flexibility, plus the safety-scanner-driven ability to share aisles with people instead of needing caged lanes, is why AMRs took the mid-market: e-commerce fulfilment, hospitals, electronics assembly, anywhere the layout and the people are fluid. ### Where AGVs still win AGVs are not legacy. Where the route never changes and throughput is high, a guided vehicle is **more deterministic and often cheaper per move**. A wire-guided tugger train running the same loop 24/7 in an automotive plant doesn't benefit from replanning — replanning is a liability you'd rather not have on a fixed high-speed route. Heavy-payload vehicles (counterbalance AGV forklifts moving 1,500 kg pallets) lean toward guided paths because the safety case for a free-roaming 2-tonne vehicle is far harder. 
And Amazon's fulfilment "drives" use a QR-grid floor precisely because a deterministic grid lets thousands of robots run dense, coordinated traffic at speed — that's an infrastructure-guided system by design, not a fleet of free-roaming AMRs.

| Dimension | AGV (infrastructure-guided) | AMR (map-based autonomous) |
|---|---|---|
| Path definition | Fixed: wire, mag-tape, optical, reflector, QR grid | Dynamic: computed on a map, replanned live |
| Obstacle response | Stop and wait | Reroute around it |
| Infrastructure | Floor/wall modification required | None (drive-to-map) |
| Layout change cost | High (re-lay guidance, downtime) | Low (re-map) |
| Determinism / repeatability | Very high (sub-cm on guidance) | Lower (±1–5 cm typical free-nav, tighter with fiducial docking) |
| Sharing space with people | Caged lanes or slow zones, historically | Designed for it (safety scanners, dynamic zones) |
| Throughput on fixed routes | Excellent | Good |
| Per-pick economics on fixed loop | Often lower | Higher (compute, sensing) |
| Typical payload sweet spot | 100–3,000+ kg | 50–1,500 kg |
| Examples | Wire-guided tuggers, counterbalance AGV-forklifts, Amazon QR drives | MiR, OTTO, Locus, Fetch/Zebra, Geek+ |

In practice the line blurs. "Hybrid" vehicles run free-nav in open areas and snap to magnetic tape or fiducials for precise docking. OTTO and MiR vehicles will use floor or wall markers to dock to a conveyor within ±1 cm while navigating naturally everywhere else. The taxonomy is a spectrum of *how much determinism you buy with infrastructure*, not a binary.

## Drive & chassis configurations

The drive configuration sets the robot's kinematics — what motions it can and cannot make — and that ripples into the planner, the docking strategy, and the cost. Pick this first; everything downstream inherits it.
The math here is the planar mobile-robot kinematics covered in the [motion planning guide](/posts/motion-planning-kinematics-ultimate-guide/); here we care about the practical tradeoffs.

### Differential drive

Two independently driven wheels on a common axis, plus one or more passive casters for balance. It is the default for indoor AMRs (MiR, Fetch, Kiva-class shelf-lifts) because it is mechanically dead simple, cheap, and can spin in place — a zero turning radius.

The forward kinematics are clean. With wheel radius `r`, wheel separation (track width) `L`, and left/right wheel angular velocities `ω_L`, `ω_R`:

```
v_L = r · ω_L            # left wheel linear speed
v_R = r · ω_R            # right wheel linear speed

v   = (v_R + v_L) / 2    # body linear velocity (m/s)
ω   = (v_R − v_L) / L    # body angular velocity (rad/s)

# Integrate to get pose (x, y, θ), e.g. each control tick dt:
θ_new = θ + ω · dt
x_new = x + v · cos(θ + ω·dt/2) · dt
y_new = y + v · sin(θ + ω·dt/2) · dt

# Pure spin in place: v_R = −v_L  →  v = 0, ω ≠ 0
```

The cost of all that simplicity: it is **nonholonomic**. It cannot move sideways. To shift 10 cm laterally to dock against a conveyor it must do a little turn-drive-turn dance, which eats time and floor space. For most warehouse work that's fine — you design dock approaches as straight-in.

### Omnidirectional (omni/mecanum)

Mecanum wheels have angled rollers (typically 45°) around the rim; omni wheels have rollers perpendicular to the rolling direction. Drive four of them with the right velocity mix and the chassis becomes **holonomic** — it can translate in any direction and rotate independently, all at once. It can strafe straight into a dock with no maneuvering.
For a four-mecanum chassis with half-track `a` and half-wheelbase `b`, the inverse kinematics (body velocity → wheel speeds) are:

```
# Body command: vx (forward), vy (left), ωz (yaw), wheels at corners
ω_FL = (1/r)·(vx − vy − (a+b)·ωz)
ω_FR = (1/r)·(vx + vy + (a+b)·ωz)
ω_RL = (1/r)·(vx + vy − (a+b)·ωz)
ω_RR = (1/r)·(vx − vy + (a+b)·ωz)
```

The price is steep and physical: the angled rollers slip by design, so you lose roughly 15–30% of available traction and your odometry is noticeably worse than differential. Mecanum wheels also hate floor debris, seams, and ramps — a small bolt jams a roller — and they wear faster. Use omni/mecanum where lateral precision in a tight footprint genuinely pays: machine tending, narrow-aisle docking, mobile manipulation cells. Don't use it for long-haul transport; you're burning energy and tire life for a capability you rarely exercise.

### Steered / swerve drive

Each wheel module both drives and steers (the "swerve" you know from FRC robotics). Two-to-four steered drive modules give holonomic-like motion *without* the roller slip — full traction, good odometry, can translate any direction. The catch is mechanical and control complexity: each module is a drive motor plus a steer motor plus its own controller, and coordinating module heading during transitions is nontrivial. You see this on higher-end heavy AMRs and some outdoor platforms where you want omnidirectionality and traction both.

### Tricycle

One steered+driven front wheel and two passive rear wheels (or the mirror). This is classic AGV-forklift geometry. It's robust and carries heavy loads well, but it has a turning radius (no spin-in-place) and the kinematics put a hard constraint on tight-space maneuvering. Counterbalance AGV-forklifts and many tow-tractors use it.

### Ackermann (car-like)

Front wheels steer like a car, rear wheels drive.
Used almost exclusively on **outdoor** mobile robots and larger yard vehicles (AgileX Hunter/Bunker-class, Clearpath outdoor platforms) where speed and ride quality matter and tight indoor maneuvering doesn't. It has a minimum turning radius set by the wheelbase and max steer angle, so it cannot turn in place — a planner constraint you carry everywhere.

| Drive type | Holonomic? | Spin in place? | Odometry quality | Traction efficiency | Complexity | Typical use |
|---|---|---|---|---|---|---|
| Differential | No | Yes | Good | High | Low | Indoor AMRs, shelf-lifts |
| Omni / mecanum | Yes | Yes | Poor | Low (slip) | Medium | Tight docking, mobile manipulation |
| Swerve (steered) | Near-holonomic | Yes | Good | High | High | Heavy/premium AMRs |
| Tricycle | No | No | Good | High | Low–Med | AGV-forklifts, tuggers |
| Ackermann | No | No | Good | High | Medium | Outdoor / yard robots |

> **Rule**: choose the *least* capable drive that meets your motion requirement. Every step up the holonomy ladder costs traction, money, odometry, or all three. Differential until you can prove you need lateral motion.

## Locomotion hardware

Underneath the kinematics is real metal: motors, gearboxes, wheels, casters, suspension. This is where load capacity, ramp ability, and battery runtime actually get decided.

### Drive motors: hub vs geared BLDC

The drive motors on essentially every modern mobile robot are **brushless DC** (BLDC/PMSM) for the efficiency, torque density, and lifetime — brushes are a maintenance item nobody wants on a 24/7 fleet. See the [BLDC guide](/posts/brushless-dc-motors-bldc-ultimate-guide/) for the motor physics. Two packaging choices:

**Direct-drive hub motors** put the motor in the wheel. Clean, compact, no gearbox to maintain, and quiet. The problem is torque: an outer-rotor hub motor sized to fit a 150 mm wheel struggles to deliver the low-speed torque needed to break away a heavy load or climb a ramp without overheating.
Hub motors suit lighter robots and flat floors.

**Geared BLDC** — a BLDC motor through a planetary reduction, typically **10:1 to 50:1** — is the workhorse. The reduction multiplies torque and lets a small, fast, efficient motor move a heavy robot up a dock ramp. The tradeoff is gearbox losses (a couple of percent per stage), backlash (matters for precise docking), and a wear item. Planetary is standard; for the very high reductions and zero-backlash some precision docking wants, you occasionally see cycloidal — see the [gearbox guide](/posts/gearboxes-harmonic-cycloidal-ultimate-guide/).

Both are driven by **field-oriented control** (FOC) servo drives that give you smooth torque at low speed and clean velocity control for the differential-drive math above. The [motor controller / FOC guide](/posts/motor-controllers-foc-ultimate-guide/) covers the drives; on a mobile robot the controller also feeds wheel-encoder ticks back as odometry, which the SLAM stack fuses with LiDAR.

### Sizing the drive: a torque sanity check

To climb a ramp of grade `α` at acceleration `a`, the driven wheels together must overcome the gravity component, rolling resistance, and inertia:

```
m       = 300 kg     # robot + payload
g       = 9.81 m/s²
α       = 5°         # ramp grade (0.087 rad)
Crr     = 0.02       # rolling resistance coeff (poly wheel on concrete)
a       = 0.5 m/s²   # commanded accel
r_wheel = 0.10 m     # wheel radius
n_drive = 2          # driven wheels

F_total = m·g·sin(α) + Crr·m·g·cos(α) + m·a
        = 300·9.81·0.0872 + 0.02·300·9.81·0.996 + 300·0.5
        ≈ 257 + 59 + 150 ≈ 466 N

T_wheel = F_total · r_wheel / n_drive
        = 466 · 0.10 / 2 ≈ 23.3 N·m per driven wheel
```

That 23 N·m per wheel is what sizes the gearmotor. Note that the ramp term (257 N) is the largest contribution here, but the acceleration term (150 N) is paid on every start and stop, not just on the ramp, so aggressive accel/decel, not slopes, is usually what overheats undersized drives. Always size for the worst-case payload *plus* the accel you actually command, not the nameplate flat-floor figure.

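The torque sanity check above is easy to rerun for other payloads and ramps once it is a function. The following is a minimal Python sketch of the same arithmetic; the function names, the 20:1 ratio, and the 90% gearbox efficiency used in the motor-side estimate are illustrative assumptions, not figures from any datasheet.

```python
import math


def wheel_torque(mass_kg, grade_deg, crr, accel_mps2, wheel_radius_m, n_drive, g=9.81):
    """Return (per-term forces in N, total force in N, torque per driven wheel in N·m)."""
    alpha = math.radians(grade_deg)
    terms = {
        "grade": mass_kg * g * math.sin(alpha),          # gravity component along the ramp
        "rolling": crr * mass_kg * g * math.cos(alpha),  # rolling resistance
        "accel": mass_kg * accel_mps2,                   # inertia
    }
    total = sum(terms.values())
    return terms, total, total * wheel_radius_m / n_drive


def motor_torque(wheel_torque_nm, ratio, efficiency):
    """Torque required at the motor shaft for a given reduction (assumed lumped efficiency)."""
    return wheel_torque_nm / (ratio * efficiency)


# Same inputs as the worked example: 300 kg, 5° ramp, Crr 0.02, 0.5 m/s², 0.10 m wheel, 2 driven wheels
terms, f_total, t_wheel = wheel_torque(300, 5.0, 0.02, 0.5, 0.10, 2)
t_motor = motor_torque(t_wheel, ratio=20, efficiency=0.90)  # hypothetical 20:1 planetary

for name, force in terms.items():
    print(f"{name:8s} {force:6.1f} N")
print(f"total    {f_total:6.1f} N, {t_wheel:.1f} N·m per wheel, {t_motor:.2f} N·m at the motor")
```

Printing the terms separately makes it easy to see which one drives the total for a given ramp and commanded acceleration. The small difference from the hand calculation comes from rounding each term before summing.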
### Wheels, casters, suspension **Drive wheels** are typically polyurethane on an aluminium hub — good `Crr`, quiet, non-marking, decent grip. Hardness (Shore A) trades grip for life and rolling resistance. Pneumatic only shows up outdoors. **Casters** carry the undriven load and define stability. The classic indoor AMR is two center drive wheels plus four corner casters — but that "rocking horse" layout can lift a drive wheel off the floor on an uneven surface, killing traction and odometry. The fix is **suspension**: spring-loaded drive modules that keep both drive wheels loaded with a defined normal force regardless of floor flatness. Any serious AMR (MiR, OTTO) has sprung drive modules. Skipping suspension is the classic cheap-AMR failure on a real, slightly-uneven warehouse floor. **Load capacity** is set by the weakest of: motor/gearbox torque, wheel rating, caster rating, frame stiffness, and — critically — the **safety case** (a heavier robot needs longer stopping distance and bigger protective fields). Published payloads (MiR250 = 250 kg, MiR600 = 600 kg, MiR1350 = 1,350 kg; OTTO 100/600/1500 = 150/600/1,500 kg) are continuous safe ratings, not what the frame survives once. ## Navigation sensing A mobile robot needs to answer two sensing questions continuously: *where am I* (localization) and *what's in front of me right now* (obstacle/safety). Different sensors, often deliberately separate. The full sensor taxonomy is in the [robot sensors guide](/posts/robot-sensors-ultimate-guide/); the ranging physics is in the [LiDAR & depth camera guide](/posts/lidar-depth-cameras-ultimate-guide/). ### The two-LiDAR architecture This trips up newcomers constantly: a serious AMR often has **two different LiDARs doing two different jobs.** The **safety scanner** is a safety-rated 2D LiDAR mounted low (≈10–20 cm above the floor) — SICK nanoScan3/microScan3, Pilz PSENscan, Hokuyo UAM. 
It is certified to **IEC 61496-3** (electro-sensitive protective equipment) and its only job is to enforce protective stops: it watches configurable 2D fields and triggers a hardware-level slowdown or stop when something enters them. It is not primarily a mapping sensor; its data is trusted by the safety controller. Mounting it low catches feet, pallet jacks, and forklift tines. The **navigation scanner** builds and matches the map. It can be the same physical unit on cheaper robots (a safety scanner whose measurement data is *also* fed to SLAM), or a separate non-safety LiDAR. Often it's mounted higher to see over low clutter and pick up stable wall/rack features. > **Why two?** The safety scanner's field must be certified and unchanging; the nav scanner's data can be filtered, downsampled, and fused freely. Conflating safety and perception is how you end up with a robot that's either unsafe or that nuisance-stops constantly. ### 2D vs 3D LiDAR Most indoor AMRs navigate on **2D LiDAR** — a single scanning plane. It's cheap, the data is light, and a 2D map is enough to localize against walls and racking. The blind spot is literal: a 2D plane at 15 cm misses a forklift tine at 40 cm or an overhanging shelf. That's why 2D-LiDAR AMRs add **depth cameras** angled down/forward to catch obstacles off the scan plane — low-hanging, overhanging, or floor-level (a dropped pallet, a step-down). **3D LiDAR** (Ouster, Livox, Hesai — covered in the LiDAR guide) is appearing on outdoor and high-end AMRs where the environment is genuinely three-dimensional and a single plane isn't enough. It costs more and produces far more data to process. Indoors, 2D LiDAR + a couple of depth cameras remains the cost-effective sweet spot in 2026. ### Depth cameras and the rest **Depth cameras** (Intel RealSense-class, stereo, structured-light, ToF) fill the 3D gaps the 2D scanner misses and feed obstacle layers in the costmap. 
**3D ultrasonic / cliff sensors** catch things lasers miss (glass walls, downward stairs/loading-dock edges) — glass is a notorious 2D-LiDAR failure because it passes the beam. **Wheel encoders + IMU** provide odometry that the SLAM filter fuses between LiDAR scans. A robot that relies on LiDAR alone will localize beautifully right up until it drives off a loading dock the laser couldn't see. ## SLAM & localization Two distinct phases get conflated under "SLAM": building the map (mapping) and figuring out where you are in an existing map (localization). Most production AMRs map *once* and then localize against the saved map; full online SLAM runs mainly during commissioning. ### LiDAR SLAM and the map **SLAM** — simultaneous localization and mapping — builds a map while estimating the robot's pose in it, solving the chicken-and-egg problem that you need a map to localize and a pose to map. Indoor AMRs overwhelmingly use **2D LiDAR SLAM**: graph-based scan matching (Google Cartographer, slam_toolbox in ROS 2) that aligns successive scans, builds a pose graph, and runs **loop closure** to correct drift when the robot revisits a known place. The output is an **occupancy grid** — a 2D bitmap where each cell is free, occupied, or unknown, at a resolution like 5 cm/cell. That map is the shared reference for everything: localization matches against it, the global planner routes on it, the costmap inflates obstacles on it. ### AMCL — localizing in a known map Once you have a map, you don't re-run full SLAM — you run **AMCL** (Adaptive Monte Carlo Localization), a particle filter. It scatters hundreds of candidate poses ("particles"), predicts how each would move given the odometry, scores each by how well the live LiDAR scan matches the map at that pose, and resamples toward the high-scoring ones. The particle cloud converges to the true pose and tracks it. 
"Adaptive" means it varies the particle count — more when uncertain (the "kidnapped robot" just powered on), fewer when confident. > **The failure mode to know**: AMCL needs *features*. Put an AMR in a long, featureless corridor or a wide empty floor with no walls in range and the scan matches equally well everywhere along the corridor — localization slides. The fix is environmental: keep stable features in sensor range, or add fiducials in feature-poor zones. ### The navigation method spectrum How a vehicle knows where it is spans a spectrum from zero infrastructure to total infrastructure, trading flexibility for repeatability: - **Natural-feature (free) navigation** — pure map-based SLAM/AMCL, no infrastructure. The AMR default (MiR, Fetch, OTTO). Maximum flexibility; repeatability ±1–5 cm depending on feature richness. - **Reflector navigation** — retroreflective targets surveyed onto walls; the scanner triangulates off them. Classic AGV method, very repeatable (sub-cm), but you must survey and maintain the reflectors. - **Magnetic-tape / magnetic-spot navigation** — tape or embedded magnets in the floor. Dead simple, robust to lighting and dust, but it's a fixed path and the tape wears under forklift traffic. - **QR / fiducial-grid navigation** — a grid of coded markers on the floor; the robot reads them with a downward camera and dead-reckons between. This is the Amazon/Kiva method: extremely deterministic, enables ultra-dense coordinated traffic, but it's an infrastructure-heavy AGV approach. Most real deployments are **hybrid**: natural-feature nav for the open floor, plus a fiducial or magnetic spot at each dock for the last 20 cm of precision where ±1 cm matters and SLAM's ±3 cm doesn't cut it. ## Path planning & traffic Given a goal pose, the robot has to produce safe wheel commands while reacting to a world that changes. The standard architecture is a **two-layer planner** plus a fleet-level coordinator. 
The general planning theory is in the [motion planning guide](/posts/motion-planning-kinematics-ultimate-guide/); the integration glue is in the [ROS 2 guide](/posts/ros2-ultimate-guide/). ### Global planner The **global planner** searches the map for a route from start to goal — A*, Dijkstra, or a state-lattice/Theta* variant — operating on the occupancy grid plus a static costmap (walls inflated by the robot radius, plus keep-out zones and preferred lanes you draw in). It produces a path but doesn't care about dynamic obstacles; it runs at low rate, e.g. on each new goal or every second. ### Local planner The **local planner** turns that global path into actual velocity commands at 10–20 Hz while dodging things the global planner never saw — a person stepping out, another robot, a dropped box. In the ROS 2 **Nav2** stack the choices are: - **DWB** (Dynamic Window Approach, the Nav2 default) — samples feasible `(v, ω)` commands within the robot's dynamic limits, simulates each forward, scores them against the path and obstacles, picks the best. - **TEB** (Timed Elastic Band) — optimizes a trajectory with time, good for car-like/Ackermann constraints and tight spaces. - **MPPI** (Model Predictive Path Integral) — sampling-based MPC, increasingly the choice for smooth, dynamics-aware control on differential and omni bases. The local planner reads a **local costmap** — a rolling window around the robot fused from the safety scanner, nav LiDAR, and depth cameras — with **obstacle inflation** so the robot keeps clearance from its hull, not just its center point. ### Nav2 in one breath Nav2 (the ROS 2 navigation stack) wires this together: a behavior tree orchestrates "compute path → follow path → recover if stuck," the global and local planners are pluggable, AMCL provides the pose, and recovery behaviors (spin, back up, clear costmap, wait) handle the inevitable "I'm wedged" cases. 
It's the de-facto open stack; vendor AMRs run proprietary equivalents with the same shape.

### Fleet & traffic management

One robot is a planning problem; fifty robots is a **traffic** problem. A fleet manager sits above the per-robot planners and prevents the failure modes of independent agents: two robots claiming the same narrow aisle head-on (deadlock), or both arriving at one intersection.

```
# Fleet sizing — back-of-envelope for a transport task
tasks_per_hour  = 120      # demand (moves/hour)
dist_per_task   = 80       # m (avg loaded + return)
avg_speed       = 1.2      # m/s effective (incl. accel/decel/turns)
load_unload     = 30       # s per task (dock + transfer)
charge_overhead = 0.12     # 12% of time charging

travel_time = dist_per_task / avg_speed   # = 66.7 s
cycle_time  = travel_time + load_unload   # = 96.7 s
tasks_per_robot_hr = 3600 / cycle_time × (1 − charge_overhead)
                   = 37.2 × 0.88 ≈ 32.8 tasks/robot/hour

robots_needed = ceil(tasks_per_hour / tasks_per_robot_hr)
              = ceil(120 / 32.8) = ceil(3.66) = 4 robots

# Then add congestion margin: dense traffic erodes effective speed
# 10–25% as robot count rises — size for 5, not 4.
```

The coordinator uses **reservation/zone allocation**: a robot must reserve a path segment or intersection before entering, and the manager grants reservations to avoid conflicts, sometimes with priority rules (loaded beats empty). It also handles charging dispatch and job assignment.

This congestion effect is real and nonlinear — adding robots past a point *lowers* throughput as they queue. Model it; don't just divide demand by per-robot rate.

## Safety

This section is not optional reading. A mobile robot is a moving mass on a floor with people, and the safety case is a legal and ethical requirement, not a feature. The functional-safety background is in the [industrial automation guide](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/); here's what's specific to mobile robots.

### The standards Two regimes dominate in 2026: - **ISO 3691-4** — "Industrial trucks — Driverless industrial trucks and their systems." This is the standard for AGVs and AMRs treated as industrial trucks (the forklift/tugger/heavy lineage), widely referenced in Europe and globally. It specifies stability, control, protective devices, and the safety functions. - **ANSI/RIA R15.08** — the North American standard specifically for **industrial mobile robots (IMRs)**, written for the AMR era. Part 1 covers the robot manufacturer, Part 2 the integrator, Part 3 the user. If you deploy AMRs in the US, R15.08 is your framework. Both require that safety functions reach a rated integrity — typically **Performance Level d (PL d)** per ISO 13849 or **SIL 2** per IEC 62061 for the protective stop. That rating drives the whole sensing/control chain: dual-channel, monitored, with diagnostic coverage. ### Safety-rated scanners and speed zones The enforcer is the **safety-rated LiDAR scanner** (SICK nanoScan3/microScan3, Pilz PSENscan, Hokuyo UAM/SafetyScanner), certified to **IEC 61496-3**, wired into a safety controller — not the navigation computer. It monitors configurable **protective fields**: - A **warning field** (outer) that slows the robot. - A **protective field** (inner) that triggers a safety stop. Crucially, these fields **scale with speed**. At 1.5 m/s the protective field reaches far ahead because the stopping distance is long; as the robot slows for a turn or a tight aisle, the fields shrink so it doesn't nuisance-stop on nearby walls. This **speed-dependent field switching** is the heart of a mobile safety case — the field must always exceed the stopping distance at the current speed. > **Stopping distance is the design driver.** It is `d = v²/(2a) + v·t_react`, where `t_react` includes sensor latency, safety-controller response, and brake engagement. 
A 300 kg robot at 1.5 m/s with 0.7 m/s² braking and 0.2 s reaction needs ≈1.6 m + 0.3 m ≈ 1.9 m of protective field. That number sizes the scanner range and the aisle width. ### E-stop and the rest A **hardware emergency stop** — a physical mushroom button cutting motor power through the safety circuit, independent of software — is mandatory. Add warning lights/sounds (mandated motion indicators in many jurisdictions), and 2D-scanner blind-spot coverage with depth cameras and bumpers. The bumper is the last line: a compliant contact edge that triggers a stop on touch, because no scanner sees everything. Remember the scanner sees a **2D plane**. A forklift tine at 30 cm, an overhanging load, a child's hand reaching down — these are off-plane and the safety scanner misses them. The complete safety case layers the 2D protective field with 3D perception, contact bumpers, speed limits, and zoning. Anyone selling you a single-scanner safety story for a mixed human floor is cutting a corner you'll regret. ## Power & charging Battery and charging strategy decide your fleet's effective availability more than peak speed does. A robot that's charging is a robot that isn't working. The cell chemistry, BMS, and sizing detail is in the [robot power & batteries guide](/posts/robot-power-batteries-ultimate-guide/); here's the mobile-robot-specific strategy. ### Chemistry Modern AMRs run **lithium** — predominantly **LiFePO4 (LFP)** for the cycle life (3,000–6,000 cycles), thermal safety, and tolerance of partial charging, or NMC where energy density matters more than longevity. LFP's flat discharge curve and abuse tolerance make it the fleet default. Lead-acid persists only on legacy/heavy AGVs and is fading — its ~500-cycle life and dislike of partial charging make it a poor fit for the duty cycle below. ### The duty-cycle and opportunity-charging model The old model was **battery swap**: run the battery flat over a shift, swap in a charged one, charge the dead one offline. 
It works but needs spare batteries (capital), a swap station, and labor.

The modern model is **opportunity charging**: the robot tops up in short bursts during natural dwell time — while waiting at a pick station, between jobs, parked for 8 minutes. Because LFP tolerates frequent partial charges, a robot can sustain a 20+ hour effective duty on a battery sized for only ~2 hours of continuous motion, as long as the dwell time and charger placement give it enough top-up windows.

```
# Opportunity-charging duty-cycle sanity check
batt_capacity = 1.5 kWh    # usable
draw_moving   = 250 W      # avg while driving (incl. accessories)
draw_idle     = 40 W       # parked, computer on
charge_rate   = 1500 W     # 1C-ish fast charge at contacts

# In a 60-min window: 40 min moving, 12 min idle-waiting, 8 min charging
energy_out = (40/60)·250 + (12/60)·40 = 166.7 + 8 = 174.7 Wh
energy_in  = (8/60)·1500 = 200 Wh
net        = +25.3 Wh per hour → energy-positive, runs indefinitely

# If you cut charging to 4 min/hr:
energy_in  = (4/60)·1500 = 100 Wh → net −74.7 Wh/hr
# At 1500 Wh usable, runs ~20 hr then must take a long charge.
```

The lesson: it's not battery size, it's the **ratio of charge windows to work**. Design the charger locations so every robot passes a charger during natural dwell, and the fleet runs nearly around the clock on small batteries.

### Auto-docking

**Auto-docking** to a charger closes the loop without human help. The robot navigates to the charger, then uses a fiducial (reflector pattern or AprilTag) for the final precise approach, and engages **contact charging** — sprung blade contacts that mate to floor/wall pads. Contact charging is simpler and cheaper than inductive (wireless) charging, which exists but adds cost and ~10–15% efficiency loss for the convenience of no exposed contacts. The fleet manager schedules charging as just another job, sending robots to chargers based on state-of-charge and demand so the fleet never all charges at once.

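To make the charge-window ratio from the duty-cycle check in this section concrete, here is a small Python sketch of the hourly energy balance. It reuses the numbers from that example; the function names are illustrative, and it follows the example's simplifying assumption that the robot's own draw while on the charger is ignored.

```python
def hourly_net_wh(move_min, idle_min, charge_min,
                  draw_moving_w=250.0, draw_idle_w=40.0, charge_rate_w=1500.0):
    """Net energy per hour in Wh: positive means the battery gains charge."""
    energy_out = move_min / 60 * draw_moving_w + idle_min / 60 * draw_idle_w
    energy_in = charge_min / 60 * charge_rate_w
    return energy_in - energy_out


def runtime_hours(usable_wh, net_wh_per_hour):
    """Hours until the usable capacity is gone; infinite if the balance is not negative."""
    if net_wh_per_hour >= 0:
        return float("inf")
    return usable_wh / -net_wh_per_hour


# 40 min moving, 12 min idle, 8 min charging per hour
net_8 = hourly_net_wh(40, 12, 8)
# Same work, only 4 min charging (as in the example, the freed 4 min are not re-assigned)
net_4 = hourly_net_wh(40, 12, 4)

print(f"8 min/hr charging: {net_8:+.1f} Wh/hr, runtime {runtime_hours(1500, net_8)} h")
print(f"4 min/hr charging: {net_4:+.1f} Wh/hr, runtime {runtime_hours(1500, net_4):.1f} h")
```

Sweeping `charge_min` for a candidate charger layout is a quick way to find the break-even dwell time before committing to battery size.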
## Compute & software stack The mobile robot is a distributed software system on wheels. The stack has three tiers, and the integration between them is where most project risk lives. ### Onboard compute The nav computer is typically an **x86 industrial PC** (for ROS 2 / Nav2 stacks) or an **NVIDIA Jetson** (Orin-class) where GPU perception matters — running the SLAM, costmaps, planners, and sensor drivers. Alongside it sits a **safety controller** (a separate, certified safety PLC) that owns the protective stop and e-stop circuit, *independent of the nav computer* — because you cannot put PL d safety on a general-purpose Linux box. Motor controllers (FOC drives, see the [controller guide](/posts/motor-controllers-foc-ultimate-guide/)) hang off a real-time bus (CAN/EtherCAT), reporting odometry and taking velocity commands. ### The nav stack: ROS 2 / Nav2 Open AMRs and most research/integrator platforms (Clearpath, AgileX, custom builds) run **ROS 2** with **Nav2**. The [ROS 2 guide](/posts/ros2-ultimate-guide/) goes deep; the relevant shape here: sensor drivers publish scans and point clouds, `slam_toolbox` or AMCL provides the pose, Nav2's behavior tree orchestrates planning, and `tf2` keeps every frame (`map → odom → base_link → sensors`) consistent. The `map → odom` transform is AMCL's correction; `odom → base_link` is the wheel/IMU odometry. Get those frames wrong and nothing works — it's the single most common ROS 2 navigation bug. Commercial vendors (MiR, OTTO, Geek+) run proprietary stacks of the same architecture, trading openness for a turnkey, supported, safety-certified product. The choice is build-vs-buy: ROS 2/Nav2 gives flexibility and no license fee at the cost of you owning the integration and the safety certification; a commercial AMR gives you a certified product and a support contract at the cost of a closed stack. 
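The `map → odom → base_link` chain is easier to debug once you see that `map → odom` is not measured directly: the localizer estimates `map → base_link` and backs out the correction. Below is a minimal planar sketch in plain Python (it does not use the tf2 API; `Pose2D` and the example numbers are made up for illustration).

```python
import math
from dataclasses import dataclass


def wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


@dataclass(frozen=True)
class Pose2D:
    """Planar rigid transform parent -> child: translation (x, y) and yaw theta in radians."""
    x: float
    y: float
    theta: float

    def compose(self, child):
        """Chain transforms: (a -> b).compose(b -> c) gives a -> c."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        return Pose2D(self.x + c * child.x - s * child.y,
                      self.y + s * child.x + c * child.y,
                      wrap(self.theta + child.theta))

    def inverse(self):
        """Invert the transform: (a -> b).inverse() gives b -> a."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        return Pose2D(-(c * self.x + s * self.y),
                      s * self.x - c * self.y,
                      wrap(-self.theta))


# Wheel/IMU odometry says the robot drove 2 m straight since start (odom -> base_link).
odom_to_base = Pose2D(2.0, 0.0, 0.0)
# The localizer says the robot is at (5, 3) facing +y in the map (map -> base_link).
map_to_base = Pose2D(5.0, 3.0, math.pi / 2)

# The published correction: map -> odom = (map -> base_link) ∘ (odom -> base_link)⁻¹
map_to_odom = map_to_base.compose(odom_to_base.inverse())

# Chaining the two published transforms must reproduce the localizer's estimate.
check = map_to_odom.compose(odom_to_base)
print(map_to_odom)
print(check)
```

If the chained result does not match the localizer's pose on a real robot, a swapped parent/child or an inverted transform is a likely culprit.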
### Fleet manager and WMS/MES integration Above the robots, the **fleet manager** handles traffic, job assignment, charging, and a map shared across the fleet (MiR Fleet, OTTO Fleet Manager, Locus, or open frameworks like Open-RMF). It exposes an API the higher systems drive. The top tier is your **WMS/MES** (warehouse/manufacturing execution system). This is the integration that makes the fleet do useful work: the WMS knows there's a pick at location A4 destined for pack station 3, and it must hand that as a job to the fleet manager, get status back, and reconcile inventory. That integration — message formats, error handling, what happens when a robot can't reach a station, how a human-cancelled job propagates — is the bulk of the engineering effort and the bulk of the project risk. The [industrial automation guide](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/) covers the PLC/SCADA/MES world the fleet plugs into. **VDA 5050** is the emerging standard interface between fleet managers and mixed-vendor AMRs — worth specifying if you ever want multi-vendor fleets. ## Payload handling & top modules The chassis is a transport base; what it carries is the **top module**, and a single base platform usually supports several. This modularity is a core economic argument for AMRs — one validated, safety-certified base, many jobs. ### The common modules - **Flat top / shelf** — the simplest: a deck you set a tote or bin on, or where a human loads/unloads. Locus and many fulfilment AMRs are essentially mobile shelves a picker walks to. - **Conveyor deck** — a powered roller/belt top that auto-transfers a tote to/from a fixed conveyor or another robot. Removes the human from the transfer; demands precise docking (±1 cm) to line up the rollers. - **Lift / jacking module** — a vertical lift table that raises a load, or the **shelf-lift** (Kiva/Amazon, Geek+) that drives *under* a mobile rack, lifts it, and carries the whole shelf to a human picker. 
The shelf-lift model, "goods-to-person", was Kiva's revolution (Amazon acquired the company in 2012): instead of pickers walking miles, the shelves come to them. It needs a structured floor (the QR grid) and a fleet manager doing dense coordination.
- **Tugger / tow** — a hitch that pulls one or more passive carts (a "tugger train"). High effective payload (tow several hundred kg of cart) on a modest base; the AGV-classic for line-side delivery in automotive/manufacturing. OTTO and many AGVs offer tow variants.

### Mounting a cobot arm

Put a **collaborative arm** on a mobile base and you get a **mobile manipulator** — a robot that can both drive to a location *and* do dexterous work there (machine tending, pick-and-place across a cell, sample handling in a lab). The arm is usually a cobot (UR, Doosan, Techman — see the [cobots guide](/posts/collaborative-robots-cobots-ultimate-guide/)) precisely because the combined system shares space with people and the cobot's force-limiting safety complements the base's scanner safety.

> **The hard part of mobile manipulation is the base pose.** A ±3 cm base localization error is fine for transport but is a disaster for a 6-DoF grasp — the arm's working envelope can't absorb it. The standard fix: drive to a rough pose, then use the arm's wrist camera (visual servoing) or a fiducial to refine the actual base/target transform before the grasp. The base gets you to the neighborhood; vision closes the last centimetres.

Payload interface matters mechanically too: a 10 kg arm reaching out 1 m puts a real overturning moment on the base, so a manipulation AMR needs a wider stance, lower CoG, and a stiffer frame than a pure-transport robot of the same payload.

## Deployment realities

The demo always works. The deployment is where reality charges its tax. Here's what actually consumes the budget and the timeline.

### Mapping and commissioning

Mapping is fast — drive the building once, save the occupancy grid, a few hours. **Commissioning** is not.
It's defining keep-out zones, drawing preferred lanes and one-way aisles, placing and surveying docking fiducials, tuning protective-field sizes against real aisle widths, setting speed zones, validating the safety case with the actual robot at actual speed, and integrating the WMS jobs. Budget weeks, not days, for a non-trivial fleet — and budget a safety assessor's time.

### Mixed human/robot floors

The single biggest operational reality is that warehouses are full of **people, forklifts, and chaos** the planner didn't model. Pallets get left in aisles. A forklift cuts off a robot. Someone stacks boxes against a wall the map says is clear, and AMCL gets confused. People learn to "bully" robots (they always yield, so people walk right at them and the robot freezes). These aren't bugs; they're the environment. Mitigations: clear AMR lanes where you can, train staff, set realistic protective fields (too conservative = constant freezing = people lose faith and unplug the robots), and accept that throughput in a shared aisle is lower than a caged route.

### The integration tax and ROI

The robot's purchase price is a minority of the project. The **integration tax** is WMS/MES hookup, network/Wi-Fi coverage (AMRs need reliable coverage along every route — dead spots cause stalls), charger infrastructure, fiducials, commissioning labor, safety assessment, and staff training.

```
# Illustrative 5-robot AMR project cost split
robots (5 × $45k)          = $225k   # ~45%
fleet manager + licenses   = $40k    # ~8%
WMS/MES integration        = $90k    # ~18%  <- the tax
commissioning + safety     = $60k    # ~12%
charging + infrastructure  = $35k    # ~7%
Wi-Fi / network upgrade    = $30k    # ~6%
training + contingency     = $20k    # ~4%
                             ---------
total                      = $500k   # robots are < half
```

ROI is throughput-per-dollar over the system life, dominated by **labor displaced/redeployed and uptime**.
The math works when the robots run a high duty cycle on a stable task; it fails when the task changes constantly (re-commissioning eats the savings) or when nuisance-stops and integration gaps keep effective utilization low. The honest payback on a well-matched warehouse fleet is typically **1.5–3 years**; a poorly-matched one never pays back because utilization never reaches the model. > **Rule**: the project succeeds or fails on *utilization*, not robot count. A fleet at 85% utilization on a stable task beats a bigger fleet at 40% utilization every time. Spend the engineering on the integration and the floor, not on buying more robots. ## Selecting an AMR/AGV Selection collapses to three questions, in order. Get these right and the rest is comparison shopping. ### The three questions 1. **Payload and form** — what are you moving, how heavy, how big? A 30 kg tote is a different robot from a 1,200 kg pallet. This sets the chassis class and largely the vendor shortlist. 2. **Environment** — indoor/outdoor, floor flatness, aisle width, ramps, who shares the space, how often the layout changes. This sets drive type (differential indoors, Ackermann outdoors), nav method (free-nav for changing layouts, guided for fixed high-throughput), and the safety class. 3. **Throughput** — moves per hour, distances, dwell time. This sets fleet size (with the congestion margin from the [path planning](#path-planning) section) and charging strategy. > **Decision shortcut**: *Changing layout + shared with people + moderate throughput* → free-nav AMR (MiR/OTTO/Fetch class). *Fixed high-throughput loop + heavy payload* → guided AGV. *Goods-to-person fulfilment at scale* → Kiva/Geek+ shelf-lift on a structured floor. *Drive + dexterous work* → mobile manipulator (AMR base + cobot). 
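Question 3 is the most mechanical of the three, so it is worth scripting. The sketch below redoes the fleet-sizing arithmetic from the fleet and traffic management section in Python and adds a crude congestion knob; modelling congestion as a flat reduction in effective speed is an assumption made here for illustration, and real congestion depends on robot count and layout.

```python
import math


def robots_needed(tasks_per_hour, dist_per_task_m, avg_speed_mps, load_unload_s,
                  charge_overhead, congestion_slowdown=0.0):
    """Return (robot count, tasks per robot per hour).

    congestion_slowdown is the assumed fractional loss of effective speed (0.0 to <1.0).
    """
    effective_speed = avg_speed_mps * (1 - congestion_slowdown)
    cycle_s = dist_per_task_m / effective_speed + load_unload_s
    per_robot = 3600 / cycle_s * (1 - charge_overhead)
    return math.ceil(tasks_per_hour / per_robot), per_robot


# Inputs from the back-of-envelope example: 120 moves/hr, 80 m, 1.2 m/s, 30 s, 12% charging
n_free, rate_free = robots_needed(120, 80, 1.2, 30, 0.12)
# Same task with effective speed eroded by 25% (upper end of the 10–25% range quoted there)
n_busy, rate_busy = robots_needed(120, 80, 1.2, 30, 0.12, congestion_slowdown=0.25)

print(f"no congestion : {n_free} robots ({rate_free:.1f} tasks/robot/hr)")
print(f"25% slowdown  : {n_busy} robots ({rate_busy:.1f} tasks/robot/hr)")
```

With the example inputs the uncongested answer is 4 robots and the 25% case is 5, which matches the "size for 5, not 4" advice. For a serious deployment, a discrete-event simulation of the actual layout is a better guide than a single slowdown factor.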
### Real-product comparison

Representative platforms across the classes (figures are nominal published specs; confirm against current datasheets before you commit):

| Platform | Class | Payload | Drive | Nav method | Top speed | Notable |
|---|---|---|---|---|---|---|
| MiR250 | Indoor AMR | 250 kg | Differential | Free-nav (2D LiDAR SLAM) | ~2.0 m/s | Compact, large module ecosystem |
| MiR600 / 1350 | Heavy indoor AMR | 600 / 1,350 kg | Differential | Free-nav, IP52 | ~1.2–2.0 m/s | Pallet-class, ISO 3691-4 |
| OTTO 100 / 600 / 1500 | Indoor AMR | 150 / 600 / 1,500 kg | Differential | Free-nav | ~2.0 m/s | Heavy-duty, strong fleet mgr |
| Fetch / Zebra (e.g. FlexShelf/Freight) | Fulfilment AMR | ~50–1,500 kg (range) | Differential | Free-nav | ~1.5 m/s | Now Zebra; warehouse focus |
| Locus (LocusBots) | Goods-to-person assist | tote-class | Differential | Free-nav | ~1.5–2.0 m/s | Picker-following model |
| Amazon (Kiva) drive | Shelf-lift AGV | ~450–1,300 kg shelf | Differential | QR-grid (structured floor) | ~1.7 m/s | Dense coordinated fleet |
| Geek+ P-series | Goods-to-person shelf-lift | ~600–1,000 kg | Differential | QR-grid / fiducial | ~1.5 m/s | Kiva-style, large installs |
| AgileX (Scout/Bunker/Hunter) | Outdoor / research base | ~50–150 kg | Diff / tracked / Ackermann | Configurable (ROS) | ~1.5–4.5 m/s | Dev platforms, outdoor-capable |
| Clearpath (Husky/Jackal/Dingo) | Research / outdoor | ~20–75 kg | Diff / mecanum | ROS 2, BYO nav | ~1–2 m/s | R&D, sensor integration |

A note on context: companies like iRobot proved the consumer end of mobile autonomy (Roomba's vacuum-class SLAM and bump-and-coverage navigation) a decade before warehouse AMRs matured — different scale and safety case, same core problem of localizing and covering a space without infrastructure. The warehouse AMR is that consumer lineage grown up, hardened, and wrapped in a PL d safety case.

> **Final rule**: don't buy the robot with the best spec sheet; buy the robot whose *vendor support and software maturity* match your team's integration capability. A team without ROS 2 depth should buy a turnkey commercial AMR; a team with strong robotics engineers can extract more value (and lower cost) from a ROS 2/Nav2 platform — but owns the integration and the safety case. The spec sheet is the easy 20% of the decision. ## Frequently asked questions **What is the actual difference between an AGV and an AMR?** An AGV follows fixed guidance infrastructure (wire, magnetic tape, reflectors, QR grid) and stops when an obstacle blocks its path. An AMR carries a map, localizes against it, and replans around obstacles autonomously. The test: if removing the floor/wall infrastructure breaks navigation, it's an AGV; if you can set it down in any mapped building and it drives, it's an AMR. See the [comparison section](#agv-vs-amr). **Are AGVs obsolete now that AMRs exist?** No. AGVs remain cheaper and more deterministic on fixed, high-throughput routes and for heavy payloads where a free-roaming safety case is hard. Amazon's massive fulfilment fleets run on a QR-grid (an infrastructure-guided system) precisely because determinism enables dense coordinated traffic. Choose by how often the layout changes and how deterministic the route must be. **Why is differential drive so common when omni/mecanum can strafe?** Differential drive is the cheapest, simplest, most efficient configuration that can still turn in place, and its odometry is good. Mecanum's holonomic motion costs 15–30% of traction to roller slip, degrades odometry, hates floor debris and seams, and wears faster. You pay a lot for lateral motion you rarely need. Use mecanum only where tight-footprint lateral docking genuinely pays. 
**Do AMRs use one LiDAR or two?** Serious ones often use two with different jobs: a **safety-rated scanner** (SICK/Pilz/Hokuyo, certified to IEC 61496) mounted low to enforce protective stops, and a **navigation scanner** for SLAM/localization. Cheaper robots feed the safety scanner's measurement data to nav too, but the safety *function* is always isolated in a certified controller, never on the nav PC.

**What sensors does an indoor AMR actually carry?** Typically a safety-rated 2D LiDAR (low), often a second nav LiDAR, two-plus depth cameras to catch off-plane obstacles (overhangs, low loads, floor edges), wheel encoders and an IMU for odometry, plus ultrasonic/cliff sensors for glass walls and dock edges that lasers miss. 3D LiDAR appears on outdoor/high-end units. See the [LiDAR & depth camera guide](/posts/lidar-depth-cameras-ultimate-guide/).

**How does an AMR know where it is — what's SLAM vs AMCL?** SLAM (e.g. Cartographer, slam_toolbox) builds the map and estimates pose simultaneously, usually run once at commissioning. AMCL is a particle-filter localizer that tracks pose within a *saved* map at runtime — it scatters pose hypotheses, scores them against the live LiDAR scan, and converges. AMCL needs visible features; long featureless corridors are its classic failure mode.

**What safety standards apply to mobile robots?** Industrial trucks/AGVs fall under **ISO 3691-4**; AMRs (industrial mobile robots) in North America under **ANSI/RIA R15.08**. Both require safety-rated scanners with speed-dependent protective fields, a hardware e-stop, and protective-stop functions rated to PL d (ISO 13849) or SIL 2 (IEC 62061). The safety function must live in a certified controller separate from the nav computer.

**Why do safety fields change size while the robot moves?** Because stopping distance grows roughly with the square of speed: `d ≈ v²/(2a) + v·t_react`, a quadratic braking term plus a linear reaction-time term. At 1.5 m/s a robot needs ~2 m of protective field ahead; at 0.3 m/s it needs only a fraction of that.
Speed-dependent field switching keeps the field always larger than the current stopping distance while avoiding nuisance stops on nearby walls when the robot is slow. **Battery swap or opportunity charging?** Opportunity charging wins for most fleets. With LFP cells (tolerant of frequent partial charges), a robot topping up during natural dwell can run 20+ hours on a battery sized for ~2 hours of motion, with no spare batteries, swap station, or labor. Battery swap survives on heavy/legacy AGVs. The key variable is the ratio of charge windows to work — place chargers where robots naturally dwell. See [robot power & batteries](/posts/robot-power-batteries-ultimate-guide/). **Can I put a robot arm on a mobile base?** Yes — that's a mobile manipulator, usually an AMR base plus a collaborative arm (UR, Doosan, etc. — see the [cobots guide](/posts/collaborative-robots-cobots-ultimate-guide/)). The hard part is base-pose precision: ±3 cm localization is fine for transport but ruins a 6-DoF grasp, so you drive to a rough pose then use the wrist camera or a fiducial to refine the transform before manipulating. The base also needs a wider stance and stiffer frame to handle the arm's reach moment. **How many robots do I need?** Compute per-robot tasks/hour from cycle time (travel + load/unload) minus charging overhead, divide demand by that, then **add a congestion margin** (10–25%) because dense traffic erodes effective speed nonlinearly. See the fleet-sizing calc in [path planning](#path-planning). Always size for utilization, not just demand divided by per-robot rate. **What's the real cost driver in an AMR deployment?** Not the robots — they're often under half the project. The **integration tax** dominates: WMS/MES integration, commissioning, safety assessment, charger and Wi-Fi infrastructure, and training. Plan for it. Projects fail on low utilization (constant re-commissioning, nuisance stops, integration gaps), not on robot price. 
Well-matched warehouse fleets pay back in roughly 1.5–3 years. ## Changelog - **2026-05-16** — Initial publication. --- # Real-Time Robot Control Systems: The Ultimate Guide URL: https://blog.robo2u.com/posts/real-time-control-systems-ultimate-guide/ Published: 2026-05-14 Updated: 2026-06-20 Tags: real-time-control, rtos, preempt-rt, ethercat, control-loop, embedded-systems, determinism, jitter, robotics-hardware, guide Reading time: 36 min > A deep, practical guide to real-time robot control: hard vs soft real-time, jitter and worst-case latency, RTOS vs PREEMPT_RT Linux, the MCU/SBC split, EtherCAT and distributed clocks, ros2_control, and how to design and validate a deterministic control loop. A robot is a real-time system pretending to be a computer. The arm does not care that your laptop can do 40 GFLOPS; it cares that the current command for joint 4 arrives every 1.000 ms, on time, every single time, for the next eight hours. Miss one, and at best you get a velocity bump you can feel. Miss a few in a row at the wrong moment and you get a torque spike, a tripped drive, or a 30 kg payload going somewhere it should not. This guide is about closing the control loop *on time*. Not fast — on time. Those are different properties, and conflating them is the single most common mistake engineers make when they first build a robot controller. We will define real-time precisely, walk the multi-rate control hierarchy from the kHz current loop down to the 10 Hz planner, dig into where jitter actually comes from and how to measure it with `cyclictest`, compare the RTOS landscape against real-time Linux, settle the MCU-versus-SBC argument, and get concrete about EtherCAT distributed clocks, real-time code discipline, `ros2_control`, and time synchronization. Then we will design and validate a system end to end. **The take**: Real-time is about *worst-case* latency and bounded jitter, not throughput or average speed. 
The hard part of a robot is not making something happen quickly once — it is guaranteeing it happens within a deadline a million times in a row. The winning architecture in 2026 is almost always a split one: a microcontroller or smart drive holds the hard real-time current/torque loop at kHz with sub-microsecond jitter, while a Linux SBC running PREEMPT_RT handles the soft, complex, compute-heavy stuff — kinematics, planning, perception — and the two talk over a deterministic fieldbus like EtherCAT. Stop trying to run a 1 kHz torque loop in a ROS 2 node on stock Ubuntu. Put it where determinism is cheap. Companion reading: [motor controllers & FOC](/posts/motor-controllers-foc-ultimate-guide/), [motion planning & kinematics](/posts/motion-planning-kinematics-ultimate-guide/), [industrial automation, PLC, SCADA & fieldbus](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/), [ROS 2](/posts/ros2-ultimate-guide/), and [robot sensors](/posts/robot-sensors-ultimate-guide/). ## Table of contents 1. [Key takeaways](#tldr) 2. [What "real-time" actually means](#what-rt-means) 3. [Why robots need real-time](#why-robots) 4. [The robot control hierarchy and its rates](#hierarchy) 5. [Latency, jitter, and determinism](#latency-jitter) 6. [The RTOS landscape](#rtos-landscape) 7. [Real-time Linux](#rt-linux) 8. [The hardware split: MCU vs SoC/SBC](#hardware-split) 9. [Real-time fieldbuses](#fieldbus) 10. [Writing real-time code](#rt-code) 11. [ros2_control and real-time ROS 2](#ros2-control) 12. [Time sync and multi-rate coordination](#time-sync) 13. [Designing and validating a real-time system](#design-validate) 14. [Frequently asked questions](#faq) ## Key takeaways - **Real-time means deterministic, not fast.** A system is real-time if it meets its deadlines, every time, with a *bounded* worst case. A 100 MHz MCU with 2 µs jitter beats a 5 GHz CPU with 5 ms jitter for closing a control loop. - The metric that matters is **worst-case latency**, not average. 
Averages lie. A loop that runs in 80 µs on average but spikes to 4 ms once a minute is a broken 1 kHz loop. - **Hard / firm / soft** real-time differ by what a missed deadline costs: hard (catastrophe — torque loop, safety), firm (result is useless but no disaster), soft (degraded quality — perception, UI). Classify every loop in your robot before you choose hardware. - Robot control is a **multi-rate hierarchy**: current/torque loop at 10–40 kHz on the MCU or drive, joint/impedance control at 1–4 kHz, whole-body/MPC at 100 Hz–1 kHz, motion planning at 1–100 Hz, perception at 10–60 Hz. The fast loops live closest to the metal. - **Jitter is the enemy.** Its sources are interrupts, cache and TLB misses, scheduler decisions, SMIs, power management (C-states, frequency scaling), and memory contention. On stock Linux a single loop can see millisecond spikes; tuned PREEMPT_RT gets you into the tens of microseconds. - For MCUs, **bare-metal** gives the lowest, most predictable latency; an **RTOS** (FreeRTOS, Zephyr, RTEMS, VxWorks, QNX) buys you structure, drivers, and preemptive priority scheduling at the cost of a few microseconds of overhead. - **PREEMPT_RT is now mainline** (merged into the Linux 6.12 kernel in late 2024) and is genuinely good — tuned hardware delivers worst-case scheduling latency in the **10–50 µs** range. It is "good enough" for 1 kHz loops, not for a 20 kHz current loop. That belongs on silicon. - **EtherCAT won motion control** because of distributed clocks: every slave is synchronized to a shared clock with **< 1 µs** skew across the network, and a frame can service dozens of axes in tens of microseconds. CANopen still rules cost-sensitive and CiA 402 servo applications at lower rates. - **Real-time code has rules**: no `malloc`/`free`, no blocking syscalls, no unbounded loops, no page faults (lock memory with `mlockall`), use `SCHED_FIFO`/`SCHED_DEADLINE`, and use priority inheritance mutexes to defeat priority inversion. Know your WCET. 
- **ROS 2 nodes are not hard real-time**, but a real-time control loop can live inside a `ros2_control` controller manager thread running `SCHED_FIFO` on an isolated, shielded core — provided you keep DDS and allocation off the hot path. - **Synchronize your clocks.** PTP/IEEE 1588 gets distributed nodes to sub-microsecond agreement; EtherCAT distributed clocks do it on the bus. Timestamp sensor data at the source, not when ROS receives it. - **Validate, do not assume.** Run `cyclictest` for hours, log your loop's actual period and overrun count in production, and size your deadline budget with margin. A real-time system you have not measured under load is just a hopeful one. ## What "real-time" actually means Let me kill the most expensive misconception first: **real-time does not mean fast.** It means *on time*. A real-time system is one whose correctness depends not only on producing the right answer but on producing it within a defined deadline. A result that is correct but late is, by definition, a wrong result. This reframes everything. The question is never "how fast can this run?" It is "can this *guarantee* it finishes before the deadline, in the worst case, under worst-case load?" Throughput is a best-case, average-case concern. Real-time is a worst-case discipline. > **Rule**: In real-time engineering, the average is marketing and the maximum is truth. Always quote and budget against worst-case latency. ### Determinism is the property you actually want The technical word for "on time, every time" is **determinism**: the same input under the same conditions produces the same timing behavior, within a bounded window. A deterministic system has a worst-case execution time (WCET) and a worst-case response time you can actually compute or measure and rely on. 
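To see why the worst case, not the average, is the number to budget against, here is a small sketch using the figures from the key takeaways: a 1 kHz loop that usually takes 80 µs but spikes to 4 ms once a minute. The samples are synthetic and the helper `summarize` is hypothetical; in practice the numbers would come from logged loop timings.

```python
from statistics import mean

DEADLINE_US = 1000  # period of a 1 kHz loop, in microseconds


def summarize(samples_us, deadline_us=DEADLINE_US):
    """Return average, worst case, and deadline-miss count for loop timings."""
    return {
        "avg_us": mean(samples_us),
        "max_us": max(samples_us),
        "misses": sum(1 for s in samples_us if s > deadline_us),
    }


# One minute at 1 kHz: 59,999 iterations at 80 us plus a single 4 ms spike.
samples = [80] * 59_999 + [4000]
stats = summarize(samples)

print(f"avg = {stats['avg_us']:.2f} us")   # average barely moves from 80 us
print(f"max = {stats['max_us']} us")       # worst case is 4000 us
print(f"missed deadlines = {stats['misses']}")
```

The average stays within a tenth of a microsecond of 80 µs, yet the single 4 ms sample blows through the 1000 µs deadline. Only the maximum and the miss count reveal it.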
A modern application CPU is built to maximize *average* throughput, and almost every trick it uses — out-of-order execution, deep speculation, multi-level caches, branch prediction, dynamic frequency scaling — trades determinism for speed. The result is a processor that is blisteringly fast on average and wildly variable instant to instant. That variability is poison for a control loop.

A humble Cortex-M microcontroller running from tightly-coupled memory with the caches off does far less per second, but it does it with timing you can predict to the clock cycle. For closing a current loop, predictable beats fast every time.

### Hard, firm, and soft real-time

The taxonomy comes down to what a missed deadline costs you:

| Class | Missed deadline means | Examples in a robot | Typical home |
|---|---|---|---|
| **Hard** | Catastrophic / safety failure | Current/torque loop, motor commutation, safety-rated stop, brake control | MCU, smart drive, FPGA |
| **Firm** | Result is useless but no disaster; degrades performance | Sensor fusion frame dropped, a single missed servo update | RTOS or PREEMPT_RT |
| **Soft** | Quality degrades, value decays with lateness | Perception pipeline, path planning, teleop video, UI | Stock Linux, soft-RT threads |

The mistake is to treat the whole robot as one class. A real robot is a *mix*: the commutation is hard, the impedance loop is firm-to-hard, the planner is soft, the GUI is best-effort. You architect each loop according to its class, and you spend your determinism budget where the cost of missing is highest.

### Real-time is not a kernel feature you switch on

There is no checkbox. Real-time is an end-to-end property of the *entire* path from sensor edge to actuator command: the interrupt latency, the scheduler, the driver, the bus, the application code, even the power-management settings in firmware.
A single non-deterministic component anywhere in that chain — a `malloc` in the hot loop, a network stack with unbounded retries, a CPU dropping into a deep C-state — destroys the determinism of the whole thing. You are only as real-time as your worst link. ## Why robots need real-time A robot is fundamentally a feedback control machine. It reads the world (encoders, IMUs, force sensors), computes a correction, and commands actuators — over and over, forever. Feedback control theory assumes a *fixed sample period* `T`. Your gains, your stability margins, your filter coefficients are all derived assuming the loop closes exactly every `T` seconds. The mathematics of a discrete PID or a state-space controller is built on that assumption. Break the assumption and you break the control. If the loop period wanders, your effective derivative gain wanders with it (D term divides by `dt`), your integrator accumulates wrong, and your phase margin erodes. Enough jitter and a perfectly-tuned loop oscillates or goes unstable. See the cascade structure in the [motor controllers & FOC guide](/posts/motor-controllers-foc-ultimate-guide/) — every one of those nested loops assumes a steady rate. ### What happens when a 1 kHz current loop misses Take a concrete case: a field-oriented current loop at 1 kHz on a motor drive, the inner loop of a servo (covered in depth in the [FOC guide](/posts/motor-controllers-foc-ultimate-guide/)). Its job is to regulate phase current — and therefore torque — by updating PWM duty every 1.000 ms based on the latest current measurement and rotor angle. Now it misses an update. For one period the PWM holds the *old* duty cycle. The rotor has moved; the dq-frame angle is stale; the d-axis and q-axis currents are no longer decoupled correctly. You inject a current component you did not intend. Best case: a small torque ripple and audible tick. 
Worse case: with the motor spinning fast, a 1 ms stale angle at, say, 3000 rpm is roughly 18 mechanical degrees — enough to push current well off-axis, spike phase current, trip the drive's overcurrent protection, and fault the axis mid-motion. Now imagine it is not the current loop but the *safety* path — the loop that watches a force-torque sensor and must command a stop within a deadline. A missed deadline there is not a tick; it is a person. > **Rule**: For a hard real-time loop, design so a single missed deadline is detected and handled (hold last command, fault safe), and so missing two in a row is impossible under your latency budget. Never assume misses do not happen — assume they do and bound the blast radius. ### The multi-rate reality No single rate fits a robot. You cannot run perception at 20 kHz (the camera does not produce frames that fast and the compute would melt), and you cannot run a current loop at 30 Hz (the motor would be uncontrollable). So robots are **multi-rate**: a hierarchy of nested loops running at rates spanning four orders of magnitude, each feeding setpoints to the loop beneath it. Getting the rates right, and putting each loop on the right hardware, is most of the architecture battle. ## The robot control hierarchy and its rates Think of robot control as a pyramid. The fast, simple, hard-real-time loops sit at the bottom, closest to the actuators. The slow, complex, compute-heavy, soft-real-time layers sit at the top. Each layer issues setpoints to the layer below at a rate the lower layer can absorb. 
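The layering can be sketched as a tick-counting simulation. This is an illustration only: the rates (a 4 kHz inner loop fed by a 500 Hz outer loop), the function `run_cascade`, and the way a late update is modeled are assumptions, and a real controller would run each loop on its own hardware rather than in one Python loop.

```python
def run_cascade(inner_hz=4000, outer_hz=500, duration_s=0.01, late_updates=()):
    """Simulate a fast inner loop consuming setpoints from a slower outer loop.

    late_updates lists the outer-loop update indices that arrive one inner
    tick late. Returns (ratio, setpoint used by the inner loop on each tick).
    """
    if inner_hz % outer_hz != 0:
        raise ValueError("inner rate must be an integer multiple of outer rate")

    ratio = inner_hz // outer_hz           # inner ticks per outer update
    n_ticks = round(duration_s * inner_hz)
    setpoint = 0                           # last setpoint received so far
    used = []

    for tick in range(n_ticks):
        update_idx, phase = divmod(tick, ratio)
        on_time = phase == 0 and update_idx not in late_updates
        late = phase == 1 and update_idx in late_updates
        if on_time or late:
            setpoint = update_idx + 1      # stand-in for a real torque command
        # The inner loop runs every tick regardless, holding the last value.
        used.append(setpoint)

    return ratio, used


ratio, on_time_run = run_cascade()
_, late_run = run_cascade(late_updates={2})
print(ratio)            # inner ticks per setpoint
print(late_run[14:19])  # the previous setpoint is held one extra tick
```

In the on-time run each setpoint is consumed for exactly `ratio` inner ticks. When update 2 arrives late, the inner loop does not stall: it reuses the previous setpoint for one extra tick and picks up the new one on the next.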

| Layer | Typical rate | What it does | Where it runs | RT class |
|---|---|---|---|---|
| **Current / torque loop** | 10–40 kHz | FOC commutation, regulate phase current | MCU / smart drive / FPGA | Hard |
| **Velocity loop** | 4–20 kHz | Regulate motor/joint speed | MCU / drive | Hard |
| **Joint position / impedance** | 1–4 kHz | Track joint angle, render stiffness/damping | MCU or drive, sometimes SBC | Hard / firm |
| **Whole-body control / MPC** | 100 Hz–1 kHz | Balance, contact forces, multi-joint coordination | SBC (PREEMPT_RT) | Firm |
| **Motion planning / trajectory** | 1–100 Hz | Generate collision-free paths, retiming | SBC | Soft |
| **Perception / state estimation** | 10–60 Hz | SLAM, object detection, sensor fusion | SBC / GPU (Jetson) | Soft |
| **Task / behavior / mission** | 0.1–10 Hz | What to do next | SBC / cloud | Best-effort |

A few things fall out of this table immediately.

**The rate ratio between adjacent loops should be roughly 5–10×.** A velocity loop ten times faster than the position loop it serves can settle within one outer-loop period and looks like an ideal actuator to the layer above. This is the same cascade principle from the [FOC guide](/posts/motor-controllers-foc-ultimate-guide/), applied all the way up the stack.

**The fast loops are simple, the slow loops are complex.** A current loop is two PI controllers and a couple of transforms — it is small enough to bound its WCET to the microsecond. An MPC solving a quadratic program over a 20-step horizon is thousands of times more code and its compute time depends on the problem; you give it a generous budget and a fallback. That complexity is exactly why it lives on a beefy SBC and not on the MCU.

**Setpoint hand-off must be jitter-tolerant.** When the 500 Hz whole-body controller hands a torque setpoint to the 4 kHz joint loop, the joint loop runs eight times per setpoint.
If a whole-body update is occasionally late, the joint loop simply holds the last setpoint for one more cycle — no harm, because the lower loop is the one that must be hard real-time. This is the architectural trick that lets you put the messy, hard-to-bound layers on a soft-RT OS without endangering the robot: **the higher you go, the more lateness you can tolerate, as long as the layer below holds steady.** For the layers above the controller — kinematics, retiming, collision checking — see the [motion planning & kinematics guide](/posts/motion-planning-kinematics-ultimate-guide/). Those run at human-ish rates and are squarely soft real-time. ### A worked example: a 6-axis arm A typical industrial-grade arm: each joint has a smart drive running a 16 kHz current loop and a 4 kHz velocity loop locally. The controller (an SBC on PREEMPT_RT) runs a 1 kHz joint trajectory loop, talking to all six drives over EtherCAT with a 1 ms cycle. Above that, a 100 Hz Cartesian layer and a 10–50 Hz planner. Notice how cleanly the rates separate, and how the hard real-time work (current and velocity) never leaves the drive. ## Latency, jitter, and determinism Three words get thrown around loosely. Let us pin them down because you will be measuring them. **Latency** is the delay from a trigger (timer fires, interrupt arrives) to the response (your code runs, the actuator updates). For a control loop the relevant latency is from the periodic timer tick to the start of your loop iteration. **Jitter** is the *variation* in that latency cycle to cycle. If your loop is supposed to run every 1000.0 µs but actually runs at intervals of 998, 1003, 999, 1001 µs, your jitter is a few microseconds peak-to-peak. Jitter, not average latency, is what destroys control quality — a consistent 50 µs delay you can compensate for; a delay that bounces between 5 µs and 500 µs you cannot. **Determinism** is having a *bounded* jitter and a known worst-case latency. 
A deterministic system can have high latency, as long as it is predictable. > **Rule**: A constant latency is a feature you can tune around. Jitter is a defect you must hunt down and bound. ### Where jitter comes from In rough order of how much pain each causes on a typical SBC: - **Power management.** CPU C-states (deep sleep) take microseconds to tens of microseconds to wake from; frequency scaling (P-states, turbo) changes how long your code takes to run. This is usually the single biggest jitter source on an untuned Linux box, and the first thing to kill: `cpuidle` deep states disabled, governor set to `performance`. - **Interrupts.** Any IRQ can preempt your loop. Network cards, disk, USB, and timers all fire interrupts. On Linux you move IRQs off your control core with `irqaffinity` and route the offenders elsewhere. - **Scheduler.** On a general-purpose OS the scheduler may not run your task the instant it is ready. This is exactly what PREEMPT_RT fixes — full kernel preemption so a high-priority RT task can preempt almost anything. - **System Management Interrupts (SMIs).** x86 firmware can steal the CPU into System Management Mode for hundreds of microseconds, invisibly to the OS, for thermal or power housekeeping. SMIs are the classic "where did that 300 µs spike come from?" culprit and a reason to vet your BIOS/board. ARM SBCs largely avoid this. - **Cache and TLB misses.** First touch of cold code or data costs a memory access. You mitigate by warming caches, locking memory, and keeping the hot path small. - **Memory contention and bus arbitration.** Other cores hammering DRAM, a DMA engine, or a GPU sharing the memory bus add variable stalls. - **Hypervisors / containers.** Virtualization adds a layer of scheduling you do not control. Run hard-RT on bare metal or with carefully pinned, isolated resources. ### Measuring it with cyclictest The standard tool is `cyclictest` (from `rt-tests`). 
It runs a high-priority thread that sleeps for a fixed interval, wakes, and measures the difference between the requested and actual wake time — i.e., your scheduling latency. Always run it *under load*, because an idle system tells you nothing about worst case. Pair it with `stress-ng` or the `hackbench` load generator.

```bash
# Run on isolated core 3, SCHED_FIFO prio 80, 1 thread, 200 us interval,
# lock memory, for 1 hour, while the box is hammered with load.
sudo cyclictest --mlockall --priority=80 --interval=200 \
    --affinity=3 --threads=1 --histogram=1000 --duration=1h
```

Typical output on a tuned PREEMPT_RT box:

```
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 8.21 7.95 6.30 12/843 30122

T: 0 (30119) P:80 I:200 C:18000000 Min: 2 Act: 4 Avg: 5 Max: 23
```

Read that as: minimum latency 2 µs, average 5 µs, **maximum 23 µs** over 18 million samples. That `Max` is your number. It says: if you run a control loop on this core, budget at least 23 µs of scheduling jitter — and in practice add margin, because an hour is not forever and your real workload differs from `cyclictest`.

On a stock, untuned kernel you might see `Max` in the **1000–8000 µs** range, which tells you instantly that a 1 kHz (1000 µs) loop is hopeless there.

> **Rule**: Never report a real-time result without saying what load it ran under and for how long. A clean `cyclictest` on an idle machine is meaningless.

## The RTOS landscape

On a microcontroller you have two structural choices: bare-metal or an RTOS. Both can be hard real-time; they differ in how you organize concurrency.

**Bare-metal** (a `while(1)` superloop plus interrupt service routines) gives you the lowest, most predictable latency because there is no scheduler between your interrupt and your code. For a single tight control loop — a motor drive doing nothing but FOC — bare-metal is often the right answer and the easiest to reason about for WCET.
The downside is that as you add concurrent activities (comms, logging, a second loop) the superloop becomes a tangle of state machines, and you lose preemptive prioritization.

**An RTOS** gives you preemptive priority-based scheduling, threads, and synchronization primitives, so a high-priority control task always preempts low-priority background work. You pay a few microseconds of context-switch and scheduler overhead and a few KB of RAM. For anything beyond a single loop, the structure is usually worth it.

| RTOS | License | Footprint | Scheduling | Strengths | Typical use |
|---|---|---|---|---|---|
| **FreeRTOS** | MIT | ~6–12 KB | Preemptive priority + optional time-slice | Ubiquitous, tiny, huge ecosystem, Amazon-backed | The default small-MCU RTOS; STM32, ESP32, etc. |
| **Zephyr** | Apache 2.0 | ~8 KB+ | Preemptive + cooperative, tickless | Modern, Linux-Foundation, rich drivers, networking, Kconfig/devicetree | New designs wanting connectivity and structure |
| **RTEMS** | BSD-ish | Medium | Preemptive priority | Hard-RT pedigree, POSIX, used in aerospace/space | Spacecraft, scientific instruments |
| **VxWorks** | Commercial | Medium–large | Preemptive priority | Battle-tested, certifiable (DO-178C), strong tooling | Aerospace, defense, medical, industrial |
| **QNX** | Commercial | Large (microkernel) | Preemptive, microkernel + adaptive partitioning | Microkernel robustness, POSIX, safety certs | Automotive, medical, robotics requiring certification |
| **Bare-metal** | n/a | Minimal | ISRs + superloop | Lowest, most predictable latency; trivial WCET | Single tight control loops, motor drives |

A few opinions.

**FreeRTOS** is the sensible default for a small Cortex-M doing a control loop plus housekeeping — it is everywhere, the kernel is small enough to read in an afternoon, and interrupt latency is dominated by your hardware, not the kernel.
**Zephyr** is what I reach for on a new design that needs networking, a real driver model, and a build system that scales — it has matured a lot and the devicetree-driven HAL is genuinely good once you climb the learning curve. **VxWorks and QNX** earn their license fees only when you need formal safety certification or vendor support contracts; otherwise the open options are fine. **RTEMS** is the quiet workhorse if you are anywhere near space or scientific instrumentation. On a real MCU, the RTOS is rarely your latency bottleneck. Your interrupt latency, your DMA setup, and whether you left the data cache on are. A Cortex-M7 servicing an interrupt from TCM with the right priority configuration responds in well under a microsecond; the RTOS scheduler adds maybe 1–3 µs to do a context switch. Compared to the millisecond-scale chaos of an untuned Linux box, MCU-class determinism is in another league entirely — which is exactly why the hard loops live there. ## Real-time Linux Linux was not built for real-time. Its scheduler optimizes throughput and fairness, large sections of the kernel historically ran with preemption disabled, and a low-priority task holding a lock could block a high-priority one for milliseconds. Out of the box, Linux is a soft real-time system at best — fine for perception and planning, useless for a 1 kHz loop. Three approaches fix this, in increasing order of intrusiveness. 

| Approach | How it works | Worst-case latency (tuned) | Pros | Cons |
|---|---|---|---|---|
| **Stock Linux + tuning** | `isolcpus`, RT priorities, IRQ affinity, disable C-states | ~100s of µs to low ms | No patch, easy | Not truly bounded; spikes remain |
| **PREEMPT_RT** (mainline) | Makes nearly all kernel code preemptible; threaded IRQs; priority-inheritance mutexes; high-res timers | **~10–50 µs** | Single kernel, full Linux API, mainline since 6.12 | Slightly lower throughput; still not MCU-class |
| **Xenomai (dual kernel / Cobalt)** | A small co-kernel runs RT tasks beneath Linux; Linux is the idle task | **~1–10 µs** | Hardest determinism available on Linux | Dual API, more complex, separate driver stack |
| **RTAI** | Older dual-kernel co-kernel | low µs | Very low latency | Niche, smaller community today |

### PREEMPT_RT: "Linux isn't real-time, but PREEMPT_RT is good enough"

For most robots in 2026, **PREEMPT_RT is the answer**, and the big news is that after roughly two decades as an out-of-tree patch set, the core of it landed in the mainline kernel (6.12, late 2024). You no longer have to chase a patch against your kernel version; you enable `CONFIG_PREEMPT_RT` and go. That is a genuine milestone — real-time Linux is now a first-class citizen.

What PREEMPT_RT does, mechanically: it converts almost all kernel locks into preemptible, priority-inheriting mutexes, runs interrupt handlers as threads you can prioritize and pin, and makes nearly the entire kernel preemptible. The result is that a high-priority `SCHED_FIFO` task can preempt the kernel itself, so its wake-up latency stops depending on whatever the kernel happened to be doing.

On well-chosen, tuned hardware — meaning a board without nasty SMIs, with C-states and frequency scaling locked down, IRQs steered away, and a dedicated isolated core — you get worst-case scheduling latency in the **10–50 µs** band.
That comfortably supports a 1 kHz (1000 µs) loop with a 20–100× margin between worst-case latency and period, and even a 4 kHz (250 µs) loop with care. It does *not* reliably support a 20 kHz (50 µs) loop — your jitter would be a large fraction of your period. Those stay on the MCU.

### CPU shielding: isolcpus and friends

The other half of the recipe is keeping the general-purpose OS off your control core. The pattern:

- **`isolcpus=3` (and/or `nohz_full=3`, `rcu_nocbs=3`)** as kernel boot parameters — removes core 3 from the general scheduler's balancing and offloads RCU and the scheduler tick from it.
- **Pin your RT thread to core 3** with `pthread_setaffinity_np` or `taskset`.
- **Steer interrupts away** from core 3 via `irqaffinity` or `/proc/irq/*/smp_affinity`.
- **Disable deep C-states** by writing to `/dev/cpu_dma_latency`, and set the CPU governor to `performance`.

The effect: core 3 becomes a near-private compute resource where your loop runs almost undisturbed, while cores 0–2 run Linux, ROS, logging, and everything else. This is the single highest-leverage tuning step on a Linux robot controller.

> **Rule**: PREEMPT_RT plus an isolated, shielded core plus locked-down power management gets a Linux box to where a 1 kHz loop is solid. Skip any one of the three and your worst case will eventually bite you.

## The hardware split: MCU vs SoC/SBC

Here is the design decision that organizes everything else: **what runs on the microcontroller and what runs on the application processor?** The answer follows directly from real-time class.

The hard real-time, kHz, simple, bounded-WCET work goes on a **microcontroller or smart drive**: an STM32 (Cortex-M), a TI C2000 (purpose-built for motor control, with its trig accelerator and high-res PWM), or an FPGA for the extreme cases. These chips have deterministic interrupt latency, no MMU games, no OS to fight, and direct hardware control of PWM and ADC sampling synchronized to the switching edge.
A C2000 doing a 20 kHz FOC loop has jitter measured in *nanoseconds*. You cannot buy that on a Linux SBC at any price, because the architecture works against you.

The soft real-time, compute-heavy, complex work goes on a **SoC / SBC**: a Jetson (Orin or Thor), a Raspberry Pi 5, an x86 box, an i.MX8. These run Linux, have gigabytes of RAM, GPUs for perception, full networking, and the development convenience of a real OS. They run kinematics, planning, perception, state estimation, and the supervisory control layer.

> **Rule**: Put hard real-time where determinism is cheap (the MCU). Put complexity where determinism is expensive but compute is cheap (the SBC). Never invert this.

### Jetson + MCU co-design

The canonical robot brain is a **co-designed pair**: a Jetson for perception and high-level control, plus one or more microcontrollers or smart drives for the actual loops, connected by EtherCAT, CAN/CANopen, or a custom SPI/UART link. The Jetson never closes a torque loop. It sends setpoints — joint targets, Cartesian goals, gait parameters — at 100 Hz to 1 kHz, and the MCUs turn those into the kHz current commands that move the motors.

This is exactly how modern humanoids and quadrupeds are built. See the [humanoid robot hardware guide](/posts/humanoid-robot-hardware-ultimate-guide/) and the [legged quadruped robot hardware guide](/posts/legged-quadruped-robot-hardware-ultimate-guide/): a central computer (often Jetson Orin/Thor class) runs the whole-body controller and perception at a few hundred Hz to ~1 kHz, while each leg/joint actuator embeds its own MCU running the current loop at 20–40 kHz. The high-level brain can stutter for a few milliseconds during a perception spike and the robot stays upright, because the joint-level loops never miss.

### Why not just run everything on the SBC?

People try.
They put a 1 kHz loop in a Linux thread, see it mostly works, ship it, and then field a robot that occasionally faults a drive when the WiFi stack does something interesting or a log flush stalls. Even with PREEMPT_RT, the SBC is the *less* deterministic half of the system, and the higher your loop rate the more its jitter eats your period.

The MCU is not a compromise you make for cost — it is the right tool. A $3 STM32 closes a loop more reliably than a $2000 GPU board, and that is not changing. The SBC side keeps getting more capable, so the loops that can reasonably live on Linux creep upward, but the bottom of the pyramid — the current loop tied to PWM switching at tens of kHz — is not moving off dedicated MCU silicon.

## Real-time fieldbuses

Once you have multiple smart drives and an SBC, they have to talk — deterministically. A standard Ethernet switch with TCP/IP is hopeless for this: variable latency, retransmissions, no synchronization. Real-time fieldbuses solve it. The deep dive on industrial networking lives in the [industrial automation, PLC, SCADA & fieldbus guide](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/); here is the control engineer's view.

### EtherCAT and the distributed-clock trick

**EtherCAT won motion control**, and it won for two reasons: processing-on-the-fly and distributed clocks.

*Processing on the fly* means the master sends one Ethernet frame that travels down the daisy-chain of slaves, and each slave reads its outputs and writes its inputs *as the frame passes through its hardware*, with nanosecond-scale delay, then forwards it on. One frame services the entire network. There is no per-slave round trip. This is why EtherCAT can service 100 axes in roughly 100 µs and sustain cycle times of **50 µs–1 ms** across a real machine.

*Distributed clocks* (DC) are the genuinely clever part.
One slave's clock is the reference, and the master measures the propagation delay to every other slave (down to nanoseconds) and continuously disciplines every slave's local clock to the reference. The result: all slaves share a common time base synchronized to **well under 1 µs** of skew — often < 100 ns. Each drive then latches its actuator command and samples its feedback at the *same instant network-wide*, triggered by a DC sync interrupt rather than by frame arrival. That removes the jitter of the communication itself from the control timing. Two motors on opposite ends of a 30-node chain step in lockstep.

Cycle-time math for sizing a network:

```
Per-axis EtherCAT process data: ~12 bytes (e.g. CiA 402: control word,
  target position, status word, actual position).
Ethernet frame overhead    : 38 bytes (preamble, SOF, header, CRC, IFG)
EtherCAT frame header      :  2 bytes
Per-slave datagram overhead: ~12 bytes (10-byte datagram header + 2-byte working counter)

6 axes x (12 data + 12 overhead) = 144 bytes of datagrams
Total frame ~ 144 + 40 = 184 bytes -> 1472 bits
At 100 Mbit/s: 1472 bits / 100e6 = 14.7 us on the wire
Add ~1 us per slave forwarding delay x 6 = 6 us
Total bus time ~ 21 us  -> fits comfortably in a 250 us (4 kHz) cycle
```

That headroom is why a single EtherCAT master on an SBC can run a 1–4 kHz process-data cycle to a dozen drives with margin to spare. Common open masters: **IgH EtherCAT Master (EtherLab)**, **SOEM** (Simple Open EtherCAT Master, great for embedded), and the EtherCAT hardware-interface drivers available for `ros2_control`.

### CANopen and CAN

**CANopen** (CiA 301 application layer, CiA 402 for drives) runs over CAN at up to 1 Mbit/s (or a few Mbit/s with CAN FD). It is slower than EtherCAT and shares one bus among all nodes, so it is event- and priority-arbitrated rather than cyclically scheduled. Realistic deterministic cycle times are **1–10 ms** for a handful of axes.
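A rough bus-time estimate shows where that 1–10 ms range comes from. The sketch below, in C, uses the commonly cited worst-case length of a classical CAN frame with an 11-bit identifier (bit stuffing and interframe space included). The traffic model is an assumption of this sketch: one zero-length SYNC frame per cycle plus a fixed number of full 8-byte PDOs per axis. The function names are hypothetical.

```c
#include <assert.h>

/* Worst-case bits on the wire for one classical CAN frame with an 11-bit
 * identifier: payload + 47 framing bits (including the 3-bit interframe
 * space) + worst-case stuff bits over the 34 stuffable framing bits
 * and the payload. */
unsigned can_frame_bits_worst(unsigned data_bytes)
{
    assert(data_bytes <= 8);
    unsigned payload_bits = 8u * data_bytes;
    return payload_bits + 47u + (34u + payload_bits - 1u) / 4u;
}

/* Bus time in microseconds for one cycle under the assumed traffic model:
 * one SYNC frame (0 data bytes) plus pdos_per_axis 8-byte PDOs per axis. */
double canopen_cycle_us(unsigned axes, unsigned pdos_per_axis, double bitrate_bps)
{
    assert(bitrate_bps > 0.0);
    unsigned bits = can_frame_bits_worst(0)
                  + axes * pdos_per_axis * can_frame_bits_worst(8);
    return 1e6 * (double)bits / bitrate_bps;
}
```

Under these assumptions, `canopen_cycle_us(6, 2, 1e6)` (six axes, one command PDO and one feedback PDO each, 1 Mbit/s) comes to 1675 µs of bus time. A 1 ms cycle does not fit, and a 2 ms cycle already loads the bus to about 84% before any SDO, heartbeat, or emergency traffic.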
CANopen still dominates cost-sensitive servo and industrial applications, and CiA 402 is the lingua franca of drive profiles — even many EtherCAT drives speak CiA 402 over EtherCAT (CoE). For robots with modest axis counts and rates, plain CAN/CANopen is often plenty, and the wiring is dead simple.

### The quick comparison

| Bus | Sync / cycle | Determinism | Topology | Best for |
|---|---|---|---|---|
| **EtherCAT** | 50 µs–1 ms, DC < 1 µs skew | Excellent | Daisy chain / ring | High-axis-count, high-rate motion control |
| **CANopen** | 1–10 ms typical | Good (arbitrated) | Bus | Cost-sensitive servo, moderate rates |
| **EtherNet/IP (CIP Sync)** | ~1 ms+ | Good with PTP | Star | Factory/PLC ecosystems |
| **PROFINET IRT** | 250 µs–1 ms | Excellent (IRT) | Line/star | Siemens/PLC ecosystems |
| **SERCOS III** | 31.25 µs–1 ms | Excellent | Ring | High-end CNC/motion |

For a robot built from scratch, the choice is usually EtherCAT (if you need many fast axes or want sub-microsecond sync) or CANopen (if you want cheap and simple at lower rates). The PLC-ecosystem buses matter when you are integrating into an existing factory line — that is the [industrial automation guide](/posts/industrial-automation-plc-scada-fieldbus-ultimate-guide/)'s territory.

## Writing real-time code

A deterministic OS and bus get you nothing if the code in the hot loop is non-deterministic. Real-time code is a discipline, and the rules are non-negotiable on the hard path.

### The forbidden list

> **Rule**: In the real-time path — no dynamic memory, no blocking, no unbounded work, no page faults. Everything the loop touches must have a bounded, known cost.

- **No `malloc`/`free`/`new`/`delete`.** The allocator can take a lock, walk a free list, or call the kernel for more pages — all unbounded. Allocate everything up front, before the loop starts. Use pre-sized pools and ring buffers.
- **No blocking syscalls.** No `printf` to a terminal, no file I/O, no `sleep` other than your loop's timed wait, no socket calls that can block. Logging happens by writing to a lock-free ring buffer that a *separate, lower-priority* thread drains and writes out.
- **No unbounded loops or recursion.** Every loop must have a compile-time or load-time bound. No "iterate until converged" without a hard iteration cap.
- **No page faults.** A page fault is a trip to the kernel and possibly to disk — milliseconds. Lock all memory resident with `mlockall(MCL_CURRENT | MCL_FUTURE)` and pre-fault your stack and heap.
- **Bounded WCET.** You should be able to state the worst-case execution time of the loop body and show it is comfortably under the period.

### Locking memory and setting up the RT thread

The standard setup for a Linux RT control thread:

```c
#include