Real-Time Robot Control Systems: The Ultimate Guide
A deep, practical guide to real-time robot control: hard vs soft real-time, jitter and worst-case latency, RTOS vs PREEMPT_RT Linux, the MCU/SBC split, EtherCAT and distributed clocks, ros2_control, and how to design and validate a deterministic control loop.
A robot is a real-time system pretending to be a computer. The arm does not care that your laptop can do 40 GFLOPS; it cares that the current command for joint 4 arrives every 1.000 ms, on time, every single time, for the next eight hours. Miss one, and at best you get a velocity bump you can feel. Miss a few in a row at the wrong moment and you get a torque spike, a tripped drive, or a 30 kg payload going somewhere it should not.
This guide is about closing the control loop on time. Not fast — on time. Those are different properties, and conflating them is the single most common mistake engineers make when they first build a robot controller. We will define real-time precisely, walk the multi-rate control hierarchy from the kHz current loop down to the 10 Hz planner, dig into where jitter actually comes from and how to measure it with cyclictest, compare the RTOS landscape against real-time Linux, settle the MCU-versus-SBC argument, and get concrete about EtherCAT distributed clocks, real-time code discipline, ros2_control, and time synchronization. Then we will design and validate a system end to end.
The take: Real-time is about worst-case latency and bounded jitter, not throughput or average speed. The hard part of a robot is not making something happen quickly once — it is guaranteeing it happens within a deadline a million times in a row. The winning architecture in 2026 is almost always a split one: a microcontroller or smart drive holds the hard real-time current/torque loop at kHz with sub-microsecond jitter, while a Linux SBC running PREEMPT_RT handles the soft, complex, compute-heavy stuff — kinematics, planning, perception — and the two talk over a deterministic fieldbus like EtherCAT. Stop trying to run a 1 kHz torque loop in a ROS 2 node on stock Ubuntu. Put it where determinism is cheap.
Companion reading: motor controllers & FOC, motion planning & kinematics, industrial automation, PLC, SCADA & fieldbus, ROS 2, and robot sensors.
Table of contents
- Key takeaways
- What "real-time" actually means
- Why robots need real-time
- The robot control hierarchy and its rates
- Latency, jitter, and determinism
- The RTOS landscape
- Real-time Linux
- The hardware split: MCU vs SoC/SBC
- Real-time fieldbuses
- Writing real-time code
- ros2_control and real-time ROS 2
- Time sync and multi-rate coordination
- Designing and validating a real-time system
- Frequently asked questions
What "real-time" actually means
Let me kill the most expensive misconception first: real-time does not mean fast. It means on time. A real-time system is one whose correctness depends not only on producing the right answer but on producing it within a defined deadline. A result that is correct but late is, by definition, a wrong result.
This reframes everything. The question is never "how fast can this run?" It is "can this guarantee it finishes before the deadline, in the worst case, under worst-case load?" Throughput is a best-case, average-case concern. Real-time is a worst-case discipline.
Rule: In real-time engineering, the average is marketing and the maximum is truth. Always quote and budget against worst-case latency.
Determinism is the property you actually want
The technical word for "on time, every time" is determinism: the same input under the same conditions produces the same timing behavior, within a bounded window. A deterministic system has a worst-case execution time (WCET) and a worst-case response time you can actually compute or measure and rely on.
A modern application CPU is built to maximize average throughput, and almost every trick it uses — out-of-order execution, deep speculation, multi-level caches, branch prediction, dynamic frequency scaling — trades determinism for speed. The result is a processor that is blisteringly fast on average and wildly variable instant to instant. That variability is poison for a control loop.
A humble Cortex-M microcontroller running from tightly-coupled memory with the caches off does far less per second, but it does it with timing you can predict to the clock cycle. For closing a current loop, predictable beats fast every time.
Hard, firm, and soft real-time
The taxonomy comes down to what a missed deadline costs you:
| Class | Missed deadline means | Examples in a robot | Typical home |
|---|---|---|---|
| Hard | Catastrophic / safety failure | Current/torque loop, motor commutation, safety-rated stop, brake control | MCU, smart drive, FPGA |
| Firm | Result is useless but no disaster; degrades performance | Sensor fusion frame dropped, a single missed servo update | RTOS or PREEMPT_RT |
| Soft | Quality degrades, value decays with lateness | Perception pipeline, path planning, teleop video, UI | Stock Linux, soft-RT threads |
The mistake is to treat the whole robot as one class. A real robot is a mix: the commutation is hard, the impedance loop is firm-to-hard, the planner is soft, the GUI is best-effort. You architect each loop according to its class, and you spend your determinism budget where the cost of missing is highest.
Real-time is not a kernel feature you switch on
There is no checkbox. Real-time is an end-to-end property of the entire path from sensor edge to actuator command: the interrupt latency, the scheduler, the driver, the bus, the application code, even the power-management settings in firmware. A single non-deterministic component anywhere in that chain — a malloc in the hot loop, a network stack with unbounded retries, a CPU dropping into a deep C-state — destroys the determinism of the whole thing. You are only as real-time as your worst link.
Why robots need real-time
A robot is fundamentally a feedback control machine. It reads the world (encoders, IMUs, force sensors), computes a correction, and commands actuators — over and over, forever. Feedback control theory assumes a fixed sample period T. Your gains, your stability margins, your filter coefficients are all derived assuming the loop closes exactly every T seconds. The mathematics of a discrete PID or a state-space controller is built on that assumption.
Break the assumption and you break the control. If the loop period wanders, your effective derivative gain wanders with it (D term divides by dt), your integrator accumulates wrong, and your phase margin erodes. Enough jitter and a perfectly-tuned loop oscillates or goes unstable. See the cascade structure in the motor controllers & FOC guide — every one of those nested loops assumes a steady rate.
What happens when a 1 kHz current loop misses
Take a concrete case: a field-oriented current loop at 1 kHz on a motor drive, the inner loop of a servo (covered in depth in the FOC guide). Its job is to regulate phase current — and therefore torque — by updating PWM duty every 1.000 ms based on the latest current measurement and rotor angle.
Now it misses an update. For one period the PWM holds the old duty cycle. The rotor has moved; the dq-frame angle is stale; the d-axis and q-axis currents are no longer decoupled correctly. You inject a current component you did not intend. Best case: a small torque ripple and audible tick. Worse case: with the motor spinning fast, a 1 ms stale angle at, say, 3000 rpm is roughly 18 mechanical degrees — enough to push current well off-axis, spike phase current, trip the drive's overcurrent protection, and fault the axis mid-motion.
Now imagine it is not the current loop but the safety path — the loop that watches a force-torque sensor and must command a stop within a deadline. A missed deadline there is not a tick; it is a person.
Rule: For a hard real-time loop, design so a single missed deadline is detected and handled (hold last command, fault safe), and so missing two in a row is impossible under your latency budget. Never assume misses do not happen — assume they do and bound the blast radius.
The multi-rate reality
No single rate fits a robot. You cannot run perception at 20 kHz (the camera does not produce frames that fast and the compute would melt), and you cannot run a current loop at 30 Hz (the motor would be uncontrollable). So robots are multi-rate: a hierarchy of nested loops running at rates spanning four orders of magnitude, each feeding setpoints to the loop beneath it. Getting the rates right, and putting each loop on the right hardware, is most of the architecture battle.
The robot control hierarchy and its rates
Think of robot control as a pyramid. The fast, simple, hard-real-time loops sit at the bottom, closest to the actuators. The slow, complex, compute-heavy, soft-real-time layers sit at the top. Each layer issues setpoints to the layer below at a rate the lower layer can absorb.
| Layer | Typical rate | What it does | Where it runs | RT class |
|---|---|---|---|---|
| Current / torque loop | 10–40 kHz | FOC commutation, regulate phase current | MCU / smart drive / FPGA | Hard |
| Velocity loop | 4–20 kHz | Regulate motor/joint speed | MCU / drive | Hard |
| Joint position / impedance | 1–4 kHz | Track joint angle, render stiffness/damping | MCU or drive, sometimes SBC | Hard / firm |
| Whole-body control / MPC | 100 Hz–1 kHz | Balance, contact forces, multi-joint coordination | SBC (PREEMPT_RT) | Firm |
| Motion planning / trajectory | 1–100 Hz | Generate collision-free paths, retiming | SBC | Soft |
| Perception / state estimation | 10–60 Hz | SLAM, object detection, sensor fusion | SBC / GPU (Jetson) | Soft |
| Task / behavior / mission | 0.1–10 Hz | What to do next | SBC / cloud | Best-effort |
A few things fall out of this table immediately.
The rate ratio between adjacent loops should be roughly 5–10×. A velocity loop ten times faster than the position loop it serves can settle within one outer-loop period and looks like an ideal actuator to the layer above. This is the same cascade principle from the FOC guide, applied all the way up the stack.
The fast loops are simple, the slow loops are complex. A current loop is two PI controllers and a couple of transforms — it is small enough to bound its WCET to the microsecond. An MPC solving a quadratic program over a 20-step horizon is thousands of times more code and its compute time depends on the problem; you give it a generous budget and a fallback. That complexity is exactly why it lives on a beefy SBC and not on the MCU.
Setpoint hand-off must be jitter-tolerant. When the 500 Hz whole-body controller hands a torque setpoint to the 4 kHz joint loop, the joint loop runs eight times per setpoint. If a whole-body update is occasionally late, the joint loop simply holds the last setpoint for one more cycle — no harm, because the lower loop is the one that must be hard real-time. This is the architectural trick that lets you put the messy, hard-to-bound layers on a soft-RT OS without endangering the robot: the higher you go, the more lateness you can tolerate, as long as the layer below holds steady.
For the layers above the controller — kinematics, retiming, collision checking — see the motion planning & kinematics guide. Those run at human-ish rates and are squarely soft real-time.
A worked example: a 6-axis arm
A typical industrial-grade arm: each joint has a smart drive running a 16 kHz current loop and a 4 kHz velocity loop locally. The controller (an SBC on PREEMPT_RT) runs a 1 kHz joint trajectory loop, talking to all six drives over EtherCAT with a 1 ms cycle. Above that, a 100 Hz Cartesian layer and a 10–50 Hz planner. Notice how cleanly the rates separate, and how the hard real-time work (current and velocity) never leaves the drive.
Latency, jitter, and determinism
Three words get thrown around loosely. Let us pin them down because you will be measuring them.
Latency is the delay from a trigger (timer fires, interrupt arrives) to the response (your code runs, the actuator updates). For a control loop the relevant latency is from the periodic timer tick to the start of your loop iteration.
Jitter is the variation in that latency cycle to cycle. If your loop is supposed to run every 1000.0 µs but actually runs at intervals of 998, 1003, 999, 1001 µs, your jitter is a few microseconds peak-to-peak. Jitter, not average latency, is what destroys control quality — a consistent 50 µs delay you can compensate for; a delay that bounces between 5 µs and 500 µs you cannot.
Determinism is having a bounded jitter and a known worst-case latency. A deterministic system can have high latency, as long as it is predictable.
Rule: A constant latency is a feature you can tune around. Jitter is a defect you must hunt down and bound.
Where jitter comes from
In rough order of how much pain each causes on a typical SBC:
- Power management. CPU C-states (deep sleep) take microseconds to tens of microseconds to wake from; frequency scaling (P-states, turbo) changes how long your code takes to run. This is usually the single biggest jitter source on an untuned Linux box, and the first thing to kill:
cpuidledeep states disabled, governor set toperformance. - Interrupts. Any IRQ can preempt your loop. Network cards, disk, USB, and timers all fire interrupts. On Linux you move IRQs off your control core with
irqaffinityand route the offenders elsewhere. - Scheduler. On a general-purpose OS the scheduler may not run your task the instant it is ready. This is exactly what PREEMPT_RT fixes — full kernel preemption so a high-priority RT task can preempt almost anything.
- System Management Interrupts (SMIs). x86 firmware can steal the CPU into System Management Mode for hundreds of microseconds, invisibly to the OS, for thermal or power housekeeping. SMIs are the classic "where did that 300 µs spike come from?" culprit and a reason to vet your BIOS/board. ARM SBCs largely avoid this.
- Cache and TLB misses. First touch of cold code or data costs a memory access. You mitigate by warming caches, locking memory, and keeping the hot path small.
- Memory contention and bus arbitration. Other cores hammering DRAM, a DMA engine, or a GPU sharing the memory bus add variable stalls.
- Hypervisors / containers. Virtualization adds a layer of scheduling you do not control. Run hard-RT on bare metal or with carefully pinned, isolated resources.
Measuring it with cyclictest
The standard tool is cyclictest (from rt-tests). It runs a high-priority thread that sleeps for a fixed interval, wakes, and measures the difference between the requested and actual wake time — i.e., your scheduling latency. Always run it under load, because an idle system tells you nothing about worst case. Pair it with stress-ng or the hackbench load generator.
# Run on isolated core 3, SCHED_FIFO prio 80, 1 thread, 200 us interval,
# lock memory, for 1 hour, while the box is hammered with load.
sudo cyclictest --mlockall --priority=80 --interval=200 \
--affinity=3 --threads=1 --histogram=1000 --duration=1h
Typical output on a tuned PREEMPT_RT box:
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 8.21 7.95 6.30 12/843 30122
T: 0 (30119) P:80 I:200 C:18000000 Min: 2 Act: 4 Avg: 5 Max: 23
Read that as: minimum latency 2 µs, average 5 µs, maximum 23 µs over 18 million samples. That Max is your number. It says: if you run a control loop on this core, budget at least 23 µs of scheduling jitter — and in practice add margin, because an hour is not forever and your real workload differs from cyclictest. On a stock, untuned kernel you might see Max in the 1000–8000 µs range, which tells you instantly that a 1 kHz (1000 µs) loop is hopeless there.
Rule: Never report a real-time result without saying what load it ran under and for how long. A clean
cyclicteston an idle machine is meaningless.
The RTOS landscape
On a microcontroller you have two structural choices: bare-metal or an RTOS. Both can be hard real-time; they differ in how you organize concurrency.
Bare-metal (a while(1) superloop plus interrupt service routines) gives you the lowest, most predictable latency because there is no scheduler between your interrupt and your code. For a single tight control loop — a motor drive doing nothing but FOC — bare-metal is often the right answer and the easiest to reason about for WCET. The downside is that as you add concurrent activities (comms, logging, a second loop) the superloop becomes a tangle of state machines, and you lose preemptive prioritization.
An RTOS gives you preemptive priority-based scheduling, threads, and synchronization primitives, so a high-priority control task always preempts low-priority background work. You pay a few microseconds of context-switch and scheduler overhead and a few KB of RAM. For anything beyond a single loop, the structure is usually worth it.
| RTOS | License | Footprint | Scheduling | Strengths | Typical use |
|---|---|---|---|---|---|
| FreeRTOS | MIT | ~6–12 KB | Preemptive priority + optional time-slice | Ubiquitous, tiny, huge ecosystem, Amazon-backed | The default small-MCU RTOS; STM32, ESP32, etc. |
| Zephyr | Apache 2.0 | ~8 KB+ | Preemptive + cooperative, tickless | Modern, Linux-Foundation, rich drivers, networking, Kconfig/devicetree | New designs wanting connectivity and structure |
| RTEMS | BSD-ish | Medium | Preemptive priority | Hard-RT pedigree, POSIX, used in aerospace/space | Spacecraft, scientific instruments |
| VxWorks | Commercial | Medium–large | Preemptive priority | Battle-tested, certifiable (DO-178C), strong tooling | Aerospace, defense, medical, industrial |
| QNX | Commercial | Large (microkernel) | Preemptive, microkernel + adaptive partitioning | Microkernel robustness, POSIX, safety certs | Automotive, medical, robotics requiring certification |
| Bare-metal | n/a | Minimal | ISRs + superloop | Lowest, most predictable latency; trivial WCET | Single tight control loops, motor drives |
A few opinions. FreeRTOS is the sensible default for a small Cortex-M doing a control loop plus housekeeping — it is everywhere, the kernel is small enough to read in an afternoon, and interrupt latency is dominated by your hardware, not the kernel. Zephyr is what I reach for on a new design that needs networking, a real driver model, and a build system that scales — it has matured a lot and the devicetree-driven HAL is genuinely good once you climb the learning curve. VxWorks and QNX earn their license fees only when you need formal safety certification or vendor support contracts; otherwise the open options are fine. RTEMS is the quiet workhorse if you are anywhere near space or scientific instrumentation.
On a real MCU, the RTOS is rarely your latency bottleneck. Your interrupt latency, your DMA setup, and whether you left the data cache on are. A Cortex-M7 servicing an interrupt from TCM with the right priority configuration responds in well under a microsecond; the RTOS scheduler adds maybe 1–3 µs to do a context switch. Compared to the millisecond-scale chaos of an untuned Linux box, MCU-class determinism is in another league entirely — which is exactly why the hard loops live there.
Real-time Linux
Linux was not built for real-time. Its scheduler optimizes throughput and fairness, large sections of the kernel historically ran with preemption disabled, and a low-priority task holding a lock could block a high-priority one for milliseconds. Out of the box, Linux is a soft real-time system at best — fine for perception and planning, useless for a 1 kHz loop.
Three approaches fix this, in increasing order of intrusiveness.
| Approach | How it works | Worst-case latency (tuned) | Pros | Cons |
|---|---|---|---|---|
| Stock Linux + tuning | isolcpus, RT priorities, IRQ affinity, disable C-states |
~100s of µs to low ms | No patch, easy | Not truly bounded; spikes remain |
| PREEMPT_RT (mainline) | Makes nearly all kernel code preemptible; threaded IRQs; priority-inheritance mutexes; high-res timers | ~10–50 µs | Single kernel, full Linux API, mainline since 6.12 | Slightly lower throughput; still not MCU-class |
| Xenomai (dual kernel / Cobalt) | A small co-kernel runs RT tasks beneath Linux; Linux is the idle task | ~1–10 µs | Hardest determinism available on Linux | Dual API, more complex, separate driver stack |
| RTAI | Older dual-kernel co-kernel | low µs | Very low latency | Niche, smaller community today |
PREEMPT_RT: "Linux isn't real-time, but PREEMPT_RT is good enough"
For most robots in 2026, PREEMPT_RT is the answer, and the big news is that after roughly two decades as an out-of-tree patch set, the core of it landed in the mainline kernel (6.12, late 2024). You no longer have to chase a patch against your kernel version; you enable CONFIG_PREEMPT_RT and go. That is a genuine milestone — real-time Linux is now a first-class citizen.
What PREEMPT_RT does, mechanically: it converts almost all kernel locks into preemptible, priority-inheriting mutexes, runs interrupt handlers as threads you can prioritize and pin, and makes nearly the entire kernel preemptible. The result is that a high-priority SCHED_FIFO task can preempt the kernel itself, so its wake-up latency stops depending on whatever the kernel happened to be doing.
On well-chosen, tuned hardware — meaning a board without nasty SMIs, with C-states and frequency scaling locked down, IRQs steered away, and a dedicated isolated core — you get worst-case scheduling latency in the 10–50 µs band. That comfortably supports a 1 kHz (1000 µs) loop with two orders of magnitude of margin, and even a 4 kHz (250 µs) loop with care. It does not reliably support a 20 kHz (50 µs) loop — your jitter would be a large fraction of your period. Those stay on the MCU.
CPU shielding: isolcpus and friends
The other half of the recipe is keeping the general-purpose OS off your control core. The pattern:
isolcpus=3(and/ornohz_full=3,rcu_nocbs=3) as kernel boot parameters — removes core 3 from the general scheduler's balancing and offloads RCU and the scheduler tick from it.- Pin your RT thread to core 3 with
pthread_setaffinity_nportaskset. - Steer interrupts away from core 3 via
irqaffinityor/proc/irq/*/smp_affinity. - Disable deep C-states by writing to
/dev/cpu_dma_latency, and set the CPU governor toperformance.
The effect: core 3 becomes a near-private compute resource where your loop runs almost undisturbed, while cores 0–2 run Linux, ROS, logging, and everything else. This is the single highest-leverage tuning step on a Linux robot controller.
Rule: PREEMPT_RT plus an isolated, shielded core plus locked-down power management gets a Linux box to where a 1 kHz loop is solid. Skip any one of the three and your worst case will eventually bite you.
The hardware split: MCU vs SoC/SBC
Here is the design decision that organizes everything else: what runs on the microcontroller and what runs on the application processor? The answer follows directly from real-time class.
The hard real-time, kHz, simple, bounded-WCET work goes on a microcontroller or smart drive: an STM32 (Cortex-M), a TI C2000 (purpose-built for motor control, with its trig accelerator and high-res PWM), or an FPGA for the extreme cases. These chips have deterministic interrupt latency, no MMU games, no OS to fight, and direct hardware control of PWM and ADC sampling synchronized to the switching edge. A C2000 doing a 20 kHz FOC loop has jitter measured in nanoseconds. You cannot buy that on a Linux SBC at any price, because the architecture works against you.
The soft real-time, compute-heavy, complex work goes on a SoC / SBC: a Jetson (Orin or Thor), a Raspberry Pi 5, an x86 box, an i.MX8. These run Linux, have gigabytes of RAM, GPUs for perception, full networking, and the development convenience of a real OS. They run kinematics, planning, perception, state estimation, and the supervisory control layer.
Rule: Put hard real-time where determinism is cheap (the MCU). Put complexity where determinism is expensive but compute is cheap (the SBC). Never invert this.
Jetson + MCU co-design
The canonical robot brain is a co-designed pair: a Jetson for perception and high-level control, plus one or more microcontrollers or smart drives for the actual loops, connected by EtherCAT, CAN/CANopen, or a custom SPI/UART link. The Jetson never closes a torque loop. It sends setpoints — joint targets, Cartesian goals, gait parameters — at 100 Hz to 1 kHz, and the MCUs turn those into the kHz current commands that move the motors.
This is exactly how modern humanoids and quadrupeds are built. See the humanoid robot hardware guide and the legged quadruped robot hardware guide: a central compute (often Jetson Orin/Thor class) runs the whole-body controller and perception at a few hundred Hz to ~1 kHz, while each leg/joint actuator embeds its own MCU running the current loop at 20–40 kHz. The high-level brain can stutter for a few milliseconds during a perception spike and the robot stays upright, because the joint-level loops never miss.
Why not just run everything on the SBC?
People try. They put a 1 kHz loop in a Linux thread, see it mostly works, ship it, and then field a robot that occasionally faults a drive when the WiFi stack does something interesting or a log flush stalls. Even with PREEMPT_RT, the SBC is the less deterministic half of the system, and the higher your loop rate the more its jitter eats your period. The MCU is not a compromise you make for cost — it is the right tool. A $3 STM32 closes a loop more reliably than a $2000 GPU board, and that is not changing. The SBC side keeps getting more capable, so the loops that can reasonably live on Linux creep upward, but the bottom of the pyramid — the current loop tied to PWM switching at tens of kHz — is not moving off silicon.
Real-time fieldbuses
Once you have multiple smart drives and an SBC, they have to talk — deterministically. A standard Ethernet switch with TCP/IP is hopeless for this: variable latency, retransmissions, no synchronization. Real-time fieldbuses solve it. The deep dive on industrial networking lives in the industrial automation, PLC, SCADA & fieldbus guide; here is the control engineer's view.
EtherCAT and the distributed-clock trick
EtherCAT won motion control, and it won for two reasons: processing-on-the-fly and distributed clocks.
Processing on the fly means the master sends one Ethernet frame that travels down the daisy-chain of slaves, and each slave reads its outputs and writes its inputs as the frame passes through its hardware, with nanosecond-scale delay, then forwards it on. One frame services the entire network. There is no per-slave round trip. This is why EtherCAT can service 100 axes in roughly 100 µs and sustain cycle times of 50 µs–1 ms across a real machine.
Distributed clocks (DC) are the genuinely clever part. One slave's clock is the reference, and the master measures the propagation delay to every other slave (down to nanoseconds) and continuously disciplines every slave's local clock to the reference. The result: all slaves share a common time base synchronized to well under 1 µs of skew — often < 100 ns. Each drive then latches its actuator command and samples its feedback at the same instant network-wide, triggered by a DC sync interrupt rather than by frame arrival. That removes the jitter of the communication itself from the control timing. Two motors on opposite ends of a 30-node chain step in lockstep.
Cycle-time math for sizing a network:
Per-axis EtherCAT process data: ~12 bytes (e.g. CiA 402: control word,
target position, status word, actual position).
Ethernet frame overhead : 38 bytes (preamble, SOF, header, CRC, IFG)
EtherCAT frame header : 12 bytes
Per-slave datagram overhead: ~12 bytes
6 axes x (12 data + 12 overhead) = 144 bytes of process data
Total frame ~ 144 + 50 = 194 bytes -> ~1552 bits
At 100 Mbit/s: 1552 bits / 100e6 = 15.5 us on the wire
Add ~1 us per slave forwarding delay x 6 = 6 us
Total bus time ~ 22 us -> fits comfortably in a 250 us (4 kHz) cycle
That headroom is why a single EtherCAT master on an SBC can run a 1–4 kHz process-data cycle to a dozen drives with margin to spare. Common open masters: IgH EtherCAT Master (EtherLab), SOEM (Simple Open EtherCAT Master, great for embedded), and the EtherCAT support inside ros2_control.
CANopen and CAN
CANopen (CiA 301 application layer, CiA 402 for drives) runs over CAN at up to 1 Mbit/s (or a few Mbit/s with CAN FD). It is slower than EtherCAT and shares one bus among all nodes, so it is event- and priority-arbitrated rather than cyclically scheduled. Realistic deterministic cycle times are 1–10 ms for a handful of axes. CANopen still dominates cost-sensitive servo and industrial applications, and CiA 402 is the lingua franca of drive profiles — even many EtherCAT drives speak CiA 402 over EtherCAT (CoE). For robots with modest axis counts and rates, plain CAN/CANopen is often plenty, and the wiring is dead simple.
The quick comparison
| Bus | Sync / cycle | Determinism | Topology | Best for |
|---|---|---|---|---|
| EtherCAT | 50 µs–1 ms, DC < 1 µs skew | Excellent | Daisy chain / ring | High-axis-count, high-rate motion control |
| CANopen | 1–10 ms typical | Good (arbitrated) | Bus | Cost-sensitive servo, moderate rates |
| EtherNet/IP (CIP Sync) | ~1 ms+ | Good with PTP | Star | Factory/PLC ecosystems |
| PROFINET IRT | 250 µs–1 ms | Excellent (IRT) | Line/star | Siemens/PLC ecosystems |
| SERCOS III | 31.25 µs–1 ms | Excellent | Ring | High-end CNC/motion |
For a robot built from scratch, the choice is usually EtherCAT (if you need many fast axes or want sub-microsecond sync) or CANopen (if you want cheap and simple at lower rates). The PLC-ecosystem buses matter when you are integrating into an existing factory line — that is the industrial automation guide's territory.
Writing real-time code
A deterministic OS and bus get you nothing if the code in the hot loop is non-deterministic. Real-time code is a discipline, and the rules are non-negotiable on the hard path.
The forbidden list
Rule: In the real-time path — no dynamic memory, no blocking, no unbounded work, no page faults. Everything the loop touches must have a bounded, known cost.
- No
malloc/free/new/delete. The allocator can take a lock, walk a free list, or call the kernel for more pages — all unbounded. Allocate everything up front, before the loop starts. Use pre-sized pools and ring buffers. - No blocking syscalls. No
printfto a terminal, no file I/O, nosleepother than your loop's timed wait, no socket calls that can block. Logging happens by writing to a lock-free ring buffer that a separate, lower-priority thread drains and writes out. - No unbounded loops or recursion. Every loop must have a compile-time or load-time bound. No "iterate until converged" without a hard iteration cap.
- No page faults. A page fault is a trip to the kernel and possibly to disk — milliseconds. Lock all memory resident with
mlockalland pre-fault your stack and heap. - Bounded WCET. You should be able to state the worst-case execution time of the loop body and show it is comfortably under the period.
Locking memory and setting up the RT thread
The standard setup for a Linux RT control thread:
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <string.h>
void setup_rt_thread(void) {
/* 1. Lock all current and future memory; no page faults in the loop. */
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
perror("mlockall");
/* 2. Pre-fault and lock the stack so the first deep call doesn't fault. */
const size_t STACK = 512 * 1024; /* 512 KB */
unsigned char dummy[STACK];
memset(dummy, 0, STACK); /* touch every page now */
/* 3. Pin to an isolated core (e.g. core 3 from isolcpus=3). */
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(3, &set);
pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
/* 4. SCHED_FIFO with a high (but not max) priority. */
struct sched_param sp = { .sched_priority = 80 };
if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0)
perror("setschedparam");
}
SCHED_FIFO is a fixed-priority, run-to-completion-or-preemption policy: the highest-priority ready FIFO task runs until it blocks or yields. Use a high priority for the control loop but leave headroom (do not use 99) so that critical kernel threads and the watchdog can still run above you. SCHED_DEADLINE (EDF-based) is an alternative when you want the kernel to admission-control and guarantee a runtime within a period — elegant for well-characterized loops, though SCHED_FIFO with an isolated core remains the common workhorse.
The control-loop skeleton
A correct periodic loop uses absolute-time waits (clock_nanosleep with TIMER_ABSTIME), not relative sleeps, so that the time spent computing does not accumulate as drift:
#define PERIOD_NS 1000000L /* 1 ms -> 1 kHz */
void control_loop(void) {
struct timespec next;
clock_gettime(CLOCK_MONOTONIC, &next);
uint64_t overruns = 0;
for (;;) {
/* --- hard real-time work: bounded, no alloc, no blocking --- */
read_feedback(); /* fetch latest sensor data */
update_controllers(); /* PID / state-space update */
write_commands(); /* push setpoints to drives */
/* ------------------------------------------------------------ */
next.tv_nsec += PERIOD_NS;
while (next.tv_nsec >= 1000000000L) { next.tv_nsec -= 1000000000L; next.tv_sec++; }
/* Detect overruns: did we already pass the next deadline? */
struct timespec now;
clock_gettime(CLOCK_MONOTONIC, &now);
if (now.tv_sec > next.tv_sec ||
(now.tv_sec == next.tv_sec && now.tv_nsec > next.tv_nsec)) {
overruns++; /* log it OUT of band; never printf here */
}
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
}
}
Two details matter. First, the absolute-deadline next accumulates the period, so jitter in any one iteration does not drift the long-term rate. Second, the overrun check is how you know the system is healthy — count missed deadlines and surface that count (via shared memory to a logging thread), because an undetected overrun is how robots silently degrade.
Priority inversion and priority inheritance
The classic real-time bug: a high-priority task (H) needs a mutex held by a low-priority task (L); meanwhile a medium-priority task (M) preempts L. Now M, which H does not even depend on, runs ahead of H — H is blocked indefinitely by something lower than it. This is priority inversion, and it is famously what stalled the Mars Pathfinder rover in 1997.
The fix is priority inheritance: while L holds a mutex that H wants, L temporarily inherits H's priority, so M cannot preempt it; L finishes fast, releases the mutex, and H proceeds. On Linux, set PTHREAD_PRIO_INHERIT on the mutex attribute. Better still on the hot path: avoid shared locks entirely. Use lock-free single-producer/single-consumer ring buffers and double-buffering to pass data between the RT loop and the rest of the system, and the inversion problem never arises.
Rule: If you must share data with the RT loop, prefer lock-free structures. If you must use a mutex, make it priority-inheriting and hold it for the minimum bounded time.
ros2_control and real-time ROS 2
ROS 2 is the dominant robotics middleware, and a fair question is whether you can do real-time control in it. The honest answer: ROS 2 nodes are not hard real-time, but you can run a hard-real-time loop inside the ROS 2 process if you are disciplined. The full ROS 2 picture is in the ROS 2 guide; here is the real-time slice.
Why plain ROS 2 nodes are not hard real-time
A normal ROS 2 callback runs in an executor, scheduled cooperatively, on top of DDS for transport. DDS does discovery, serialization, and possibly network I/O, and the default executor and allocator are not real-time. Message latency through DDS is typically good (tens to hundreds of microseconds intra-host) but not bounded — and the callback-based model gives you no hard deadline guarantee. So you never put a 1 kHz torque loop in a stock rclcpp subscription callback and expect determinism.
How ros2_control does it
ros2_control solves this with a clear separation. The controller manager runs a dedicated update() loop — typically on its own thread, which you configure as SCHED_FIFO on an isolated core, with mlockall called at startup. That loop:
- Calls
read()on the hardware interface (e.g. an EtherCAT or CAN driver) to pull fresh joint states. - Calls
update()on each loaded controller (joint trajectory, admittance, etc.) — your control math. - Calls
write()to push commands back to the hardware.
This read → update → write cycle is the real-time heart, and as long as your hardware interface and controllers obey the real-time rules (no allocation, no blocking, no DDS on the hot path), it runs deterministically at 1–4 kHz. The trick is that this loop does not go through DDS for its inner work — it talks directly to the hardware interface in-process. ROS topics, services, and parameters are used for the non-real-time edges: loading controllers, sending goals, publishing state at a lower rate. Those cross the executor and DDS; the control loop does not.
Rule: In a
ros2_controlsystem, keep DDS, the parameter server, dynamic allocation, and logging out of theread/update/writepath. Pass data to the ROS world through real-time-safe buffers (e.g.realtime_tools::RealtimeBuffer/RealtimePublisher).
In practice this works well. A ros2_control controller manager on PREEMPT_RT, pinned to a shielded core, talking EtherCAT to the drives, comfortably holds a 1 kHz loop with the jitter dominated by the bus DC sync (sub-µs) rather than by ROS. The middleware overhead (DDS discovery, the executor) lives entirely outside the loop. For the loops that need to be faster — current and velocity — those are still down on the drives, and ros2_control just commands them.
Time sync and multi-rate coordination
In a single-MCU robot, "time" is just the timer peripheral and life is simple. The moment you have multiple compute nodes, multiple sensors, and a fieldbus, you have a distributed-clock problem: every device has its own oscillator, and oscillators drift. Without synchronization, a timestamp from the camera, an encoder reading from a drive, and an IMU sample are each in their own time frame, and your sensor fusion is fusing apples measured at unknown moments.
Why timestamping matters
State estimators (Kalman filters, factor graphs) assume they know when each measurement was taken, relative to the others, to sub-millisecond precision. If your IMU sample is timestamped when ROS received it rather than when the IMU sampled it, you have injected the entire transport-and-scheduling latency — and its jitter — into your estimate. For a fast-moving robot that is the difference between a crisp state estimate and a smeary, lagging one. Timestamp at the source. The sensor-side detail is in the robot sensors guide.
Rule: Timestamp every measurement as close to the physical sampling instant as possible — ideally in hardware, latched by a sync signal — never at software receive time.
PTP / IEEE 1588
Precision Time Protocol (IEEE 1588, PTP) synchronizes clocks across an Ethernet network to sub-microsecond accuracy when the NICs and switches do hardware timestamping (PTP-aware hardware), and to tens of microseconds in software-only mode. One node is the grandmaster; the rest discipline their clocks to it by exchanging timestamped sync messages and measuring the path delay. On a multi-computer robot — say a Jetson plus a perception PC plus smart cameras — PTP is how you get them all onto one clock so their timestamps are directly comparable. Linux exposes this through the linuxptp stack (ptp4l, phc2sys) and the PTP_HW clock.
EtherCAT distributed clocks, revisited
On the EtherCAT bus, distributed clocks (covered above) already give you sub-microsecond synchronization for free across all drives, and the DC sync signal can also trigger synchronized sampling. A common clean design: the EtherCAT DC reference is the master clock, and PTP disciplines the SBC's system clock to it, so the whole robot — Ethernet side and EtherCAT side — shares one time base. Then a camera frame, an EtherCAT-attached force sensor, and a joint encoder reading are all comparable to within a microsecond or two.
Coordinating the rates
Multi-rate coordination is mostly about clean hand-offs and consistent snapshots. The fast loop publishes its state into a buffer; the slow loop reads a coherent snapshot of that buffer at its own rate. You never have the slow loop wait on the fast loop or vice versa — they are decoupled through buffers, each running on its own clock-disciplined cadence. When the planner (10 Hz) produces a new trajectory, it hands the whole trajectory to the 1 kHz tracker, which interpolates between waypoints at its own rate. The planner being occasionally late just means the tracker keeps following the previous trajectory a moment longer — exactly the jitter tolerance the hierarchy is designed to provide.
Designing and validating a real-time system
Now we put it together. Here is how I approach a real-time robot controller from a blank sheet.
Step 1: Classify every loop
List every control loop and feedback path. For each, write down: required rate, real-time class (hard/firm/soft), what it reads, what it commands, and the cost of a missed deadline. This single table dictates your architecture. Anything hard at > a few kHz is going on an MCU or drive. Anything hard at ≤ 4 kHz can go on PREEMPT_RT if you must. Everything soft goes on Linux without ceremony.
Step 2: Choose rates with the 5–10× rule
Pick rates so each loop is 5–10× faster than the one it serves, and so each loop's rate is comfortably above the bandwidth it must control (rule of thumb: sample at least 10–20× the closed-loop bandwidth). Do not over-rate: a 10 kHz loop on a Linux SBC is asking for trouble when 1 kHz would do, and every extra kHz costs determinism margin.
Step 3: Assign hardware and bus
Apply the split. MCUs/drives for the hard fast loops, SBC (PREEMPT_RT) for the firm/soft control, GPU for perception. Pick the fieldbus from the axis count and rate: EtherCAT for many fast axes with tight sync, CANopen for fewer/slower. Plan the time-sync scheme (DC + PTP) up front, not after.
Step 4: Budget the latency
For each hard loop, write a latency budget that sums to less than the period with margin. A 1 ms (1 kHz) example:
| Item | Budget |
|---|---|
| Scheduling jitter (cyclictest Max + margin) | 50 µs |
| EtherCAT bus cycle (read feedback) | 25 µs |
| Control computation (WCET) | 100 µs |
| EtherCAT bus cycle (write command) | 25 µs |
| Slack / margin | ≥ 200 µs |
| Total used | ≤ 800 µs of 1000 µs |
If the sum approaches the period, you do not have a real-time system — you have a system that happens to work until load rises. Keep meaningful slack.
Step 5: Bring-up, in order
- Tune the OS first. PREEMPT_RT kernel,
isolcpus/nohz_full/rcu_nocbson the control core, IRQ affinity off it, C-states disabled, governorperformance, BIOS SMI sources minimized. - Run
cyclictestfor hours under load. Get a realMaxfigure on your hardware. If it is not in the tens of µs, fix the OS/hardware before writing a line of control code. - Bring up the bus. Verify EtherCAT DC sync is locking (skew < 1 µs), check the cycle is meeting its deadline, confirm no lost frames.
- Bring up the loop with logging. Run the
read/update/writeloop, log actual period and overrun count to a ring buffer drained by a non-RT thread. Watch the overrun count under stress. - Stress it. Hammer the box with perception load, network traffic, logging — whatever production looks like — and confirm the overrun count stays at zero and the period histogram stays tight.
Step 6: Validate continuously, in production
Real-time validation is not a one-time bring-up checklist; it is a permanent instrument. Keep counting overruns and logging the loop-period histogram in the shipped robot. A latency regression from a kernel update, a new background service, or a thermal-throttling event will show up as overruns long before it shows up as a fault — if you are watching.
The checklist
Real-time design checklist
- Every loop classified hard/firm/soft with a missed-deadline cost.
- Hard fast loops on MCU/drive; soft/complex on SBC.
- Rates follow the 5–10× cascade rule and ≥ 10× the controlled bandwidth.
- PREEMPT_RT kernel, isolated + shielded control core, C-states off, governor
performance.cyclictestMax measured under load, for hours, on the real hardware.- Latency budget written and summing to < period with margin.
- RT thread:
SCHED_FIFO, pinned,mlockall, no alloc/blocking/page faults in the loop.- Lock-free or priority-inheriting data sharing; DDS/logging off the hot path.
- Fieldbus DC/PTP sync verified < 1 µs; no lost frames under load.
- Sensors timestamped at the source.
- Overrun counter and period histogram logged in production.
Do all of that and you have a robot whose control loop closes on time, every time — which, as we said at the top, is the entire job.
Frequently asked questions
Is real-time the same as fast / low-latency? No, and conflating them is the root of most real-time mistakes. Fast means low average latency or high throughput. Real-time means bounded worst-case latency — meeting a deadline every single time. A slow-but-predictable system is real-time; a fast-but-occasionally-stalling system is not. A 200 MHz MCU with 1 µs jitter is a better real-time controller than a 4 GHz CPU with millisecond jitter.
Can I run a hard real-time control loop on a Raspberry Pi or Jetson with Linux? You can run a firm loop up to about 1 kHz, sometimes a bit higher, on PREEMPT_RT with a shielded core, locked memory, and disabled power management — measured worst-case scheduling latency in the tens of microseconds. You should not run a truly hard loop (where a single miss is catastrophic) there, and you cannot run a 20 kHz current loop there at all. Put genuinely hard, fast loops on a microcontroller or smart drive.
Is mainline Linux real-time now that PREEMPT_RT was merged?
The PREEMPT_RT functionality landed in the mainline kernel (6.12, late 2024), so you no longer need an out-of-tree patch — you enable CONFIG_PREEMPT_RT. That makes Linux a strong soft/firm real-time OS, good for 1 kHz loops on tuned hardware. It does not make Linux a hard-real-time MCU replacement; the worst case (tens of µs) is still orders of magnitude looser than a bare-metal Cortex-M (sub-µs).
FreeRTOS or Zephyr for a new MCU project? FreeRTOS if you want the smallest, most ubiquitous, dead-simple kernel and your needs are mostly "a few tasks plus a control loop." Zephyr if you want a modern driver model, built-in networking/connectivity, devicetree-based configuration, and room to grow — at the cost of a steeper learning curve. Both are hard real-time capable; on an MCU your interrupt latency and cache configuration usually matter more than the kernel choice.
Why did EtherCAT win over other industrial Ethernet buses for robots? Two reasons: processing-on-the-fly (one frame services the whole daisy-chain in tens of microseconds, no per-node round trips) and distributed clocks (every slave synchronized to under 1 µs of skew, so all axes act at the same instant). That combination gives sub-millisecond, low-jitter, multi-axis coordination that is hard to beat. CANopen is still excellent for cheaper, lower-rate systems, and CiA 402 drive profiles are shared across both.
What exactly is jitter and why does it matter more than latency? Jitter is the cycle-to-cycle variation in your loop timing. A constant latency you can compensate for in your control design; jitter you cannot, because it makes your effective sample period — and therefore your gains, integrator, and derivative term — wander unpredictably. Enough jitter erodes phase margin and can destabilize a well-tuned loop. Always design and validate against worst-case jitter, not average latency.
How do I actually measure my system's real-time performance?
Run cyclictest (from rt-tests) for hours, under realistic load (use stress-ng), pinned to your control core at your loop's RT priority, and read the Max latency. That is your scheduling jitter. Then instrument your real loop to count overruns and log a period histogram in production. Numbers from an idle machine, or from a five-minute run, are not trustworthy — worst-case behavior hides in the tail.
Why can't I use malloc or printf in a control loop?
Both are unbounded in time. malloc/free can take a lock, walk a free list, or syscall to the kernel for more pages — any of which can stall for an unknown duration and possibly cause a page fault. printf does formatting and blocking I/O. Anything with unbounded or unknown worst-case time destroys determinism. Pre-allocate everything, lock memory with mlockall, and pass log data out through a lock-free ring buffer that a separate low-priority thread drains.
What is priority inversion and how do I prevent it?
Priority inversion is when a high-priority task is blocked waiting on a resource (mutex) held by a low-priority task, while a medium-priority task preempts the low one — so the medium task effectively runs ahead of the high one. It stalled the Mars Pathfinder rover in 1997. Prevent it with priority-inheritance mutexes (PTHREAD_PRIO_INHERIT), so the lock holder temporarily inherits the waiter's priority. Better: avoid shared locks on the hot path entirely using lock-free buffers.
Can ROS 2 do real-time control?
ROS 2 nodes are not hard real-time — the executor and DDS transport are not bounded. But you can run a hard-or-firm real-time loop inside a ROS 2 process using ros2_control: its controller-manager read → update → write loop runs on a dedicated SCHED_FIFO thread on an isolated core, talking directly to the hardware interface, with DDS, allocation, and logging kept entirely off the hot path. The control loop is deterministic; the ROS messaging around it is not, and that is fine because it handles only the non-real-time edges.
Do I really need to synchronize clocks across my robot's computers? If you fuse data from multiple sensors or coordinate multiple compute nodes, yes. Without sync, each device's timestamps live in its own drifting time frame and your state estimator cannot correctly order or align measurements. Use PTP/IEEE 1588 (sub-microsecond with hardware timestamping) across Ethernet, EtherCAT distributed clocks on the bus, and always timestamp measurements at the physical sampling instant — not at software receive time.
SCHED_FIFO or SCHED_DEADLINE for my RT thread?
SCHED_FIFO (fixed priority, run-to-preemption) on a pinned, isolated core is the common, well-understood workhorse and what most ros2_control setups use. SCHED_DEADLINE (earliest-deadline-first with kernel admission control) is elegant when you can characterize each task's runtime and period precisely and want the kernel to guarantee CPU budget — useful for mixed-criticality scheduling, but more setup. Start with SCHED_FIFO plus core isolation; reach for SCHED_DEADLINE when you need formal budget guarantees across multiple RT tasks.
Related guides
- Stepper Motors & Drivers: The Ultimate Guide
An engineer-grade guide to stepper motors and drivers: how steps and microsteps really work, NEMA frame sizes, the torque-speed curve, resonance and missed steps, A4988 vs Trinamic TMC drivers, closed-loop steppers, and honest sizing math.
- Servo Motors: The Ultimate Guide
A deep, engineer-grade guide to servo motors: RC vs industrial vs smart serial servos, PWM and closed-loop control, datasheet specs, cascaded PID, sizing math, failure modes, and a real-product comparison table.
- Linear Motion Systems: Rails, Ball Screws & Linear Motors — The Ultimate Guide
A working engineer's guide to linear motion: profile rails and recirculating-ball guides, ball/lead/roller screws, belt and rack drives, and linear motors — with preload classes, accuracy grades, life and critical-speed math, real parts, and a selection workflow.
- Brushless DC Motors (BLDC) for Robotics: The Ultimate Guide
A robotics engineer's deep dive into brushless DC motors: Kv vs Kt, trapezoidal vs FOC commutation, sensored vs sensorless, gimbal/QDD actuators, datasheet math, and how to size a BLDC for a robot joint or drone.
- Robot Actuators: Electric, Hydraulic & Pneumatic — The Ultimate Guide
A working engineer's guide to robot actuators — electric, hydraulic, pneumatic, series-elastic, QDD, and soft — with real power/force-density numbers, products, and a selection cheat-sheet.
- Robot Gearboxes: Harmonic & Cycloidal Drives — The Ultimate Guide
A working engineer's guide to robot gear reduction: harmonic (strain-wave), cycloidal RV, and planetary drives compared on ratio, backlash, stiffness, efficiency, shock tolerance, and where each one actually belongs in a robot.