Reinforcement Learning for Robotics: The Ultimate Guide

Around 2019 a quadruped from ETH Zurich learned to walk in a simulator and then walked, on the first try, on real grass. No one hand-tuned a gait. No one wrote a state machine for stance and swing. A neural network mapped joint angles and a body-velocity command to twelve joint targets, and the gait — the whole coordinated mess of contact, recovery, and balance — fell out of an optimization that ran for a few hours on one GPU. That result, and the dozens that followed it, is why every serious legged-robot and humanoid team in 2026 has an RL person, and why a lot of classical-controls people are nervously learning PyTorch.

This guide is the long version for the engineers who actually build these systems: the controls person who wants to know why PPO beats their carefully tuned MPC on rough terrain, the ML person who can train a policy in sim but can't get it to survive contact with a real robot, and the advanced maker who has read the ANYmal papers and wants the recipe. We go end to end: why RL suits contact-rich robotics at all, the MDP fundamentals, the three or four algorithms that actually work on hardware, the massively-parallel sim-to-real pipeline, domain randomization, the reward-hacking trap, imitation learning, the teacher-student recipe that made legged RL reliable, the landmark results, when RL beats classical control and when it absolutely does not, and how you get a trained policy running at 50 Hz on onboard compute without it blowing up.

The take: RL is not a replacement for control theory — it is a compiler that turns a reward function and a good simulator into a reactive feedback policy for problems where you can't write the controller by hand. It wins decisively on contact-rich, hard-to-model, high-dimensional tasks (legged locomotion, dexterous manipulation, whole-body humanoid control) and loses to MPC and trajectory optimization on well-modeled, accuracy-critical, low-dimensional tasks (a 6-axis arm tracing a weld seam). The 2026 frontier is not the algorithm — PPO has barely changed since 2017 — it is the simulator, the randomization, and the sim-to-real bridge. Get those wrong and the fanciest algorithm gives you a policy that walks beautifully in sim and falls over on the floor.

Companion reading: robot simulation & digital twins, legged & quadruped robot hardware, humanoid robot hardware, and real-time control systems.

Key takeaways
Why RL for robots at all
RL fundamentals: the MDP, reward, policy, value, return
The algorithm landscape: model-free vs model-based, on- vs off-policy
The algorithms that actually work on robots
The sim-to-real pipeline
Domain & dynamics randomization
Reward shaping and the reward-hacking trap
Imitation learning: BC, DAgger, and how it complements RL
Teacher-student & privileged learning
Landmark results: legged, dexterous, humanoid
Learned vs classical control
Deploying a policy
On-robot fine-tuning, safety & limitations
Data & compute budget
Frequently asked questions

Key takeaways

RL earns its keep on contact-rich, hard-to-model, high-dimensional problems. Legged locomotion, dexterous in-hand manipulation, and whole-body humanoid control all involve discontinuous contact dynamics that are painful to model and to control analytically. RL learns a reactive policy directly from simulated experience and sidesteps the modeling problem.
PPO dominates parallel-sim locomotion not because it is the most sample-efficient algorithm but because it is the most robust one. It tolerates bad hyperparameters, scales cleanly to tens of thousands of parallel environments, and rarely diverges. On a single GPU running Isaac Lab you can collect billions of simulated steps in hours, so PPO's sample-hunger stops mattering.
SAC and TD3 are the sample-efficient off-policy alternatives for continuous control. Use them when environment steps are expensive — single-environment sim, or real-robot fine-tuning — where PPO's appetite for fresh on-policy data is fatal. SAC's entropy regularization makes it the safer default of the two.
The simulator is the product. Sim-to-real transfer succeeds or fails on simulator fidelity, randomization, and the observation design — not on the RL algorithm. Teams obsess over PPO clip ratios when the real bug is an actuator model that ignores motor delay. See robot simulation & digital twins.
Massively parallel simulation changed the economics. Isaac Gym and now Isaac Lab run thousands of robot instances on a single GPU at hundreds of thousands of steps per second. A legged-locomotion policy that took days on CPU clusters in 2018 trains in well under an hour in 2026.
Domain randomization is the bridge, not a trick. Randomize masses, friction, latency, motor gains, terrain, and sensor noise during training and the policy learns a controller robust to the distribution of plausible real robots — which includes the actual one. Randomize too little and it overfits sim; too much and it learns nothing.
Reward hacking is the default failure mode, not an edge case. Any exploitable gap between what you reward and what you want, the optimizer will find. Budget more time for reward debugging than for algorithm tuning.
Teacher-student / privileged learning is the legged-robot recipe. Train a teacher with access to privileged state (true friction, contact forces, terrain height) it could never measure on hardware, then distill it into a student that uses only onboard sensors and a short history. This decouples "learn the skill" from "learn to perceive."
Imitation learning complements RL; it rarely replaces it. Behavior cloning gives you a warm start or a reference style; DAgger fixes the compounding-error problem of pure BC; RL then optimizes for robustness and performance the demonstrations never showed.
RL beats MPC when the model is bad or the contacts are many; MPC beats RL when the model is good and accuracy is contractual. Don't put a learned policy on a robot tracing a weld seam to 0.1 mm. Don't put MPC alone on a quadruped sprinting over rubble. See motion planning & kinematics.
Deployment is an embedded-systems problem. Export to ONNX, optionally compile with TensorRT, run inference at the control rate (50–100 Hz for locomotion, up to 1 kHz for some manipulation), and respect the real-time control loop. A 2-layer MLP policy runs in tens of microseconds on a modern CPU; you do not need a GPU on the robot for most locomotion policies.
On-robot fine-tuning is risky and usually unnecessary. Most 2026 production stacks train fully in sim and deploy frozen. If you must fine-tune on hardware, fence it with hard safety limits and an outer classical controller that can take over.
Compute budgets are modest by LLM standards. A flagship legged-locomotion policy is a few-million-parameter MLP trained for a few GPU-hours; a dexterous-manipulation policy might be a few-GPU-days. The expensive resource is engineer time spent on reward and sim fidelity, not FLOPs.

Why RL for robots at all

Classical control is extraordinary at what it does well. Give a controls engineer a well-modeled, low-dimensional, smooth system — a motor, a drone, a 6-axis arm in free space — and they will give you a controller with provable stability, predictable behavior, and microsecond latency. For those problems, reaching for RL is engineering malpractice. It is slower to develop, harder to certify, and worse on the metrics that matter.

RL earns its place where three conditions stack up.

Contact is everywhere and it's discontinuous. A foot striking the ground, a finger rolling a cube, a hand wedging a peg into a hole — contact makes the dynamics hybrid and non-smooth. The equations of motion switch as contacts make and break, friction cones clip forces, and small state changes flip the system between regimes. Gradient-based controllers built on a single smooth model struggle; the model is right only between contact events. RL doesn't need a unified analytic model — it learns from rollouts that already contain all the contact transitions.

The dynamics are hard to model accurately. Series-elastic and quasi-direct-drive actuators have their own dynamics. Cables stretch, gears have backlash and friction, soft feet deform, payloads shift. You can spend a year identifying a model that's still wrong by 20%. RL with randomization learns a policy robust to a family of models, which is more honest about how poorly you actually know the robot.

The dimensionality and the desired behavior are high. A humanoid has 25-50 actuated joints. The behavior you want — walk, recover from a shove, climb stairs, carry a box — is not a setpoint; it's an emergent, context-dependent coordination of all those joints. Writing that by hand is a state-machine nightmare. RL produces emergent gaits: behaviors no one specified, discovered because they maximize reward. The ANYmal trot, gallop, and recovery behaviors were never coded — they emerged.

Rule: Choose RL when you cannot write the controller by hand and you can write a reward and build a good-enough simulator. If either of those is false, RL is the wrong tool.

The flip side: RL gives up the things classical control gives you for free — stability guarantees, interpretability, sample-free design, and tight accuracy. You trade analyzability for the ability to solve problems analysis can't reach. On a quadruped scrambling over rubble that's a great trade. On a precision arm it's a terrible one.

RL fundamentals: the MDP, reward, policy, value, return

Strip away the deep-learning machinery and RL is the theory of sequential decision-making under uncertainty. The frame is the Markov Decision Process (MDP): states s, actions a, a transition model P(s'|s,a), a reward function r(s,a), and a discount factor γ ∈ [0,1).

For a robot, the state is whatever the policy gets to see — joint positions and velocities, base orientation and angular velocity from the IMU, the velocity command, maybe a history of past observations and actions, maybe an exteroceptive terrain map. The action is the policy output — almost always target joint positions for a downstream PD controller, sometimes torques directly (rarely, because it's harder to make safe). The transition is the simulator's physics step. The reward is your scalar encoding of "good behavior."

The agent's goal is to maximize expected return — the discounted sum of future reward:

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k}

Objective:  J(π) = E_{τ~π} [ Σ_t γ^t · r(s_t, a_t) ]

The discount γ (typically 0.99 for locomotion, meaning a horizon of roughly 1/(1−γ) = 100 steps) trades immediate against future reward and keeps the infinite sum finite.

Three functions carry all the weight:

Policy π(a|s) — the controller. Maps state to a distribution over actions. For continuous control it's usually a Gaussian whose mean is a neural network output and whose variance is a learned (often state-independent) parameter. At deployment you take the mean — deterministic.
State-value V^π(s) — expected return from state s under policy π. "How good is it to be here?"
Action-value Q^π(s,a) — expected return from taking action a in state s, then following π. "How good is this move?"

The advantage A(s,a) = Q(s,a) − V(s) measures how much better an action is than the policy's average — the single most useful quantity in policy-gradient methods, because it tells you which actions to make more or less likely:

A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t)

# In practice, estimated with Generalized Advantage Estimation (GAE):
δ_t   = r_t + γ·V(s_{t+1}) − V(s_t)            # TD residual
Â_t   = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}           # λ ∈ [0,1] trades bias vs variance

GAE's λ (commonly 0.95) is the bias-variance knob: λ=0 is low-variance, high-bias one-step TD; λ=1 is high-variance, unbiased Monte Carlo. Most locomotion recipes sit at 0.95.

Rule: If your robot's observation isn't Markov — if the optimal action depends on history the policy can't see — either add a short observation history (stack the last N frames) or use a recurrent policy. A purely reactive MLP on a non-Markov observation will plateau and you'll blame the algorithm.

The Markov assumption is where real robots bite you. Motor delay, sensor latency, and unobserved terrain all break the clean MDP. The standard fixes — stacking observation history, adding an actuator-delay model in sim, or using the teacher-student trick below — are really about restoring enough state that the problem becomes Markov again.

The algorithm landscape: model-free vs model-based, on- vs off-policy

Two axes organize the whole field, and knowing where an algorithm sits tells you most of what you need.

Model-free vs model-based. Model-free methods (PPO, SAC, TD3, DDPG) learn a policy and/or value function directly from experience without ever learning the transition dynamics. Model-based methods (Dreamer, MBPO, TD-MPC2) learn a model of the world and plan or generate imagined rollouts inside it. Model-based methods are far more sample-efficient — they squeeze more learning from each real interaction — which matters enormously if your data comes from a real robot. But when you have a fast simulator, the data is nearly free, and the extra complexity and instability of learning a model often isn't worth it. In 2026, simulated robotics is overwhelmingly model-free; real-world-only learning is where model-based methods shine.

On-policy vs off-policy. On-policy methods (PPO, A2C, TRPO) can only learn from data collected by the current policy; after each update the old data is stale and discarded. Off-policy methods (SAC, TD3, DDPG, Q-learning) learn from a replay buffer of past experience, including data from old policies. Off-policy is dramatically more sample-efficient because every transition can be reused many times. On-policy is more stable and parallelizes beautifully.

The practical consequence is the central trade of the field:

Cheap data (massively parallel sim): use on-policy PPO. Sample inefficiency is irrelevant when you generate 200,000 steps/second.
Expensive data (single sim, real robot, slow sim): use off-policy SAC or TD3. You can't afford to throw experience away.

This is why nearly every published legged-locomotion result uses PPO and nearly every sample-efficiency benchmark and real-robot-learning paper uses SAC. They're solving the same RL problem under opposite data economics.

The algorithms that actually work on robots

Four model-free continuous-control algorithms cover ~95% of real robotics RL. Their lineage matters: DDPG begat TD3 begat the off-policy family; TRPO begat PPO. Here's the comparison that I'd tape to the wall.

Algorithm	Type	Sample efficiency	Stability / robustness	Parallelism	Best use on robots
PPO	On-policy, model-free	Low (needs lots of steps)	High — very forgiving	Excellent (10k+ envs)	Locomotion, humanoid, anything in massively parallel sim
SAC	Off-policy, model-free	High	High (entropy-regularized)	Moderate	Sample-limited continuous control, real-robot fine-tune, manipulation
TD3	Off-policy, model-free	High	Medium (tuning-sensitive)	Moderate	Sample-limited deterministic control where SAC's entropy isn't wanted
DDPG	Off-policy, model-free	Medium	Low — brittle	Moderate	Mostly historical; use TD3 or SAC instead

PPO — why it dominates parallel-sim locomotion

Proximal Policy Optimization is a policy-gradient method that improves the policy while preventing each update from changing it too much. The "proximal" part is a clipped surrogate objective: it computes the ratio between the new and old policy probabilities for each action and clips it to [1−ε, 1+ε] (ε ≈ 0.2), so a single update can't lurch the policy into a region where the advantage estimates are no longer valid.

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)          # probability ratio
L_CLIP(θ) = E_t [ min( r_t(θ)·Â_t,
                       clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ]

That clipping is the whole reason PPO dominates. It makes the algorithm robust to bad hyperparameters and large, noisy advantage estimates — exactly the conditions you get when you run 4,096 parallel environments and dump a giant heterogeneous batch into one update. PPO almost never diverges, which on a project where a training run costs real wall-clock time is worth more than theoretical sample efficiency.

The pairing with massively parallel sim is the key insight. PPO is on-policy and sample-hungry, which sounds disqualifying — until you note that a single RTX-class GPU running Isaac Lab can step tens of thousands of robots simultaneously. The sample-inefficiency that kills PPO on a real robot is a non-issue when the simulator hands you billions of steps for free. The ETH legged-locomotion line of work, the Isaac Gym ANYmal results, and most Unitree and humanoid locomotion policies are PPO. It is, frankly, boring and reliable, and that's the point.

SAC and TD3 — sample-efficient continuous control

When environment steps are expensive, you switch to off-policy methods that learn from a replay buffer.

Soft Actor-Critic (SAC) adds an entropy bonus to the objective — it maximizes reward and policy randomness — which drives systematic exploration and makes the algorithm remarkably robust to hyperparameters. It learns two Q-functions (taking the min to fight overestimation), a stochastic policy, and auto-tunes the entropy temperature. SAC is my default for anything sample-limited: manipulation in a single sim, real-robot learning, or fine-tuning. It tolerates a wide range of reward scales and tends to "just work."

Twin Delayed DDPG (TD3) is the deterministic-policy counterpart. It fixes DDPG's notorious Q-value overestimation with three tricks: twin critics (take the min), delayed policy updates (update the policy less often than the critics), and target-policy smoothing (add noise to target actions). TD3 is excellent and slightly more sample-efficient than SAC on some tasks, but it's more sensitive to exploration-noise tuning because its policy is deterministic. Choose TD3 over SAC when you specifically want a deterministic policy and you're willing to tune the exploration noise.

DDPG is the ancestor. It works, but it's brittle and easy to destabilize; in 2026 there's no reason to start a project on DDPG when TD3 and SAC exist.

Rule of thumb: Parallel sim → PPO. Sample-limited → SAC (default) or TD3. Real-robot-only with no sim → consider model-based (Dreamer/TD-MPC2) or SAC with a very small step budget. If you're unsure, start with PPO in sim; it's the one most likely to give you a working policy on the first serious attempt.

The sim-to-real pipeline

Almost no successful robot RL in 2026 learns on the real robot. The data is too slow, too expensive, and too dangerous to collect. The dominant paradigm is train in simulation, transfer to reality — and the engineering is mostly in the transfer.

The pipeline, end to end:

Build the digital twin. Accurate URDF/MJCF, mass and inertia from CAD, joint limits, and — critically — an actuator model that captures the real motor's torque-speed curve, delay, and PD behavior. This is the single highest-leverage step. See robot simulation & digital twins.
Massively parallel rollout. Spin up thousands of randomized environment instances on GPU (Isaac Lab / Isaac Gym, MuJoCo MJX, or Brax). Collect experience at hundreds of thousands of steps per second.
Train the policy with PPO (typically), with domain randomization active from step one.
Validate in sim across held-out randomization ranges and edge cases the training distribution didn't emphasize.
Export and deploy the frozen policy (ONNX → optionally TensorRT) onto onboard compute, running at the control rate.
Close the loop on hardware — log everything, compare sim vs real trajectories, and feed the gap back into the simulator (the "real-to-sim" correction).

# Wall-time intuition for a legged-locomotion PPO run.
# Target total experience:        ~2e9 simulation steps  (2 billion)
# Throughput (Isaac Lab, 1 GPU):  ~2e5 steps / second     (4096 envs)
#
#   wall_time = 2e9 / 2e5  =  1e4 seconds  ≈  2.8 hours
#
# Manipulation with a smaller env count (~512) and heavier sim
# might run 1e4 steps/s -> 2e9 / 1e4 = 2e5 s ≈ 2.3 days.
# Throughput, not algorithm choice, sets your wall clock.

The numbers are the headline: the same locomotion task that took days on 2018-era CPU clusters now trains in a couple of hours on one GPU, because Isaac Gym/Lab moved the entire RL loop — physics, observation assembly, reward, and policy inference — onto the GPU and eliminated the CPU-GPU transfer bottleneck that capped earlier frameworks.

Rule: Spend your first week on the actuator model and the observation design, not on the algorithm. A policy trained against an actuator model that ignores motor delay will oscillate or fall on the real robot no matter how good your PPO config is.

Domain & dynamics randomization

The reason a sim-trained policy survives reality is that you never trained it on the simulation — you trained it on a distribution of simulations. Domain randomization (DR) perturbs the simulator's parameters every episode so the policy must work across a range of conditions. If the real robot's true parameters fall inside that range, the policy treats reality as just another sample it has already seen.

There are two flavors. Dynamics randomization perturbs physics — masses, friction, motor gains, latency. Visual domain randomization perturbs the appearance for vision-based policies — textures, lighting, camera pose. Legged locomotion leans on the former; vision-based manipulation needs both.

Technique	What it randomizes	Why it bridges the gap	Typical range
Mass / inertia DR	Link masses, payload, CoM offset	Robot's real mass is never exactly the CAD value; payloads vary	±10–30% of nominal
Friction DR	Ground & joint friction coefficients	Surfaces and joints differ wildly; the biggest sim-real gap for feet	0.4–1.25 (foot-ground μ)
Actuator / motor-gain DR	PD gains, torque limits, motor strength	Real gains drift; gearboxes lose efficiency over time	±10–25%
Latency / delay DR	Observation and action delay	Real control loops have 1–20 ms latency sim ignores by default	0–40 ms
Sensor-noise DR	IMU drift/noise, joint-encoder noise	Real sensors are noisy and biased	Gaussian, robot-specific σ
Push / disturbance injection	Random external forces on the base	Teaches recovery; produces robust balance	impulses every few seconds
Terrain randomization	Slopes, stairs, gaps, roughness (curriculum)	Generalizes locomotion beyond flat ground	progressive difficulty
Visual DR	Textures, lighting, distractors, camera pose	Closes the appearance gap for vision policies	wide, task-dependent

The failure modes sit at both extremes. Too little randomization and the policy overfits to the simulator's quirks — it exploits a friction value or contact model that doesn't exist in reality and falls over on the real floor. Too much randomization and the policy can't find any behavior that works across the whole insane range, so it learns a timid, conservative, low-performance controller — or nothing at all. Tuning the ranges is the real art, and automatic domain randomization (ADR), where the ranges expand only as the policy masters the current ones, was a major piece of OpenAI's dexterous-hand result.

Rule: Randomize the parameters you're uncertain about, proportional to your uncertainty. You know your link lengths to a millimeter — don't randomize them much. You barely know your foot-ground friction — randomize it hard. DR is a way of injecting your honest model uncertainty into training.

A complementary technique is system identification: measure the real robot to narrow the randomization ranges around the truth, then randomize around that. The best pipelines do both — identify what you can measure, randomize what you can't.

Reward shaping and the reward-hacking trap

The reward function is where you specify what you want; the policy decides how. This separation is RL's superpower and its sharpest knife. The optimizer is a literal genie: it maximizes exactly what you wrote, not what you meant.

A locomotion reward is typically a weighted sum of many terms — a "task" term (track the commanded velocity) plus a pile of "regularization" terms (penalize energy, joint-limit violations, body height deviation, foot slip, action rate, orientation tilt). Each term has a weight you tune. Getting the relative weights right is most of the work.

Reward hacking is when the policy finds a high-reward behavior that satisfies your function but violates your intent. Real examples from real projects:

A locomotion policy that vibrates a foot rapidly against the ground because the reward credited "contact" without penalizing wasteful motion.
A policy that exploits a simulator bug — sticking a foot through the floor, or harvesting energy from a contact-impulse glitch — because the sim physics permitted free reward.
A reaching policy that knocks the target off the table so it can never fail to "not be far from it," or learns to hover near a sparse-reward trigger without completing the task.
A walking policy that falls forward in a controlled way to maximize forward velocity for a moment, because the episode-termination penalty was too small to discourage it.

The defenses:

Penalize the means, not just reward the ends. Add energy, smoothness (action-rate), and joint-limit penalties. Most "natural-looking" gait reward is really these regularizers doing their job.
Use termination conditions as hard constraints. Falling, self-collision, or limit violation should end the episode with a penalty — much more reliable than trying to express "don't fall" as a soft reward term.
Watch the rendered rollouts, every time. Numbers lie; video doesn't. Half of reward bugs are obvious the instant you watch the policy.
Curriculum and command sampling. Start easy (low velocities, flat ground) and increase difficulty so the policy doesn't find a degenerate early solution and lock in.

Rule: Budget more time for reward design and debugging than for picking and tuning the algorithm. The algorithm is a solved commodity; your reward is a bespoke specification with bugs in it. Assume reward hacking is happening and go looking for it.

A note on sparse vs dense reward. Sparse reward (1 for success, 0 otherwise) is honest — it can't be gamed by definition — but it's nearly impossible to learn from on hard tasks because the policy rarely stumbles onto success. Dense (shaped) reward learns fast but invites hacking. The pragmatic answer is dense reward built carefully, plus sparse success metrics you track separately to detect when dense-reward optimization has drifted from what you actually want.

Imitation learning: BC, DAgger, and how it complements RL

Sometimes you have demonstrations — teleoperated grasps, motion-capture of a human walking, an existing MPC controller's trajectories. Imitation learning turns demonstrations into a policy, and it's a powerful complement to RL.

Behavior cloning (BC) is supervised learning: collect (state, expert-action) pairs and train the policy to predict the expert's action. It's simple, stable, and fast. Its fatal flaw is compounding error / covariate shift: the policy makes a small mistake, drifts into a state the expert never visited, has no idea what to do there, makes a bigger mistake, and spirals. A BC policy is only as good as its coverage of the states it will actually encounter.

DAgger (Dataset Aggregation) fixes covariate shift by iterating: run the current policy, collect the states it visits, ask the expert to label the correct action in those states, add them to the dataset, retrain. Over rounds the dataset comes to cover the policy's own state distribution and the compounding-error problem largely goes away. The catch is you need an expert you can query on-demand — easy if the expert is an MPC controller or a privileged-state teacher, harder if it's a human.

How they complement RL:

Warm-starting. BC the policy from demonstrations, then refine with RL. The policy starts in a reasonable region instead of flailing randomly, which is huge on tasks where random exploration almost never finds reward.
Style and reference. Motion-capture clips give a humanoid a human-like gait reference; RL then makes it robust. (Adversarial-motion-priors and similar methods reward the policy for looking like the reference distribution.)
The teacher-student recipe (next section) is itself a form of imitation — the student is DAgger-distilled from the teacher.

Rule: Use imitation to get into the right neighborhood; use RL to make it robust. Pure BC rarely survives contact with a real robot's distribution shift; pure RL from scratch wastes enormous compute exploring states demonstrations could have handed you for free.

Teacher-student & privileged learning

This is the single most important practical recipe in legged RL, and it's worth understanding precisely because it solves the perception problem that naïve sim-to-real ignores.

The problem: in simulation you know everything — the exact friction under each foot, the true contact forces, the terrain height around the robot, the disturbance pushing the base. On the real robot you know almost none of that; you have noisy joint encoders, an IMU, and maybe a depth camera. A policy trained on privileged simulator state will be brilliant in sim and useless in reality because its inputs don't exist on the hardware.

The solution is a two-stage teacher-student pipeline (the ETH Zurich / Hutter lab "learning by cheating" recipe):

Stage 1 — train the teacher. Train a policy with RL (PPO) that gets full privileged state as input: true friction, contact states, terrain map, external forces. Because its inputs are clean and complete, the teacher learns an excellent policy fast. It could never run on the real robot — that's fine, it's not meant to.

Stage 2 — distill the student. Train a student policy that uses only deployable observations — proprioception (joint angles/velocities, IMU) plus a short history of past observations and actions — to imitate the teacher's actions via supervised learning / DAgger. The history is the key: the student learns to infer the privileged information (am I on ice? did something just push me?) from the recent time series of what it can actually measure. This is implicit state estimation, learned end-to-end.

The result is a student that matches teacher performance using only onboard sensors. ANYmal's robust blind locomotion over rough terrain (Lee et al., Science Robotics, 2020) was exactly this: a teacher with terrain knowledge, distilled into a proprioception-only student that walked over rubble, mud, snow, and stairs it couldn't see, by feeling the terrain through its legs.

Rule: When the gap between sim-available and robot-available information is large, don't try to train one policy to do everything. Split it: a teacher that learns the skill with cheating inputs, and a student that learns to perceive well enough to execute it. Decoupling "learn the skill" from "learn to perceive" is why this works.

Variants add an explicit belief encoder or a recurrent student, and a related family — RMA (Rapid Motor Adaptation) — trains an adaptation module that estimates a latent "environment embedding" online, achieving the same robustness with a slightly different architecture. The common thread is: learn online estimation of the unobservable, using a history of the observable.

Landmark results: legged, dexterous, humanoid

Three lines of work define what RL can do on real robots, and they're the case studies every practitioner should know.

Legged locomotion (ETH Zurich, Hutter lab; ANYmal)

The ANYmal program turned legged RL from a curiosity into a deployable technology. The 2019 Science Robotics result (Hwangbo et al.) trained control policies in sim with a learned actuator model — a neural net mapping commanded to realized torque, capturing the series-elastic actuators' dynamics — and transferred them to the real ANYmal, achieving faster, more robust locomotion and a dynamic recovery-from-fall behavior that classical methods struggled with. The 2020 follow-up (Lee et al.) added the teacher-student recipe for blind rough-terrain locomotion. The throughline: a learned actuator model plus randomization plus teacher-student made sim-to-real reliable, and it's now the standard recipe across the industry. See legged & quadruped robot hardware.

The Isaac Gym era (2021 onward) collapsed training time: the "Legged Gym" / RSL-RL stack trains an ANYmal or Unitree-class quadruped locomotion policy in minutes to a couple of hours on one GPU. This is what made RL locomotion accessible to small teams.

Dexterous manipulation (OpenAI; Dactyl / Rubik's Cube)

OpenAI's Dactyl trained a Shadow Hand to reorient a block, and later to manipulate a Rubik's Cube one-handed, entirely in sim with PPO and massive domain randomization. The 2019 Rubik's Cube result introduced automatic domain randomization (ADR) — automatically expanding the randomization ranges as the policy improved — which produced a policy robust enough to handle a real hand wearing a rubber glove, with fingers tied together, and other perturbations it never saw in training. The lesson: extreme randomization + ADR can bridge a very hard manipulation gap, but it cost enormous compute (thousands of years of simulated experience). Dexterous manipulation remains far less sample-friendly than locomotion because contact-rich finger-object interaction is harder to simulate accurately.

Humanoid walking (Unitree, and the 2024-2026 wave)

The humanoid surge brought the legged recipe to bipeds. Unitree's H1/G1 and a wave of humanoid programs use PPO-trained locomotion policies, often with motion-capture references (adversarial motion priors / DeepMimic-style style rewards) to get human-like gaits, plus the teacher-student and randomization machinery from the quadruped world. Bipedal balance is less forgiving than quadrupedal — smaller support polygon, higher CoM — so the disturbance-rejection and recovery behaviors matter more, and the sim actuator and contact fidelity bar is higher. The 2024-2026 humanoid demos walking, climbing stairs, and recovering from shoves are overwhelmingly RL locomotion stacks. See humanoid robot hardware.

Pattern across all three: the algorithm (PPO) is the least interesting part. The wins came from the actuator model, the randomization strategy, and the teacher-student / privileged-learning structure. Copy those, not just the optimizer.

Learned vs classical control

This is the question every team argues about, so let's be concrete. Classical control here means model-based methods — PID, LQR, and especially Model Predictive Control (MPC), which optimizes a control sequence over a receding horizon against a dynamics model in real time. RL means a policy trained offline and run as a fast feedforward map.

Dimension	RL policy	MPC / classical
Model requirement	Needs a good simulator, not an analytic model	Needs an accurate online dynamics model
Contact-rich dynamics	Excellent — learns through contact	Hard — contact makes online optimization expensive/brittle
Online compute	Tiny — one forward pass (10s of µs)	Heavy — solve an optimization every control step
Reactivity / latency	Constant, low latency	Depends on solver convergence; can spike
Accuracy / precision	Approximate; no guarantees	High; can hit tight tolerances
Stability guarantees	None (empirical robustness only)	Provable (within model validity)
Interpretability	Low — a black-box net	High — you can read the cost and constraints
Constraint handling	Soft, via reward (can be violated)	Hard, explicit constraints respected
Adaptation to new task	Retrain	Re-tune cost/constraints (often faster)
Development cost	High up front (sim + reward + training)	High expertise, but well-trodden

When RL wins: the dynamics are hard to model online, contacts are numerous and discontinuous, the state/action space is high-dimensional, and you want a reactive policy with constant tiny latency. Legged locomotion over unknown terrain, dexterous in-hand manipulation, whole-body humanoid control, recovery from disturbances. MPC struggles here because solving a contact-rich optimization at 1 kHz is brutal and the model is wrong anyway.

When MPC/classical wins: the model is good, the task is accuracy-critical, constraints are hard and must never be violated, and you need stability guarantees or certification. A 6-axis arm tracing a weld seam to 0.1 mm, a CNC-like motion, a drone trajectory in free space, anything safety-rated. RL's lack of guarantees and its soft constraints are disqualifying here. See motion planning & kinematics for the classical manipulation stack.

The honest 2026 answer is hybrid. The strongest legged systems use RL for the reactive low-level policy and classical methods for high-level planning, footstep selection, or as a safety supervisor. MPC can generate references that RL tracks robustly; RL can warm-start or replace the parts of an MPC stack that the model handles badly. Treating it as RL-vs-MPC religious war misses that they're tools for different layers.

Rule: If you can write a good dynamics model and the task demands accuracy or guarantees, use MPC. If the dynamics are dominated by hard-to-model contact and you need robustness over precision, use RL. Most real robots want both, at different layers of the stack.

Deploying a policy

A trained policy is a pile of weights in a checkpoint. Getting it onto a robot, running reliably in the real-time control loop, is an embedded-systems job that ML people routinely underestimate.

Inference rate. The policy runs inside the control loop, so it must produce an action every control period. Typical rates:

Locomotion: policy at 50-100 Hz, outputting target joint positions, with a downstream PD controller running faster (200 Hz-1 kHz) to track them. This two-rate structure is standard — the policy sets targets, a stiff joint-level controller does the fast tracking.
Manipulation: 30-60 Hz for vision-conditioned policies (camera-bound), up to several hundred Hz for proprioceptive contact-rich control.

Export path. Train in PyTorch, then export to ONNX for a framework-independent, dependency-light artifact. On NVIDIA onboard compute (Jetson Orin), compile the ONNX to TensorRT for lower latency and FP16/INT8 if you need it. For CPU deployment, ONNX Runtime is plenty fast for small MLPs.

Onboard compute reality check. This surprises people: most locomotion policies do not need a GPU on the robot. A typical policy is a 2-3 layer MLP with a few hundred to ~1024 units per layer — on the order of 0.1-2 million parameters. A forward pass is a handful of small matrix multiplies that run in tens of microseconds on a modern CPU core. You add a GPU onboard only when the policy consumes images (vision-based manipulation, exteroceptive locomotion with a learned terrain encoder).

# Locomotion policy inference cost (rough)
# Net: MLP [obs=48] -> 512 -> 256 -> 128 -> [act=12]
# FLOPs per forward pass ≈ 2 * (48*512 + 512*256 + 256*128 + 128*12)
#                        ≈ 2 * 197k  ≈  0.4 MFLOP
#
# At 50 Hz that's 20 MFLOP/s — utterly trivial.
# A single CPU core (~10s of GFLOP/s) runs this in ~tens of µs.
# => No onboard GPU needed for proprioceptive locomotion.

Control-loop integration. The policy is one block in a hard-real-time loop. It must: read the latest observation (assembled to exactly match the sim observation — same order, same scaling, same history length), run inference deterministically (no dynamic allocation, no GC pauses), and write target positions to the joint controllers, all within the period. A jitter spike that misses the deadline can destabilize a balancing robot. Run the policy thread at real-time priority, preallocate everything, and never let it touch the network or filesystem in the hot path.

Rule: Your deployment observation must be byte-for-byte equivalent in meaning to your training observation — same fields, same units, same normalization, same history stacking, same action scaling and clipping. The most common deployment bug isn't the network; it's a mismatched observation or action transform between sim and robot. Write the observation-assembly code once and share it between sim and hardware.

On-robot fine-tuning, safety & limitations

On-robot fine-tuning sounds appealing — close the last bit of the sim-to-real gap by learning on the real machine — and it's mostly a trap. Real data is slow (one robot, real-time), exploration is dangerous (a half-trained policy flails), and the sample-hungry algorithms that work in sim (PPO) are exactly wrong here. If you must, use an off-policy method (SAC) with a tiny step budget, initialize from the sim policy, constrain exploration noise hard, and run an outer classical safety controller that overrides anything dangerous. In practice, most 2026 production stacks deploy a frozen sim-trained policy and improve it by improving the simulator, not by learning on hardware.

Safety is the hard limitation that keeps RL out of certified, high-consequence applications. A learned policy has:

No stability guarantees. Robustness is empirical — it worked across your randomization and test cases — not proven. Out-of-distribution inputs can produce arbitrary outputs.
Soft constraints. "Don't exceed joint limits" lives in the reward and can be violated, unlike MPC's hard constraints.
No interpretability. When it fails, you can't read off why from the weights.

The mitigations are architectural, not algorithmic: action clamping and rate limiting at the joint level (a learned policy should never be able to command beyond hardware limits), a classical safety supervisor / runtime monitor that detects bad states (excessive tilt, limit approach) and triggers a safe fallback (damping-to-stop, sit-down), extensive out-of-distribution testing, and conservative deployment (don't run the policy in regimes far from its training distribution). For functional-safety context this is the same defense-in-depth philosophy as any robot — the learned policy is treated as an untrusted component wrapped in trusted guards.

Other limitations worth stating plainly:

Sim-to-real gap never fully closes. You manage it; you don't eliminate it. Some tasks (precise force control, deformable objects, complex friction) have gaps too large for current sim.
Reward specification is hard. As covered, the reward is a buggy spec and the optimizer exploits it.
Generalization is narrow. A policy trained for one robot and one task transfers poorly to others. There's no free lunch across embodiments yet (large robot-foundation-model efforts are early).
Reproducibility is rough. RL training is seed-sensitive; "it worked once" is not the same as "it works."

Rule: Treat a learned policy as an untrusted component. Wrap it in hard joint-level limits and a classical safety monitor that can take over. Never let the network be the only thing standing between your robot and a hardware-damaging command.

Data & compute budget

The good news for robotics RL: by the standards of large language models, the compute is small. The expensive resource is engineer time, not FLOPs.

Policy size. Locomotion policies are tiny: 2-3 hidden layers, a few hundred K to ~2M parameters. Manipulation and vision-conditioned policies are larger (CNN/transformer front-ends) but still modest. These are not big models.

Training experience. Locomotion needs roughly 1-5 billion simulation steps. Dexterous manipulation with heavy randomization can need far more (OpenAI's Rubik's Cube consumed the equivalent of thousands of simulated years). Most tasks land in the billions-of-steps range.

Wall-clock and hardware. With massively parallel GPU sim:

Quadruped locomotion (flat + rough terrain): ~10 minutes to ~3 hours on a single modern GPU.
Humanoid locomotion: a few hours to ~1 day on one GPU, more if vision-conditioned.
Dexterous manipulation: GPU-days, sometimes a small cluster, because the sim is heavier and the randomization wider.

The cost reality: a flagship locomotion policy costs single-digit to low-tens of dollars of GPU time. The real budget is the weeks of engineer time spent on the simulator's actuator model, the reward function, the observation design, and the sim-to-real debugging. Optimize for engineer iteration speed, not GPU cost. A faster sim that lets you run ten experiments a day is worth more than a marginally better algorithm.

Rule: Don't buy a cluster for robot RL; buy one good GPU and a fast simulator, and spend the saved money on the engineer who designs the reward and the actuator model. That's where the actual difficulty — and the actual cost — lives.

Frequently asked questions

Do I need to learn on the real robot? Almost never in 2026. The dominant paradigm is train-in-sim, deploy-frozen. Real-robot learning is slow, dangerous, and sample-limited. Spend the effort on simulator fidelity and domain randomization instead. On-robot fine-tuning is a niche, last-resort technique fenced by heavy safety guards.

PPO or SAC — which should I start with? If you have a massively parallel simulator (Isaac Lab), start with PPO; it's the most likely to give you a working policy on the first serious attempt and it scales to thousands of environments. If your data is expensive (single sim, real robot, slow sim), use SAC for its sample efficiency. TD3 is a deterministic-policy alternative to SAC; DDPG is obsolete — skip it.

Why does PPO dominate locomotion if it's sample-inefficient? Because with massively parallel sim, samples are nearly free — you generate hundreds of thousands of steps per second. PPO's robustness and stability then matter far more than its sample efficiency. Sample-inefficiency only hurts when data is scarce, which sim isn't.

What's the single most important factor for sim-to-real success? Simulator fidelity, especially the actuator model, plus appropriate domain randomization. The RL algorithm is rarely the bottleneck. A learned or carefully identified actuator model that captures motor delay and torque limits is the highest-leverage thing you can build.

What is teacher-student / privileged learning and why does everyone use it? You train a teacher policy with access to information available only in sim (true friction, contact forces, terrain map), which lets it learn the skill quickly. Then you distill it into a student that uses only onboard sensors plus a short observation history, so the student learns to infer the privileged information online. It decouples learning the skill from learning to perceive, and it's the standard recipe for robust legged locomotion.

Is my reward function going to get hacked? Yes, assume it will. The optimizer maximizes exactly what you wrote, not what you meant. Penalize the means (energy, smoothness, limits), use hard termination conditions for failures, and watch the rendered rollouts — most reward bugs are obvious on video and invisible in the reward curve.

Can RL replace MPC and classical control? No, and you shouldn't want it to. RL wins on contact-rich, hard-to-model, high-dimensional tasks; MPC and classical control win on well-modeled, accuracy-critical, constraint-hard, certification-needing tasks. The best systems are hybrids that use each where it's strong. Don't put a learned policy on a precision weld seam.

How much compute do I need? Less than you think. A quadruped locomotion policy trains in minutes to hours on a single modern GPU; the policy itself is a few-million-parameter MLP. Dexterous manipulation is heavier (GPU-days). The expensive resource is engineer time on reward and sim design, not GPU hours.

Do I need a GPU on the robot? For proprioceptive locomotion, no — the policy is a small MLP that runs in tens of microseconds on a CPU core. You need onboard GPU only when the policy consumes images (vision-based manipulation, learned terrain encoders from depth/camera). See robot sensors for what those inputs look like.

What framework should I use in 2026? Isaac Lab (NVIDIA) is the dominant massively-parallel framework, built on Isaac Sim, succeeding the original Isaac Gym. MuJoCo (now with the GPU-accelerated MJX) and Brax are strong alternatives, especially for research and lighter-weight setups. For the RL algorithm code, RSL-RL (PPO, from ETH) and Stable-Baselines3 / CleanRL are common. See robot simulation & digital twins.

Why does my policy work in sim but fall on the real robot? The usual suspects, in order: (1) observation/action mismatch between sim and hardware — wrong order, scaling, units, or history length; (2) actuator model in sim doesn't capture real motor delay/limits; (3) insufficient or wrong domain randomization, so the policy overfit sim; (4) control-loop latency or jitter on the robot the policy never saw. Check the observation pipeline first — it's the most common bug.

How do imitation learning and RL fit together? Use imitation (behavior cloning, DAgger) to get the policy into a sensible region or to provide a style reference (e.g., human motion-capture for humanoid gait), then use RL to make it robust and high-performance. Pure BC suffers compounding error and rarely survives the real distribution; pure from-scratch RL wastes compute exploring states demonstrations could have provided.