ROS 2 for Robotics: The Ultimate Guide

Q: Why don't my messages arrive even though the publisher is running?

The overwhelmingly likely cause is a QoS mismatch: a `RELIABLE` subscriber will not connect to a `BEST_EFFORT` publisher. Run `ros2 topic info /your_topic -v` and compare the QoS on both ends. Sensor topics are usually best-effort; subscribe with the sensor QoS profile.

Q: Do I have to use C++, or is Python fine?

Both are first-class. Use `rclpy` for glue, configuration, prototyping, and non-time-critical nodes; use `rclcpp` for anything on a hot path or where you need real control over allocation and the executor. Most production robots run a mix.

ROS 2 is the thing your robot's software is probably built on, and the thing that will quietly eat a third of your debugging time. It is not an operating system — it is a middleware, a build system, a set of conventions, and a community of ten-thousand-plus packages that mostly assume you are running the same setup as the person who wrote them. When it works, you wire a LiDAR driver to a SLAM node to a planner to a motor controller in an afternoon. When it doesn't, you are reading DDS discovery logs at midnight wondering why two nodes on the same machine can't see each other.

This guide is for engineers who already know what a robot is and now have to make the software stack behave: people moving over from ROS 1, makers scaling a hobby project into something that has to run for months, integrators stitching vendor stacks together. We will cover what ROS actually is, why ROS 2 was a ground-up rewrite rather than a patch, the core graph concepts with real rclpy code, the DDS layer and the QoS settings that cause most of the "my messages don't arrive" tickets, colcon and the overlay model, ros2_control, Nav2, MoveIt 2, simulation, the real-time story, and what it takes to put this in a product. Real specifics throughout: Jazzy, Kilted, Rolling; Fast DDS, Cyclone DDS, Zenoh; the executors, the launch system, micro-ROS, SROS2.

The take: ROS 2 is the right default for almost any robot that is more than one microcontroller, but it buys you the ecosystem and the tooling, not determinism — the hard-real-time loop still has to live below it, and 80% of new-user pain is three QoS knobs and one DDS discovery setting. Learn DDS early, treat ROS 2 as the orchestration layer over a deterministic control layer, and most of the mystery evaporates.

Companion reading: real-time robot control, motor controllers & FOC, motion planning & kinematics, and mobile robots: AMRs & AGVs.

Key takeaways
What ROS is and isn't
ROS 1 to ROS 2: why the rewrite
Core concepts & the compute graph
DDS & the middleware layer
QoS deep-dive
Build system & workspaces
ros2_control
Navigation: Nav2
Manipulation: MoveIt 2
Simulation & tooling
Real-time & determinism
Production ROS 2 & should you use it
Frequently asked questions

Key takeaways

ROS is not an operating system. It is a middleware (pub/sub, services, actions) plus a build system (colcon/ament), a packaging convention, and a huge package ecosystem. It runs on Linux (and increasingly other targets). The "OS" in the name is historical.
ROS 2 is a rewrite, not an upgrade. The master-based ROS 1 graph was replaced with a fully decentralized, DDS-based one. No roscore, peer-to-peer discovery, configurable reliability, security, and a real-time-friendly C++ core (rclcpp).
DDS is the part that surprises everyone. ROS 2 talks to the network through a pluggable RMW layer over DDS (or Zenoh). The default vendor and the QoS profile you pick decide whether your messages arrive, how discovery scales, and how much your sensor stream costs in CPU.
QoS mismatch is the #1 new-user bug. A reliable subscriber will not receive from a best-effort publisher; the reverse pairing (best-effort subscriber, reliable publisher) does connect. Sensor data wants best-effort; commands and TF want reliable. Get the three knobs (reliability, durability, history) right and most "no messages" mysteries vanish.
Fast DDS, Cyclone DDS, Zenoh all coexist in 2026. Fast DDS is the Jazzy/Kilted default; Cyclone DDS is the pragmatic favorite for many fleets; rmw_zenoh is the rising option that fixes large-graph discovery and WAN/multi-robot pain.
colcon + the overlay model is how you build. You source /opt/ros/jazzy/setup.bash (the underlay), build your workspace with colcon build, source install/setup.bash (the overlay), and your packages shadow the system ones.
ros2_control is the hardware abstraction layer. A real-time controller_manager runs a read → update → write loop; hardware interfaces talk to your drives; controllers (diff-drive, joint trajectory, etc.) are swappable at runtime. See real-time control and motor controllers & FOC.
Nav2 and MoveIt 2 are the flagship application stacks. Nav2 drives mobile bases with behavior trees over costmaps, planners, and controllers. MoveIt 2 plans arm motion with kinematics, collision checking, and a planning scene. Both are production-grade and both are heavy.
ROS 2 nodes are not hard-real-time. The executor/callback model, default allocator, and DDS make soft-real-time achievable but not guaranteed. The deterministic loop lives in ros2_control, in micro-ROS on an MCU, or below ROS entirely.
micro-ROS puts ROS 2 on microcontrollers. A client library + agent bridge lets an STM32 or ESP32 publish/subscribe into the same graph — the right place for the kHz control loop.
Pin your distro. Jazzy Jalisco (LTS, May 2024, supported to 2029) is the 2026 production default; Kilted Kaiju (May 2025, non-LTS) and Rolling are for the bleeding edge. Match your Ubuntu LTS to your ROS distro.
Security is opt-in via SROS2. DDS-Security gives you authentication, encryption, and access control, but it is off by default and adds latency and operational weight.

What ROS is and isn't

The name is a lie of history. ROS — the Robot Operating System — is not an operating system. It does not schedule processes, manage memory, or boot your machine. Linux does that. What ROS gives you is the layer above the OS where robot code lives: a way for many programs to find each other and exchange data, a build system to compile them, a packaging convention so you can share them, and an ecosystem of pre-built capabilities so you do not write your own SLAM, your own TF math, or your own LiDAR driver.

Concretely, ROS is three things wearing one name.

The plumbing. A publish/subscribe message bus, plus request/response services and long-running actions, that lets independent processes ("nodes") talk over named channels ("topics") using strongly-typed messages. This is the part people mean when they say "ROS." In ROS 2 the plumbing is DDS (more on that below).

The tooling. colcon to build, ros2 CLI to introspect and launch, rviz2 to visualize, ros2 bag to record and replay, tf2 to track coordinate frames, the launch system to bring up dozens of nodes with one command. This is half the actual value — you can replace the message bus, but rewriting tf2 and rviz2 is a multi-year project nobody wants.

The community. Ten thousand-plus packages on the ROS index, vendor drivers for most sensors and arms, and the convention that everyone's code uses the same message types (sensor_msgs/Image, geometry_msgs/Twist, nav_msgs/Odometry). The standardization is the moat. A sensor_msgs/LaserScan from a Hokuyo and one from a Velodyne look identical to your SLAM node.

Rule of thumb: if your robot is one microcontroller running one control loop, you do not need ROS. If it has a LiDAR, a planner, a base, and a manipulator that all have to share data and you want off-the-shelf navigation, you almost certainly do.

What ROS is not: it is not real-time by itself, it is not a guarantee of message delivery (that is a QoS choice you make), and it is not a substitute for understanding your hardware. It is orchestration. The robot still has to be a good robot underneath.

ROS 1 to ROS 2: why the rewrite

ROS 1 grew out of work that began in 2007 and matured at Willow Garage (the 1.0 release followed in 2010), and it powered most of academic robotics for fifteen years. It had one architectural decision that aged badly: a central roscore (the "master") that every node registered with to find every other node. The master was a single point of failure, a single point of discovery, and a thing that did not exist on the robot until someone started it.

ROS 2 is a from-scratch rewrite — not an incremental upgrade — explicitly designed to fix the things that kept ROS 1 out of products.

No master, decentralized discovery. ROS 2 nodes find each other peer-to-peer over the network using DDS discovery. There is no roscore. Kill any node and the rest keep talking. This is the single biggest architectural change and the reason multi-robot and fault-tolerant systems became practical.

DDS as the transport. Instead of ROS 1's custom TCPROS/UDPROS, ROS 2 sits on the Data Distribution Service — a mature OMG industrial standard already used in aerospace, defense, and finance. You inherit configurable reliability, QoS, and a real ecosystem of vendors.

Real-time-friendly core. rclcpp is written so the hot path can avoid allocations and locks, which makes soft-real-time achievable. ROS 1's Python-and-C++ core never tried.

Multi-robot and multi-machine by design. Discovery domains, namespacing, and DDS partitions make running ten robots on one network a configuration problem, not a research project.

Security. DDS-Security (exposed as SROS2) adds authentication, encryption, and access control. ROS 1 had nothing — every topic was world-readable on the LAN.

Production focus. Lifecycle (managed) nodes, deterministic launch, component composition into a single process, and a cross-platform build (Linux, Windows, macOS, RTOS via micro-ROS).

The migration is not free. APIs changed, the build system changed (catkin → ament/colcon), launch files moved from XML-only to Python/XML/YAML, and the conceptual model now includes QoS, which did not exist in ROS 1.

EOL context: ROS 1 Noetic — the last ROS 1 distro — reached end of life in May 2025, tied to Ubuntu 20.04's EOL. There are no more ROS 1 releases. If you are starting anything new in 2026, it is ROS 2. If you are maintaining a Noetic system, you are on borrowed time and unsupported.

Here is the practical comparison.

Aspect	ROS 1 (Noetic, EOL May 2025)	ROS 2 (Jazzy/Kilted, 2026)
Discovery	Central `roscore` master	Decentralized, DDS peer-to-peer
Transport	TCPROS / UDPROS (custom)	DDS (Fast DDS, Cyclone) or Zenoh
QoS	None — TCP reliable only	Configurable per topic (reliability/durability/...)
Real-time	Not designed for it	RT-friendly C++ core, RT executors
Multi-robot	Painful, namespace hacks	Domains, partitions, native
Security	None	SROS2 / DDS-Security (opt-in)
Build system	catkin	ament + colcon
Launch	XML only	Python / XML / YAML
Client libs	roscpp, rospy	rclcpp, rclpy (over rcl/rmw)
MCU support	rosserial (limited)	micro-ROS (real client)
Lifecycle nodes	No	Yes (managed nodes)
Status	End of life	Active, LTS available

Core concepts & the compute graph

A running ROS 2 system is a graph: a set of nodes connected by topics, services, and actions. Understanding the graph is understanding ROS.

Nodes. A node is a unit of computation — usually one process, sometimes many composed into one process. A LiDAR driver is a node. A SLAM algorithm is a node. Your motor bridge is a node. Nodes have names, live in namespaces, and own publishers, subscribers, services, and parameters.

Topics and pub/sub. Topics are named, typed, many-to-many channels. A publisher writes sensor_msgs/PointCloud2 to /points; any number of subscribers read it. Anonymous and decoupled — the publisher does not know who listens. This is how streaming data (sensors, odometry, transforms) flows. Topics are the workhorse; 90% of your data moves on them.

Services. Synchronous request/response. A client calls /spawn, blocks (or awaits), gets one response. Use for short, occasional queries — "give me the current map," "switch to mode 2." Do not use for anything long-running; that is what actions are for.

Actions. Long-running, cancelable, goal-oriented calls with feedback. "Navigate to (x, y)" is an action: you send a goal, get periodic feedback (distance remaining), can cancel, and eventually get a result. Built on top of topics and services. Nav2 and MoveIt 2 are action-driven.

Parameters. Per-node configuration values (an int, a string, a double, an array) that can be set at launch and changed at runtime. Your controller's gains, a camera's frame rate, a topic remap — all parameters.

The compute graph is all of this together, plus the discovery that wires it. You introspect it live with the CLI:

$ ros2 node list
/lidar_driver
/slam_toolbox
/controller_manager

$ ros2 topic list -t
/cmd_vel [geometry_msgs/msg/Twist]
/odom [nav_msgs/msg/Odometry]
/scan [sensor_msgs/msg/LaserScan]
/tf [tf2_msgs/msg/TFMessage]

$ ros2 topic hz /scan
average rate: 9.998
	min: 0.099s max: 0.101s std dev: 0.00038s window: 10

$ ros2 topic echo /odom --once
$ ros2 node info /slam_toolbox      # publishers, subscribers, services

A minimal publisher and subscriber in rclpy — the canonical "hello robot." This is the shape of nearly every ROS 2 Python node you will write:

import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node
from std_msgs.msg import String


class Talker(Node):
    def __init__(self):
        super().__init__("talker")
        self.pub = self.create_publisher(String, "chatter", 10)
        self.timer = self.create_timer(0.5, self.tick)   # 2 Hz
        self.i = 0

    def tick(self):
        msg = String()
        msg.data = f"hello {self.i}"
        self.pub.publish(msg)
        self.get_logger().info(f"published: {msg.data}")
        self.i += 1


class Listener(Node):
    def __init__(self):
        super().__init__("listener")
        self.sub = self.create_subscription(String, "chatter", self.cb, 10)

    def cb(self, msg):
        self.get_logger().info(f"heard: {msg.data}")


def main():
    rclpy.init()
    # In practice each node runs in its own process; one executor spins both here for brevity.
    talker = Talker()
    listener = Listener()
    executor = SingleThreadedExecutor()
    executor.add_node(talker)
    executor.add_node(listener)
    try:
        executor.spin()
    except KeyboardInterrupt:
        pass  # Ctrl-C ends the demo
    finally:
        talker.destroy_node()
        listener.destroy_node()
        if rclpy.ok():
            rclpy.shutdown()


if __name__ == "__main__":
    main()

The 10 passed to create_publisher/create_subscription is the QoS history depth — a shortcut for "keep the last 10 messages." That single integer is hiding the entire QoS system, which is where we go next.

The C++ side (rclcpp) mirrors this exactly: a rclcpp::Node, create_publisher<std_msgs::msg::String>, a create_wall_timer, and rclcpp::spin. Use rclpy for glue, configuration, and prototyping; use rclcpp for anything on a hot path — it is faster and gives you real control over the executor and allocation.

DDS & the middleware layer

This is the layer that costs new users the most time, so it is worth getting right.

ROS 2 does not implement its own networking. It defines an abstract middleware interface — the RMW (ROS MiddleWare) layer — and plugs a real implementation in behind it. By default that implementation is DDS. Your code calls rclpy/rclcpp, which call rcl (the common C client library), which calls the rmw interface, which calls the DDS vendor. Swap the vendor with one environment variable; your code does not change.

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp   # or rmw_fastrtps_cpp, rmw_zenoh_cpp

Discovery. When a node starts, DDS announces itself on the network (by default via multicast) and learns about everyone else. There is no central registry; this is the "no master" promise made real. Nodes on the same ROS_DOMAIN_ID (default 0) discover each other; different domains are invisible to each other, which is how you isolate two robots on one LAN. Valid IDs run from 0 to 232, but 0–101 is the range that stays clear of the operating system's ephemeral ports on all common platforms, so pick from there.
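The 0–232 ceiling falls out of how DDS maps a domain ID to UDP ports. A minimal sketch of that arithmetic, assuming the default RTPS port parameters (port base 7400, domain gain 250, participant gain 2, offsets 0/10/1/11). DDS vendors let you override these values, and the function names are illustrative, not part of any ROS API:

```python
# Default RTPS port-mapping constants (assumed; vendors allow overriding them).
PORT_BASE = 7400
DOMAIN_GAIN = 250
PARTICIPANT_GAIN = 2
D0, D1, D2, D3 = 0, 10, 1, 11   # discovery mcast, discovery ucast, user mcast, user ucast
MAX_UDP_PORT = 65535


def discovery_multicast_port(domain_id: int) -> int:
    """Port every participant in the domain listens on for discovery announcements."""
    return PORT_BASE + DOMAIN_GAIN * domain_id + D0


def user_unicast_port(domain_id: int, participant_id: int) -> int:
    """Per-participant port for user data; the highest port a participant needs."""
    return PORT_BASE + DOMAIN_GAIN * domain_id + D3 + PARTICIPANT_GAIN * participant_id


def max_domain_id() -> int:
    """Largest domain ID whose first participant still fits in a 16-bit port number."""
    domain_id = 0
    while user_unicast_port(domain_id + 1, 0) <= MAX_UDP_PORT:
        domain_id += 1
    return domain_id


if __name__ == "__main__":
    for d in (0, 1, 42, 232):
        print(d, discovery_multicast_port(d), user_unicast_port(d, 0))
    print("max domain id:", max_domain_id())
```

Each domain gets its own block of 250 ports, which is also why two domains never hear each other's discovery traffic. Higher domain IDs land in port ranges the OS may hand out as ephemeral ports, so low IDs are the safer choice.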

Discovery is also where large graphs hurt. Classic DDS does an all-to-all handshake, so a graph with hundreds of nodes can spend serious CPU and bandwidth just on discovery chatter. Fast DDS offers a Discovery Server (a hub-and-spoke discovery broker) to fix this; Zenoh sidesteps it with a router model. If your ros2 node list takes seconds or your CPU sits at a baseline load with nothing publishing, discovery is the suspect.

Rule: isolate robots and dev machines by ROS_DOMAIN_ID. Two engineers on the same office LAN with the default domain 0 will see each other's nodes and topics, and one person's test publisher can end up driving the other's robot.

The three RMW implementations that matter in 2026:

RMW	Default in	Strengths	Watch out for
Fast DDS (eProsima) `rmw_fastrtps_cpp`	Jazzy, Kilted (tier-1 default)	Mature, feature-rich, Discovery Server, big-graph options, shared-memory transport	XML-heavy tuning; default discovery scales poorly without the server
Cyclone DDS (Eclipse) `rmw_cyclonedds_cpp`	(tier-1, common fleet choice)	Lean, predictable, simple to tune, great latency; favored by many production teams	Fewer exotic features; some configs need OS-level multicast/NIC tuning
Zenoh (Eclipse) `rmw_zenoh_cpp`	Rolling/Kilted (rising, officially supported)	Solves large-graph discovery, router model, excellent over WAN / multi-robot / lossy links	Newer in ROS; bridge/router is an extra moving part to deploy

Which to pick. Stay on the distro default (Fast DDS) unless you have a reason. The two common reasons: (1) you have a large or flaky network and discovery is the bottleneck — move to rmw_zenoh or Fast DDS Discovery Server; (2) you want lean, predictable latency on a single robot and like simple config — Cyclone DDS is the pragmatic pick many fleets settle on. All three are real, supported choices in 2026; Zenoh is the one to watch because it directly addresses the multi-robot and WAN cases classic DDS handles badly.

The key mental model: DDS is doing a lot of work you can't see, and its behavior is governed by QoS.

QoS deep-dive

Quality of Service is the set of per-topic policies that decide how messages are delivered. ROS 1 had exactly one behavior (TCP, reliable, in-order). ROS 2 makes it a choice — which is powerful and which causes the single most common new-user bug: a publisher and subscriber with incompatible QoS silently fail to connect.

The three policies you touch constantly:

Reliability.

RELIABLE — DDS retransmits until delivery is confirmed. Use for commands, transforms, anything where a dropped message breaks behavior.
BEST_EFFORT — fire and forget, no retransmit. Use for high-rate sensor data where the next sample is along in 10 ms anyway and you would rather drop than buffer.

Durability.

VOLATILE — subscribers only get messages published after they join.
TRANSIENT_LOCAL — the publisher keeps the last N messages and delivers them to late-joining subscribers. This is how "latched" topics work: a map, a robot description, a static transform published once at startup still reaches a node that connects a minute later. The subscriber has to request TRANSIENT_LOCAL as well; a VOLATILE subscriber connects but never receives the stored samples.

History.

KEEP_LAST (depth N) — keep the last N samples. The integer 10 in create_publisher(..., 10) is KEEP_LAST depth 10.
KEEP_ALL — keep everything (bounded by resource limits). Rarely what you want.

Plus two you reach for in real systems: Deadline (the maximum expected gap between messages — violated, you get a callback, useful for detecting a dead sensor) and Liveliness (a heartbeat contract — declare a node dead if it stops asserting liveliness within a lease duration).
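The Deadline policy is easiest to reason about as a watchdog on the gap since the last sample. The sketch below is a plain-Python model of that idea, not the rclpy event API; the class name and the explicit timestamps are assumptions made so the behavior is easy to follow and test:

```python
class DeadlineWatchdog:
    """Flags a stream as late when no sample arrived within `deadline_s` seconds."""

    def __init__(self, deadline_s: float, start_time: float = 0.0) -> None:
        self.deadline_s = deadline_s
        self.last_stamp = start_time

    def on_message(self, stamp: float) -> None:
        """Record the arrival time of the newest sample."""
        self.last_stamp = stamp

    def missed(self, now: float) -> bool:
        """True if the gap since the last sample exceeds the deadline."""
        return (now - self.last_stamp) > self.deadline_s


if __name__ == "__main__":
    # A 10 Hz sensor with a 150 ms deadline that goes quiet after t = 0.1 s.
    watchdog = DeadlineWatchdog(deadline_s=0.15)
    watchdog.on_message(0.1)
    for now in (0.2, 0.3, 0.5):
        print(f"t={now:.1f}s missed={watchdog.missed(now)}")
```

With real QoS the middleware does this bookkeeping and raises an event on both the publisher and subscriber side, so a node can react to a dead sensor without polling.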

The #1 QoS rule: compatibility follows the request-vs-offered model. A subscriber requesting RELIABLE will not connect to a BEST_EFFORT publisher (it is asking for more than offered). A BEST_EFFORT subscriber will connect to a RELIABLE publisher. When messages "don't arrive," run ros2 topic info /topic -v and compare the QoS on both ends before you touch anything else.
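The request-vs-offered rule can be written down as an ordering check: a connection forms only if the publisher offers at least what the subscriber requests. The sketch below is a plain-Python model covering reliability and durability only. The enum values are chosen here to express "weaker < stronger" and are not rclpy's own enum values:

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Reliability(IntEnum):
    BEST_EFFORT = 0
    RELIABLE = 1


class Durability(IntEnum):
    VOLATILE = 0
    TRANSIENT_LOCAL = 1


@dataclass(frozen=True)
class Qos:
    reliability: Reliability
    durability: Durability


def incompatibilities(offered: Qos, requested: Qos) -> List[str]:
    """Policies on which the publisher offers less than the subscriber requests."""
    problems = []
    if offered.reliability < requested.reliability:
        problems.append("reliability")
    if offered.durability < requested.durability:
        problems.append("durability")
    return problems


if __name__ == "__main__":
    sensor_pub = Qos(Reliability.BEST_EFFORT, Durability.VOLATILE)
    default_sub = Qos(Reliability.RELIABLE, Durability.VOLATILE)
    sensor_sub = Qos(Reliability.BEST_EFFORT, Durability.VOLATILE)
    print("default sub vs sensor pub:", incompatibilities(sensor_pub, default_sub))
    print("sensor sub vs sensor pub: ", incompatibilities(sensor_pub, sensor_sub))
```

An empty list means the pair connects. The classic vendor-camera case shows up as a single "reliability" entry, which is exactly what comparing the two ends in ros2 topic info -v reveals.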

ROS 2 ships named profiles so you do not hand-build these. The important ones:

Profile	Reliability	Durability	History	Use for
Default (`rclcpp::QoS(10)`)	RELIABLE	VOLATILE	KEEP_LAST 10	General topics, commands
Sensor data	BEST_EFFORT	VOLATILE	KEEP_LAST 5	LiDAR, camera, IMU at high rate
Services / parameters	RELIABLE	VOLATILE	KEEP_LAST	RPC-style calls
TF (`/tf`)	RELIABLE	VOLATILE	KEEP_LAST 100	Transform broadcasts
TF static (`/tf_static`)	RELIABLE	TRANSIENT_LOCAL	KEEP_LAST	Static transforms, latched

Picking the profile in rclpy:

from nav_msgs.msg import OccupancyGrid
from rclpy.qos import QoSProfile, ReliabilityPolicy, DurabilityPolicy, HistoryPolicy
from sensor_msgs.msg import Image

# The create_* calls below use self, so they belong inside a Node subclass
# (typically in __init__), with self.cb defined as the image callback.

# A camera at 30 Hz: drop is fine, latency matters.
sensor_qos = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,
    durability=DurabilityPolicy.VOLATILE,
    history=HistoryPolicy.KEEP_LAST,
    depth=5,
)
self.create_subscription(Image, "/camera/image_raw", self.cb, sensor_qos)

# A latched map: a node that joins late must still get it.
# Late-joining subscribers must request TRANSIENT_LOCAL as well.
map_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
    history=HistoryPolicy.KEEP_LAST,
    depth=1,
)
self.create_publisher(OccupancyGrid, "/map", map_qos)

The single most useful habit: match the publisher's profile. When you subscribe to a vendor's camera topic and get nothing, the vendor almost certainly published BEST_EFFORT and you defaulted to RELIABLE. Use the sensor profile. The same logic governs reading any sensor stream — see the robot sensors guide for which sensor classes tolerate drops (LiDAR, depth) and which do not (encoder counts, safety signals).

Build system & workspaces

ROS 2 builds with colcon, a meta-build tool that drives the underlying build systems (ament_cmake for C++, ament_python for pure Python) across all packages in a workspace, resolving build order from declared dependencies.
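Resolving build order from declared dependencies is a topological sort. The sketch below illustrates the idea with Python's standard graphlib; it is not colcon's implementation, and the package names are hypothetical:

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical workspace: package -> packages it depends on (as declared in package.xml).
WORKSPACE: Dict[str, Set[str]] = {
    "my_robot_msgs": set(),
    "my_robot_driver": {"my_robot_msgs"},
    "my_robot_control": {"my_robot_msgs", "my_robot_driver"},
    "my_robot_bringup": {"my_robot_driver", "my_robot_control"},
}


def build_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every package comes after all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(build_order(WORKSPACE))
    try:
        build_order({"a": {"b"}, "b": {"a"}})
    except CycleError as err:
        print("circular dependency:", err.args[1])
```

A missing dependency declaration in package.xml is the usual reason a workspace builds on one machine and fails on another: without the edge in the graph, the build order is only right by luck.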

Package. The unit of distribution. A directory with a package.xml (metadata + dependencies) and either a CMakeLists.txt (C++) or setup.py/setup.cfg (Python). One package = one logical chunk: a driver, a set of nodes, a message definition set.

Workspace. A directory with a src/ folder full of packages. You build from the workspace root:

$ mkdir -p ~/ws/src && cd ~/ws
$ git clone https://github.com/example/my_robot_pkg src/my_robot_pkg
$ rosdep install --from-paths src --ignore-src -r -y   # pull deps
$ colcon build --symlink-install
$ source install/setup.bash
$ ros2 launch my_robot_pkg bringup.launch.py

--symlink-install symlinks Python files and resources instead of copying, so edits to an existing Python node take effect without rebuilding. New files, new entry points, and anything compiled still need a colcon build. Indispensable during development.

The overlay model is the part worth internalizing. ROS layers environments:

Underlay: the system install, source /opt/ros/jazzy/setup.bash. This puts the entire distro on your path.
Overlay: your workspace, source ~/ws/install/setup.bash. Packages here shadow same-named packages in the underlay.

You can stack overlays. This is how you patch one package without rebuilding the world: clone just that package into a new workspace, build it, source it last, and your version wins. It is also how you create a "did I forget to source?" bug, which is the second most common new-user issue after QoS mismatch. If ros2 run my_pkg my_node reports that the package was not found, the usual cause is a shell where the overlay was never sourced.
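Shadowing is a first-match lookup over an ordered list of install prefixes. The sketch below models that with a dictionary standing in for the filesystem. The real lookup goes through the ament index and environment variables such as AMENT_PREFIX_PATH, and the paths and package sets here are hypothetical:

```python
from typing import Dict, List, Optional, Set


def resolve_package(name: str, prefix_path: List[str],
                    installed: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first prefix on the path that provides `name`, or None if none does."""
    for prefix in prefix_path:
        if name in installed.get(prefix, set()):
            return prefix
    return None


# Hypothetical contents of an underlay and one overlay workspace.
INSTALLED = {
    "/opt/ros/jazzy": {"nav2_bringup", "slam_toolbox"},
    "/home/me/ws/install": {"slam_toolbox", "my_robot_pkg"},
}

if __name__ == "__main__":
    underlay_only = ["/opt/ros/jazzy"]
    with_overlay = ["/home/me/ws/install", "/opt/ros/jazzy"]   # sourced last, searched first
    for pkg in ("slam_toolbox", "nav2_bringup", "my_robot_pkg"):
        print(pkg, resolve_package(pkg, underlay_only, INSTALLED),
              resolve_package(pkg, with_overlay, INSTALLED))
```

The None result for my_robot_pkg in the underlay-only shell is the "package not found" symptom, and the changed answer for slam_toolbox is the patched copy winning over the system one.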

Rule: put source /opt/ros/jazzy/setup.bash in your shell profile; source the workspace overlay manually per-shell. Auto-sourcing overlays bites you when you have several workspaces.

Launch files bring up many nodes, set parameters, and remap topics with one command. ROS 2 supports Python, XML, and YAML. Python is the most powerful (it is code — conditionals, loops, computed values); XML/YAML are cleaner for static setups. A minimal Python launch:

from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        Node(
            package="my_robot_pkg",
            executable="motor_bridge",
            name="motor_bridge",
            parameters=[{"wheel_radius": 0.05, "max_rpm": 3000}],
            remappings=[("cmd_vel", "/diff_drive/cmd_vel")],
        ),
        Node(
            package="sllidar_ros2",
            executable="sllidar_node",
            parameters=[{"serial_port": "/dev/ttyUSB0", "frame_id": "laser"}],
        ),
    ])

For real systems, parameters move out into YAML files loaded per node, and you compose smaller launch files with IncludeLaunchDescription. Keep launch files modular — one per subsystem — and have a top-level bringup.launch.py include them.

ros2_control

ros2_control is the hardware abstraction framework, and it is one of the best-designed parts of the ecosystem. It separates what you command (a controller producing setpoints) from how the hardware is driven (a hardware interface talking to your actual drives), with a real-time loop sitting between them.

The pieces:

controller_manager. The orchestrator. It runs the real-time read → update → write loop at a fixed rate (commonly 100–1000 Hz). On each cycle it reads state from the hardware, runs the active controllers' update(), and writes commands back. This loop is where determinism matters: run it under the SCHED_FIFO scheduling policy on an isolated core if you can. See the real-time control guide for why that matters and how to set it up.

Hardware interfaces (hardware components). Plugins that expose your robot's state interfaces (position, velocity, effort it can read) and command interfaces (position, velocity, effort it can write). You write one of these to talk to your CANopen drives, your EtherCAT bus, or your serial motor controller. This is the layer that hides whether the joint is driven by a FOC controller over CAN or a hobby servo over PWM.

Controllers. Swappable algorithms that read state interfaces and write command interfaces. Stock ones cover most needs: diff_drive_controller (mobile base), joint_trajectory_controller (arm trajectory tracking), forward_command_controller, imu_sensor_broadcaster, joint_state_broadcaster. You can load, unload, activate, and deactivate them at runtime.
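The split between hardware interface, controller, and the loop that ties them together can be sketched in a few lines. This is a simulated-time illustration, not the ros2_control API (which is C++ and plugin-based). FakeJoint, ProportionalController, and the gains are hypothetical:

```python
class FakeJoint:
    """Stand-in hardware interface: one position state, one velocity command."""

    def __init__(self) -> None:
        self.position = 0.0           # state interface (rad)
        self.velocity_command = 0.0   # command interface (rad/s)

    def read(self) -> float:
        return self.position

    def write(self, velocity: float) -> None:
        self.velocity_command = velocity

    def simulate(self, dt: float) -> None:
        """Ideal actuator: the joint integrates the commanded velocity."""
        self.position += self.velocity_command * dt


class ProportionalController:
    """Swappable controller: maps measured state to a clamped velocity command."""

    def __init__(self, kp: float, target: float, max_velocity: float) -> None:
        self.kp = kp
        self.target = target
        self.max_velocity = max_velocity

    def update(self, position: float) -> float:
        command = self.kp * (self.target - position)
        return max(-self.max_velocity, min(self.max_velocity, command))


def run_loop(joint: FakeJoint, controller: ProportionalController,
             rate_hz: float, cycles: int) -> float:
    """Run read -> update -> write for a fixed number of cycles in simulated time."""
    dt = 1.0 / rate_hz
    for _ in range(cycles):
        state = joint.read()                 # 1. read state from hardware
        command = controller.update(state)   # 2. run the active controller
        joint.write(command)                 # 3. write the command back
        joint.simulate(dt)                   # stands in for the real world moving
    return joint.position


if __name__ == "__main__":
    final = run_loop(FakeJoint(), ProportionalController(5.0, 1.0, 2.0), 100.0, 500)
    print(f"final position: {final:.4f} rad")
```

The controller never touches the joint directly and the joint knows nothing about the control law, which is the property that lets either side be replaced. A real controller_manager paces the loop against a clock instead of stepping simulated time.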

The hardware description lives in the URDF as <ros2_control> tags, and controllers are configured in YAML:

controller_manager:
  ros__parameters:
    update_rate: 1000  # Hz
    diff_drive_controller:
      type: diff_drive_controller/DiffDriveController
    joint_state_broadcaster:
      type: joint_state_broadcaster/JointStateBroadcaster

diff_drive_controller:
  ros__parameters:
    left_wheel_names:  ["left_wheel_joint"]
    right_wheel_names: ["right_wheel_joint"]
    wheel_separation: 0.40   # m
    wheel_radius: 0.05       # m
    cmd_vel_timeout: 0.5     # s — stop if no command

$ ros2 control list_hardware_interfaces
$ ros2 control list_controllers
diff_drive_controller    [diff_drive_controller/DiffDriveController]  active
joint_state_broadcaster  [...JointStateBroadcaster]                   active
$ ros2 control switch_controllers --activate diff_drive_controller
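The wheel_separation and wheel_radius values in the YAML above are the two numbers a differential-drive controller needs to turn a body velocity into wheel speeds. A minimal sketch of the standard kinematics, assuming an ideal two-wheel base with no slip; the function names are illustrative and this is not the diff_drive_controller source:

```python
from typing import Tuple


def twist_to_wheel_speeds(linear: float, angular: float,
                          wheel_separation: float, wheel_radius: float) -> Tuple[float, float]:
    """Convert body velocity (m/s, rad/s) to (left, right) wheel speeds in rad/s."""
    v_left = linear - angular * wheel_separation / 2.0
    v_right = linear + angular * wheel_separation / 2.0
    return v_left / wheel_radius, v_right / wheel_radius


def wheel_speeds_to_twist(left: float, right: float,
                          wheel_separation: float, wheel_radius: float) -> Tuple[float, float]:
    """Inverse mapping, as used for odometry: wheel speeds (rad/s) to body velocity."""
    v_left = left * wheel_radius
    v_right = right * wheel_radius
    return (v_left + v_right) / 2.0, (v_right - v_left) / wheel_separation


if __name__ == "__main__":
    separation, radius = 0.40, 0.05   # values from the YAML above
    for linear, angular in ((0.5, 0.0), (0.0, 1.0), (0.5, 1.0)):
        left, right = twist_to_wheel_speeds(linear, angular, separation, radius)
        print(f"v={linear} w={angular} -> left={left:.2f} right={right:.2f} rad/s")
```

An error in either parameter shows up directly in odometry: a wrong wheel_radius scales distance travelled, and a wrong wheel_separation scales rotation, so both are worth measuring rather than reading off a datasheet.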

Rule: the controller_manager loop is soft-real-time at best on stock Linux. The kHz current loop that actually commutates the motor belongs in the drive's firmware (or in micro-ROS on an MCU), not in a ROS 2 controller. ros2_control commands velocity/position; the drive closes the fast loop. Mixing these layers is a classic mistake — see the FOC controllers guide.

The win is that swapping hardware — say from a serial-driven base to an EtherCAT one — touches only the hardware interface. The controllers, the URDF kinematics, and everything above are untouched.

Navigation: Nav2

Nav2 is the ROS 2 navigation stack: the descendant of ROS 1's move_base, rebuilt around lifecycle nodes and behavior trees. It turns "go to this pose" into wheel commands while avoiding obstacles, recovering from failures, and replanning. If you are building an AMR, this is your starting point — see the mobile robots guide.

The architecture, top to bottom:

Behavior Tree (BT) Navigator. The brain. Nav2 does not hard-code the navigation logic — it runs an editable behavior tree (an XML file) that sequences "compute a path," "follow the path," and recovery behaviors ("clear costmap, spin, back up, wait"). Want different recovery logic? Edit the tree, no recompile. This is a genuinely good design; it makes the failure handling inspectable and tunable.

Costmaps. A 2D grid of traversal cost built from sensor data. The global costmap covers the whole known map (for the planner); the local costmap is a rolling window around the robot (for the controller and immediate obstacle avoidance). Layers stack: static (the map), obstacle (live LiDAR/depth), inflation (a safety buffer around obstacles sized to the robot's footprint).

Planners. Compute a global path from start to goal over the global costmap. NavFn (Dijkstra/A*), Smac (a state-lattice/hybrid-A* family that respects vehicle kinematics — important for car-like or large differential robots), and Theta* are the stock options.

Controllers (local planners). Follow the global path while reacting to the local costmap, emitting cmd_vel. DWB (the configurable Dynamic Window successor), the Regulated Pure Pursuit controller (RPP — simple, robust, slows in tight spaces and near goals; a favorite for warehouse AMRs), and MPPI (a sampling-based model-predictive controller, heavier but smoother and better at tight maneuvering).

Localization. AMCL (adaptive Monte Carlo localization) against a static map, or you feed in a pose from a SLAM system like slam_toolbox.
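The behavior-tree idea behind the BT Navigator can be shown with two control nodes. The sketch below is a plain-Python toy, not BehaviorTree.CPP or a Nav2 XML tree: it omits the RUNNING state that real trees rely on, and the leaf actions are hypothetical stand-ins for planner, controller, and recovery calls:

```python
from typing import Callable, List

SUCCESS = "SUCCESS"
FAILURE = "FAILURE"


class Action:
    """Leaf node: runs a callable that returns SUCCESS or FAILURE."""

    def __init__(self, fn: Callable[[], str]) -> None:
        self.fn = fn

    def tick(self) -> str:
        return self.fn()


class Sequence:
    """Ticks children in order; fails as soon as one child fails."""

    def __init__(self, children: List) -> None:
        self.children = children

    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS


class Fallback:
    """Ticks children in order; succeeds as soon as one child succeeds."""

    def __init__(self, children: List) -> None:
        self.children = children

    def tick(self) -> str:
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE


def build_navigate_tree(log: List[str], world: dict) -> Fallback:
    """Hypothetical tree: navigate; on failure, clear the costmap and retry once."""

    def compute_path() -> str:
        log.append("compute_path")
        return SUCCESS

    def follow_path() -> str:
        log.append("follow_path")
        return FAILURE if world["blocked"] else SUCCESS

    def clear_costmap() -> str:
        log.append("clear_costmap")
        world["blocked"] = False   # pretend the obstacle was stale costmap data
        return SUCCESS

    navigate = Sequence([Action(compute_path), Action(follow_path)])
    recover_and_retry = Sequence([Action(clear_costmap), navigate])
    return Fallback([navigate, recover_and_retry])


if __name__ == "__main__":
    log: List[str] = []
    tree = build_navigate_tree(log, {"blocked": True})
    print(tree.tick(), log)
```

Changing the recovery policy means rearranging the tree, not the leaves, which is the property that makes the Nav2 XML trees editable without a recompile.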

$ ros2 launch nav2_bringup navigation_launch.py
$ ros2 topic pub /goal_pose geometry_msgs/msg/PoseStamped "..."   # or use the rviz2 goal tool

Tuning reality: Nav2 works out of the box on a simulated TurtleBot and then takes weeks to tune on a real 200 kg AMR. The robot footprint, inflation radius, controller lookahead, costmap update rate, and the velocity/acceleration limits all interact. Budget the tuning time. The defaults are a starting point, not a deployment.

Nav2 is heavy — several nodes, costmaps eating CPU proportional to map size and update rate — but it is production-grade and runs on real fleets. For perception input, the costmap obstacle layer consumes LiDAR scans and depth point clouds; the LiDAR & depth cameras guide covers picking those sensors.
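The inflation layer mentioned in the costmap description is conceptually a dilation of lethal cells by the robot's radius. The sketch below is a deliberately simplified, binary version on a small grid. The cell labels are made up for the example, and the real Nav2 layer assigns a cost that decays with distance instead of a single "inflated" value:

```python
from typing import List

FREE, INFLATED, LETHAL = 0, 1, 2   # illustrative labels, not Nav2 cost values


def inflate(grid: List[List[int]], radius_cells: int) -> List[List[int]]:
    """Mark every free cell within `radius_cells` (Euclidean) of a lethal cell."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    lethal = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c] == LETHAL]
    for r0, c0 in lethal:
        for r in range(max(0, r0 - radius_cells), min(rows, r0 + radius_cells + 1)):
            for c in range(max(0, c0 - radius_cells), min(cols, c0 + radius_cells + 1)):
                within = (r - r0) ** 2 + (c - c0) ** 2 <= radius_cells ** 2
                if within and out[r][c] == FREE:
                    out[r][c] = INFLATED
    return out


if __name__ == "__main__":
    grid = [[FREE] * 5 for _ in range(5)]
    grid[2][2] = LETHAL
    for row in inflate(grid, radius_cells=1):
        print(row)
```

The work grows with the number of obstacle cells times the square of the inflation radius in cells, which is one concrete reason a finer costmap resolution or a larger footprint raises CPU load.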

Manipulation: MoveIt 2

MoveIt 2 is to arms what Nav2 is to mobile bases: the flagship motion-planning framework. It takes "put the end-effector here" and produces a collision-free, kinematically valid joint trajectory, then hands it to a trajectory controller (typically ros2_control's joint_trajectory_controller).

The pieces:

Kinematics. Forward kinematics is easy; the hard part is inverse kinematics (joint angles for a desired pose). MoveIt 2 plugs in IK solvers: KDL (generic, numerical), TRAC-IK (faster, more reliable convergence), or a generated analytic solver (IKFast) for a specific arm. For the math behind this, see the motion planning & kinematics guide.

Motion planning. The planner finds a path through joint space that avoids collisions. The default pipeline uses OMPL (sampling-based planners like RRTConnect — fast at finding a path, not an optimal one). Pilz delivers deterministic industrial motions (lines, circles, point-to-point with defined velocity profiles). STOMP/CHOMP do optimization-based planning. For most pick-and-place, OMPL + a smoothing/time-parameterization pass is the workhorse.

Planning scene. MoveIt 2's world model — the robot's current state plus collision objects (the table, the bin, the part in the gripper). The planner checks every candidate motion against this scene. Attach an object to the gripper and it moves with the arm in the collision model. Keep the planning scene accurate or the planner will either refuse valid motions or plan into things that are really there.

Trajectory execution. The planned trajectory is time-parameterized (respecting joint velocity/acceleration limits) and sent to the controller for execution, with optional online monitoring.
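Time parameterization is the step that turns a geometric path into something a controller can track within its limits. The sketch below uses a rest-to-rest trapezoidal velocity profile per joint as a stand-in for that idea. It is not the algorithm MoveIt 2 ships, and the limits in the example are made-up numbers:

```python
import math
from typing import Sequence


def trapezoid_duration(distance: float, v_max: float, a_max: float) -> float:
    """Minimum rest-to-rest time to cover `distance` under velocity and acceleration limits."""
    distance = abs(distance)
    if distance == 0.0:
        return 0.0
    ramp_distance = v_max ** 2 / a_max   # distance consumed by accelerating to v_max and back
    if distance >= ramp_distance:
        return distance / v_max + v_max / a_max   # accelerate, cruise, decelerate
    return 2.0 * math.sqrt(distance / a_max)      # triangular: v_max is never reached


def synchronized_duration(deltas: Sequence[float], v_limits: Sequence[float],
                          a_limits: Sequence[float]) -> float:
    """All joints must finish together, so the slowest joint sets the segment time."""
    return max(trapezoid_duration(d, v, a) for d, v, a in zip(deltas, v_limits, a_limits))


if __name__ == "__main__":
    deltas = [1.0, 0.25]      # joint displacements (rad), hypothetical
    v_limits = [1.0, 1.0]     # rad/s
    a_limits = [2.0, 1.0]     # rad/s^2
    for d, v, a in zip(deltas, v_limits, a_limits):
        print(f"joint move {d} rad takes {trapezoid_duration(d, v, a):.3f} s")
    print(f"segment duration: {synchronized_duration(deltas, v_limits, a_limits):.3f} s")
```

Limits that are set too optimistically in the configuration produce trajectories the drives cannot follow, which shows up as tracking errors or controller aborts at execution time, not as planning failures.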

$ ros2 launch moveit2_tutorials demo.launch.py   # rviz2 MotionPlanning panel
# Drag the interactive marker, "Plan & Execute"

The MoveIt Setup Assistant generates the configuration package (SRDF defining planning groups, collision matrices, IK config) from your URDF — start there for a new arm.

Reality: MoveIt 2 plans beautiful trajectories in RViz and then collides with reality the first time your planning scene is wrong or your gripper-to-flange transform is off by a centimeter. Manipulation is unforgiving about calibration and collision geometry. For the hardware side of the arm itself, see the industrial robot arms guide.

Sampling-based planning is non-deterministic by default — RRTConnect gives you a different valid path each run. For industrial cells that need repeatable, certifiable motions, use Pilz or a pre-computed trajectory; do not rely on a fresh OMPL plan being the same twice.

Simulation & tooling

The tooling is half of why ROS 2 is worth using. The big ones:

Gazebo (formerly Ignition). The default simulator. Note the naming mess: "Gazebo Classic" (the old one) is end-of-life; "Gazebo" (formerly "Ignition Gazebo," versioned Harmonic, Ionic) is the current one. It simulates physics, sensors (LiDAR, cameras, IMU return realistic data), and your robot's URDF/SDF, and it integrates with ros2_control via gz_ros2_control so the same controllers run in sim and on hardware. That last part is the point: you develop against the sim, then change the hardware interface and run the identical stack on the robot.

RViz2. 3D visualization. It is not a simulator — it draws what the graph is publishing: the robot model, TF frames, LiDAR scans, costmaps, planned paths, point clouds. Your first debugging move for almost any robot problem is "open RViz2 and see what the robot thinks is happening." Frames in the wrong place, a costmap that looks wrong, a LiDAR scan pointing the wrong way — RViz2 shows it instantly.

ros2 bag. Records topics to a file (the .mcap format is now the default and worth using over the old sqlite3) and replays them. This is your robot's flight recorder. Record a failure on the real robot, replay it at your desk into your perception/planning nodes, and debug offline. It is also how you build datasets and regression tests. Record everything during field tests; storage is cheap, a reproduced failure is priceless.

$ ros2 bag record -a -o field_test_01           # record all topics
$ ros2 bag record /scan /odom /tf /camera/image_raw   # or be selective
$ ros2 bag play field_test_01 --rate 0.5        # replay at half speed
$ ros2 bag info field_test_01

rqt. A Qt-based plugin GUI for graph inspection (rqt_graph), live plotting (rqt_plot), parameter editing, and console log viewing.

The sim-to-real workflow in practice: model the robot in URDF, validate kinematics and controllers in Gazebo, develop perception against simulated sensors and against recorded real bags (sim sensors are too clean — real LiDAR has dropouts, real cameras have motion blur), then deploy the identical node graph to hardware with only the hardware interface and sensor drivers swapped. The gap between a clean sim and a noisy robot is where most of the real engineering is; do not trust a behavior that has only ever run in Gazebo.

Real-time & determinism

This is where expectations and reality collide, so be precise about it.

A ROS 2 node is not hard-real-time, and the framework does not pretend otherwise. Pub/sub over DDS, dynamic memory allocation in the default path, the OS scheduler, and garbage collection (in Python) all introduce jitter. You can get soft real-time — bounded-most-of-the-time latency, good enough for navigation and trajectory following — but you cannot get a guaranteed sub-millisecond deadline out of a stock rclpy node on stock Linux.

The executor and callback model. A ROS 2 node's callbacks (subscription callbacks, timers, service handlers) run inside an executor. The default SingleThreadedExecutor runs one callback at a time, in a non-obvious order, on one thread — fine for simple nodes, a bottleneck and a jitter source when callbacks are heavy. The MultiThreadedExecutor runs callbacks in parallel across a thread pool, but then you need callback groups to control which callbacks can run concurrently (mutually-exclusive vs. reentrant) or you will create race conditions. Picking and configuring the executor is the main lever you have over a node's timing behavior.
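
The difference between mutually-exclusive and reentrant callback groups can be modeled in plain Python. The sketch below is not rclpy: a thread pool stands in for a multi-threaded executor, and a lock stands in for a mutually-exclusive group (a real executor declines to schedule the second callback rather than blocking a worker thread on a lock). All names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Records the highest number of callbacks that were running at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def run_callbacks(n_callbacks, mutually_exclusive, workers=4, work_s=0.02):
    """Run n callbacks on a thread pool and return the peak concurrency seen."""
    probe = ConcurrencyProbe()
    group_lock = threading.Lock()  # stand-in for a mutually-exclusive group

    def callback():
        if mutually_exclusive:
            with group_lock, probe:
                time.sleep(work_s)  # simulated callback work
        else:
            with probe:
                time.sleep(work_s)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(n_callbacks):
            pool.submit(callback)
    return probe.peak


if __name__ == "__main__":
    print("mutually exclusive peak:", run_callbacks(6, mutually_exclusive=True))
    print("reentrant peak:", run_callbacks(6, mutually_exclusive=False))
```

In the mutually-exclusive case the peak is always 1, so shared state inside those callbacks needs no extra locking. In the reentrant case the peak can reach the pool size, which is where the race conditions come from if the callbacks touch the same data.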

What is RT-safe. rclcpp was built so the publish/subscribe hot path can avoid allocations if you pre-allocate messages and use the right allocator. There is ongoing work on real-time-safe executors, plus a line of research on priority-driven, callback-group-based scheduling (PiCAS). But "RT-safe ROS 2" means careful C++ (no heap allocation such as new in the loop, no unbounded queues, real-time-priority threads, locked memory), not "it works because it's ROS 2."

Where the deterministic loop actually lives:

The kHz motor commutation/current loop: in the drive firmware (FOC controller), not ROS. See motor controllers & FOC.
The 100–1000 Hz joint control loop: in ros2_control's controller_manager, running on a SCHED_FIFO thread pinned to an isolated core, ideally on a PREEMPT_RT kernel.
The hard, fast, safety-critical loop on a microcontroller: in micro-ROS or in bare firmware below ROS.

Rule: treat ROS 2 as the orchestration and perception/planning layer (soft RT, tens to hundreds of Hz) and keep the hard-real-time loop below it. Architectures that try to close a 1 kHz servo loop through the DDS graph on general-purpose Linux will work in the demo and bite you in the field. The full treatment is in the real-time control systems guide.

If you genuinely need determinism inside ROS 2: run a PREEMPT_RT kernel, isolate CPUs (isolcpus), set SCHED_FIFO priorities, lock memory (mlockall), pre-allocate everything in the loop, use Cyclone DDS or a tuned Fast DDS with shared-memory transport, and measure jitter with cyclictest and your own latency tracing. It is achievable and several teams do it. It is not free and it is not automatic.
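
For the "your own latency tracing" part, a minimal sketch of the bookkeeping: record when each cycle of a periodic loop actually woke up, then summarize how far each interval strayed from the intended period. It is plain Python with a sleep-based loop, so it shows the measurement method only; the numbers it produces on a desktop say nothing about a tuned real-time thread. The function names are hypothetical.

```python
import statistics
import time


def sample_wakeups(period_s, cycles):
    """Run a sleep-based periodic loop and record when each cycle actually woke."""
    wakeups = []
    next_deadline = time.perf_counter()
    for _ in range(cycles):
        next_deadline += period_s  # absolute deadlines, so errors do not accumulate
        delay = next_deadline - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        wakeups.append(time.perf_counter())
    return wakeups


def jitter_stats(wakeup_times_s, period_s):
    """Summarize the deviation of each cycle interval from the intended period."""
    intervals = [b - a for a, b in zip(wakeup_times_s, wakeup_times_s[1:])]
    errors_us = [(dt - period_s) * 1e6 for dt in intervals]
    return {
        "cycles": len(intervals),
        "mean_error_us": statistics.fmean(errors_us),
        "max_late_us": max(errors_us),
        "max_early_us": min(errors_us),
    }


if __name__ == "__main__":
    period = 0.001  # 1 kHz target
    print(jitter_stats(sample_wakeups(period, 1000), period))
```

The worst case matters more than the mean: a loop that averages a few microseconds of error but occasionally wakes a millisecond late has missed a 1 kHz deadline.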

Production ROS 2 & should you use it

Getting a demo running is a weekend. Shipping a product is a different sport. Here is what production actually requires.

Security: SROS2

By default, every topic on the network is readable and writable by anyone who can reach it — exactly like ROS 1. SROS2 wraps DDS-Security to add authentication (nodes prove identity with certificates), encryption (traffic is unreadable on the wire), and access control (a policy file says which node may publish/subscribe to which topic). It is opt-in, certificate-managed, and adds CPU and latency. Turn it on for anything that leaves a trusted lab network; budget for the key management.

DDS tuning

Out-of-the-box DDS settings are tuned for correctness on a small graph, not for your robot. The common production adjustments:

Increase OS socket buffers (net.core.rmem_max): the default Linux receive buffers overflow on large messages (point clouds, images), which shows up as mysterious loss.
Enable shared-memory transport for intra-host traffic (Fast DDS and Cyclone both support it) — huge win when many nodes on one machine exchange big messages.
Use a Discovery Server (Fast DDS) or Zenoh router once the graph is large or the network is flaky.
Tune QoS depths and history so you are not buffering megabytes of stale images.
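
A back-of-envelope sketch of why the buffer and depth items above matter for images. It compares one uncompressed frame with a socket receive buffer and estimates what a KEEP_LAST history can hold. The 212,992-byte figure is an assumption (a common Linux default; check sysctl net.core.rmem_max on your system), the helper names are hypothetical, and the comparison is deliberately rough: it ignores message headers and how the middleware fragments large samples.

```python
# Assumed typical Linux default for net.core.rmem_max, in bytes. Verify locally.
ASSUMED_DEFAULT_RMEM = 212_992


def image_bytes(width, height, bytes_per_pixel=3):
    """Payload size of one uncompressed image, ignoring the message header."""
    return width * height * bytes_per_pixel


def keep_last_queue_bytes(message_bytes, depth):
    """Worst-case payload held by one KEEP_LAST history of the given depth."""
    return message_bytes * depth


def fits_in_socket_buffer(message_bytes, rmem_bytes=ASSUMED_DEFAULT_RMEM):
    """Rough check: could the receive buffer hold a whole message at once?"""
    return message_bytes <= rmem_bytes


if __name__ == "__main__":
    frame = image_bytes(1920, 1080)              # 6,220,800 bytes per 1080p RGB frame
    print(frame, fits_in_socket_buffer(frame))   # far larger than the assumed buffer
    print(keep_last_queue_bytes(frame, 10))      # 62,208,000 bytes for depth 10
```

A depth of 10 on a single 1080p RGB subscription can pin roughly 62 MB of frames that are already stale by the time they are processed, which is why camera topics usually want a depth of 1 or 2.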

Multi-machine and multi-robot

With the same ROS_DOMAIN_ID, the same network, and working multicast, nodes across machines discover each other automatically. The failure modes are network ones: multicast blocked by a managed switch or firewall, MTU mismatches fragmenting large messages, Wi-Fi roaming dropping discovery. For multi-robot, separate domains per robot and bridge only the topics that must cross (Zenoh's router model is increasingly the clean answer here, especially over WAN or cellular).
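
When a firewall is the suspect, it helps to know which UDP ports a domain uses. The sketch below applies the default port-mapping formula from the DDS-RTPS specification. It assumes the DDS vendor has not been configured with non-default ports, and it does not apply to rmw_zenoh. The function name is hypothetical.

```python
# Default RTPS port-mapping constants (DDS-RTPS specification defaults).
PORT_BASE = 7400
DOMAIN_GAIN = 250
PARTICIPANT_GAIN = 2
D0, D1, D2, D3 = 0, 10, 1, 11


def rtps_ports(domain_id, participant_id=0):
    """UDP ports used by one participant in a domain, under default settings."""
    base = PORT_BASE + DOMAIN_GAIN * domain_id
    return {
        "discovery_multicast": base + D0,
        "user_multicast": base + D2,
        "discovery_unicast": base + D1 + PARTICIPANT_GAIN * participant_id,
        "user_unicast": base + D3 + PARTICIPANT_GAIN * participant_id,
    }


if __name__ == "__main__":
    print(rtps_ports(0))  # discovery multicast 7400, user multicast 7401, unicast 7410/7411
    print(rtps_ports(1))  # each domain shifts every port by 250
```

Each additional participant in the same domain on the same host takes the next pair of unicast ports, so a firewall rule needs to allow a range per domain, not a single port.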

micro-ROS for MCUs

Not every node needs a Linux box. micro-ROS is a real ROS 2 client library for microcontrollers (STM32, ESP32, Teensy, and RTOSes like FreeRTOS, Zephyr, NuttX). The MCU runs rclc and talks to a micro-ROS agent on a Linux host, which bridges it into the full DDS graph. This is the right home for the kHz sensor sampling or the fast control loop — the determinism lives on the MCU, and it appears in your graph as just another node publishing sensor_msgs/Imu or subscribing to a setpoint.

Deployment

Real fleets containerize (Docker) for reproducible environments, pin the ROS distro and every dependency, and ship updates over the air. Lifecycle (managed) nodes give you deterministic bringup and shutdown — a node goes unconfigured → inactive → active under supervision, so you can configure all nodes, then activate them in order, instead of racing at startup. Use them for anything that must come up in a controlled sequence.
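
The "configure all, then activate in order" pattern can be sketched as a small state machine. This is a plain-Python model of the four primary lifecycle states' main transitions. It leaves out the intermediate transition states, error handling, and shutdown, and it is not the rclpy or rclcpp lifecycle API. ManagedNode and bring_up are hypothetical names.

```python
# Main lifecycle transitions: (current state, transition) -> next state.
TRANSITIONS = {
    ("unconfigured", "configure"): "inactive",
    ("inactive", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("inactive", "cleanup"): "unconfigured",
}


class ManagedNode:
    """Toy model of a lifecycle node that only tracks its primary state."""

    def __init__(self, name):
        self.name = name
        self.state = "unconfigured"

    def trigger(self, transition):
        key = (self.state, transition)
        if key not in TRANSITIONS:
            raise ValueError(f"{self.name}: cannot {transition} from {self.state}")
        self.state = TRANSITIONS[key]


def bring_up(nodes):
    """Configure every node first, then activate them in list order."""
    for node in nodes:
        node.trigger("configure")
    for node in nodes:
        node.trigger("activate")


if __name__ == "__main__":
    stack = [ManagedNode("lidar_driver"), ManagedNode("localizer"), ManagedNode("planner")]
    bring_up(stack)
    print([(n.name, n.state) for n in stack])
```

Because no node is activated until every node has configured successfully, a missing parameter file or an unreachable device fails the bringup before anything starts publishing.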

The honest pain points

DDS debugging is opaque. When discovery fails, the logs are not friendly. Budget for it.
QoS mismatches fail silently. The fix is fast once you know to check; finding it the first time is not.
Build and dependency hell. rosdep, version skew between your packages and the distro, and the source-vs-binary mix can eat a day.
The tuning tax on Nav2 and MoveIt 2 is real and recurring — every new robot footprint is a fresh tuning cycle.
Documentation drift. Tutorials lag the current distro; a snippet written for Humble may not work on Jazzy/Kilted unchanged.
Real-time is your problem, not ROS's. The framework hands you the tools; the determinism is your engineering.
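
The silent QoS mismatch in the list above follows a simple rule: a subscriber may request at most what the publisher offers. A minimal sketch of that check for the two policies that cause most tickets, reliability and durability, is below. It is a plain-Python model with hypothetical names, not the rclpy QoS API, and it ignores the other policies (deadline, liveliness) that also take part in matching.

```python
from enum import IntEnum


class Reliability(IntEnum):
    BEST_EFFORT = 0
    RELIABLE = 1


class Durability(IntEnum):
    VOLATILE = 0
    TRANSIENT_LOCAL = 1


def qos_compatible(pub_rel, pub_dur, sub_rel, sub_dur):
    """True if the publisher offers at least what the subscriber requests."""
    return pub_rel >= sub_rel and pub_dur >= sub_dur


if __name__ == "__main__":
    # Best-effort sensor publisher, reliable subscriber: no connection, no error.
    print(qos_compatible(Reliability.BEST_EFFORT, Durability.VOLATILE,
                         Reliability.RELIABLE, Durability.VOLATILE))     # False
    # Reliable publisher, best-effort subscriber: connects fine.
    print(qos_compatible(Reliability.RELIABLE, Durability.VOLATILE,
                         Reliability.BEST_EFFORT, Durability.VOLATILE))  # True
```

The asymmetry is the useful part: relaxing the subscriber always helps, which is why subscribing to sensor topics with a best-effort profile is the safe default.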

Should you use ROS 2?

A decision framework, not a slogan:

Your situation	Verdict
One MCU, one control loop, no perception	No. Bare firmware (or micro-ROS only if you want graph integration later).
Mobile robot needing navigation, or any arm needing planning	Yes. Nav2/MoveIt 2 alone justify it.
Multi-sensor robot, multiple subsystems sharing data	Yes. The graph + standard messages are the whole point.
Hard-real-time safety loop as the core deliverable	Not as the RT layer. Use it above a deterministic layer (firmware/micro-ROS/`ros2_control` on PREEMPT_RT).
Shipping a product, small team, tight timeline	Probably yes, eyes open. You inherit a massive ecosystem; you also inherit DDS, QoS, and the tuning tax.
Research / prototyping / learning	Yes, easily. This is ROS 2's strongest case.

The honest bottom line: ROS 2 in 2026 is the default for good reasons — the decentralized graph, the DDS foundation, the production features, and an ecosystem (Nav2, MoveIt 2, ros2_control, micro-ROS) that would take years to rebuild. It does not make your robot real-time, it does not make networking simple, and it will charge you a tuning tax. Pin Jazzy for production, learn DDS and QoS before you need to, keep the hard loop below ROS, and you will spend your time on your robot instead of on the middleware.

Frequently asked questions

Is ROS 2 actually an operating system? No. It is middleware plus a build system, tooling, and an ecosystem, running on top of a real OS (usually Ubuntu Linux). The "OS" in the name is historical from the original Robot Operating System.

Which ROS 2 distro should I use in 2026? Jazzy Jalisco for production — it is the current LTS (released May 2024, supported to 2029) on Ubuntu 24.04. Kilted Kaiju (May 2025) is the newer non-LTS for early adopters, and Rolling is the always-latest development line. Pick the LTS unless you have a specific need for newer features, and match your Ubuntu LTS to your ROS distro.

Should I migrate my ROS 1 system to ROS 2? If it is anything beyond a frozen legacy system, yes — ROS 1 Noetic reached end of life in May 2025 and there will be no further releases or security fixes. New projects should start on ROS 2 directly. For migration, the ros1_bridge lets a ROS 1 and ROS 2 graph talk during a transition.

Why don't my messages arrive even though the publisher is running? The overwhelmingly likely cause is a QoS mismatch: a RELIABLE subscriber will not connect to a BEST_EFFORT publisher. Run ros2 topic info /your_topic -v and compare the QoS on both ends. Sensor topics are usually best-effort; subscribe with the sensor QoS profile.

Fast DDS vs Cyclone DDS vs Zenoh — which middleware? Start with the distro default (Fast DDS on Jazzy/Kilted). Switch to Cyclone DDS if you want lean, predictable latency and simple tuning on a single robot. Use rmw_zenoh when you have large graphs, multi-robot, or WAN/lossy links where classic DDS discovery struggles. Change it with one environment variable, RMW_IMPLEMENTATION.

Is ROS 2 real-time? Soft real-time, at best, and only with effort (PREEMPT_RT kernel, isolated CPUs, SCHED_FIFO, locked memory, no allocation in the loop, careful executor/QoS choices). It is not hard-real-time out of the box. Put genuinely hard loops in drive firmware, micro-ROS on an MCU, or ros2_control tuned for it.

What is the difference between a topic, a service, and an action? Topics are streaming, many-to-many, fire-and-forget pub/sub (sensor data, odometry). Services are request/response calls for quick queries; the client can block on the reply or handle it asynchronously. Actions are for long-running, cancelable goals with feedback (navigate to a pose, plan and execute a trajectory).

Do I have to use C++, or is Python fine? Both are first-class. Use rclpy for glue, configuration, prototyping, and non-time-critical nodes; use rclcpp for anything on a hot path or where you need real control over allocation and the executor. Most production robots run a mix.

What is ros2_control and do I need it? It is the hardware abstraction layer: a real-time controller manager running a read/update/write loop, hardware interfaces that talk to your drives, and swappable controllers (diff-drive, joint trajectory, etc.). Use it for any robot with actuators you command in a loop; it cleanly separates control logic from the specific hardware.

What is micro-ROS? A ROS 2 client library for microcontrollers (STM32, ESP32, Teensy) over RTOSes like FreeRTOS and Zephyr. The MCU runs a lean client and bridges into the full graph through a micro-ROS agent on a Linux host — ideal for fast, deterministic sensing and control loops that then appear as ordinary nodes.

How do I record and replay robot data? ros2 bag record -a captures all topics (default format is .mcap); ros2 bag play replays them. It is your flight recorder: record field tests, replay failures at your desk into your perception and planning nodes to debug offline, and use bags for regression tests.

Why can't two nodes on the same machine see each other? Most often a different ROS_DOMAIN_ID, a forgotten workspace source (so one node isn't actually built/on the path), a QoS mismatch, or blocked multicast/loopback. Check the domain ID, confirm both overlays are sourced, then check QoS with ros2 topic info -v.