SLAM & Robot Localization: The Ultimate Guide

A robot driving across a warehouse has to answer one question continuously, dozens of times a second: where am I? If it gets that answer wrong by 30 cm it clips a rack; if it gets it wrong by 2 m it is lost. The frustrating part is that the obvious way to answer it — "compare what I see to the map" — assumes you already have a map. And the obvious way to build a map — "stitch together what I see from each known pose" — assumes you already know where you are. You need the pose to build the map and the map to find the pose. That circular dependency is SLAM.

This guide is about Simultaneous Localization and Mapping and its close cousin, localization against a known map. We will start from the state-estimation framing — the belief, the motion model, the observation model — then walk the three algorithmic families (EKF-SLAM, particle filters and FastSLAM, and modern factor-graph SLAM), the front-end/back-end split, scan matching, the real lidar and visual-inertial stacks engineers actually deploy (Cartographer, slam_toolbox, LIO-SAM, FAST-LIO2, ORB-SLAM3, VINS-Fusion, RTAB-Map, OpenVINS; AMCL for known-map localization), loop closure, map representations, the compute budget, the failure modes that will bite you, and how to choose.

The take: in 2026 the default for almost any new system is factor-graph (pose-graph) SLAM with a tightly-coupled front-end — lidar-inertial outdoors and on fast platforms, visual-inertial where weight and cost dominate — and you keep filters (EKF, particle filter) for two jobs only: fusing fast proprioceptive sensors into a smooth odometry stream, and Monte-Carlo localization against a map you already trust. The single biggest lever on accuracy is not the algorithm; it is sensor quality, calibration, and whether your environment gives the front-end something to latch onto. Most "SLAM is broken" tickets are really a featureless corridor, a bad extrinsic, or an IMU nobody calibrated.

Companion reading: LiDAR & depth cameras, robot sensors, mobile robots: AMRs & AGVs, motion planning & kinematics, ROS 2, and machine vision.

Key takeaways
The chicken-and-egg problem
Problem framing: state, models, and the belief
Odometry sources and drift
Filtering vs optimization: the great split
Front-end vs back-end
Scan matching and lidar SLAM stacks
Visual SLAM and visual-inertial odometry
Loop closure and place recognition
Map representations
The sensor and compute budget
Degeneracy and failure cases
2D vs 3D, indoor vs outdoor
Selecting a stack and the Nav2 tie-in
Frequently asked questions

Key takeaways

SLAM is a chicken-and-egg problem solved by estimating both at once. You jointly estimate the robot's trajectory and the map from noisy sensors. Treat them separately and the errors compound; estimate them together and the constraints between them cancel a lot of error.
It is fundamentally state estimation. Everything reduces to a belief — a probability distribution over where you are and what the world looks like — propagated by a motion model and corrected by an observation model. Pick the wrong noise models and no algorithm saves you.
All odometry drifts; the only question is how fast. Wheel odometry drifts on slip, IMU integration drifts as t² in position, visual and lidar odometry drift slower but still without bound. SLAM exists to bound that drift with loop closure and map constraints.
The field moved from filters to optimization. EKF-SLAM (O(n²) in landmarks) and particle-filter SLAM gave way to factor-graph / pose-graph SLAM because optimization over a sparse graph relinearizes, scales, and produces better maps. GTSAM, g2o, and Ceres are the backends.
Filters still own two niches. An EKF/UKF (e.g. robot_localization) fuses wheel + IMU + GPS into smooth high-rate odometry; a particle filter (AMCL/MCL) localizes against a known map. Neither is the right tool for building a large map from scratch anymore.
Front-end vs back-end is the architecture. The front-end turns raw sensor data into constraints (scan matches, feature tracks, loop detections); the back-end optimizes the graph of those constraints. Most failures are front-end failures.
Scan matching is the lidar workhorse. ICP (point-to-point/point-to-plane) and NDT register scans; the strong lidar stacks — Cartographer, slam_toolbox, LIO-SAM, FAST-LIO2 — wrap matching in a graph and, increasingly, fuse the IMU tightly.
Visual SLAM splits into feature-based and direct. ORB-SLAM3 (features + bag-of-words loop closure) is the reference; VINS-Fusion and OpenVINS are the visual-inertial standards. Tight IMU coupling beats loose coupling for robustness and scale observability.
Loop closure is what makes a map instead of a long drift. Recognizing a previously-visited place (bag-of-words / DBoW2, learned descriptors) adds a constraint that snaps the accumulated error back. Without it you have odometry with extra steps.
Maps cost memory, and the cost is geometric. A 2D occupancy grid at 5 cm over 100×100 m is ~4 MB; a 3D voxel grid at the same resolution over a building is gigabytes. Choose the representation (grid, point cloud, mesh, topological) for the consumer, not the sensor.
Degeneracy is the real enemy. Featureless corridors, symmetric rooms, glass, and dynamic scenes break the front-end. Detect degeneracy, lean on the IMU through it, and never trust a single-modality stack in an environment that can starve it.
2D indoor and 3D outdoor are different sports. Flat-floor AMRs want 2D lidar SLAM (slam_toolbox) + AMCL. Outdoor, uneven, or 6-DoF platforms want 3D lidar-inertial (FAST-LIO2/LIO-SAM) or VIO. Match the algorithm's assumptions to the world.
Mapping and localization are separate runtime modes. You usually map once (online or offline), freeze the map, then localize against it in production. The Nav2 stack expects exactly this split.

The chicken-and-egg problem

Start with the two operations a navigating robot needs and notice they each depend on the other.

Localization is "given a map, where am I?" You compare a sensor reading — a lidar scan, a camera image — to a known map and find the pose that best explains it. This is the easier problem, and it is what runs in production once you have a map.

Mapping is "given my poses, what does the world look like?" You take sensor readings from a sequence of known poses and fuse them into a consistent model. Also tractable — if you know the poses.

The trouble is that on a fresh deployment you have neither. You do not know where you are because you have no map, and you cannot build a map because you do not know where you are. Worse, the errors are correlated: an error in your estimated pose places the landmark you just observed in the wrong spot on the map, and then that wrong landmark corrupts the next pose estimate. Errors feed each other.

SLAM's insight is that you should not pick one to solve first. You estimate the trajectory and the map jointly, as one big coupled estimation problem, and you exploit the fact that the same landmark seen from multiple poses ties those poses together. Re-observing a landmark you mapped earlier is a constraint that pins down both the landmark and your current pose. Close a big loop — return to where you started — and a single constraint can correct hundreds of metres of accumulated drift across the whole trajectory at once.

Rule of thumb: if you have a trustworthy prior map and the environment is stable, you do not need SLAM — you need localization. Run SLAM to build the map, then switch to localization for production. Running full SLAM forever when a frozen map would do is a common and expensive mistake.

That is the whole game: SLAM is the bootstrapping phase that gets you a map and a trajectory at once; localization is what you do afterward. Most of this guide is about doing the bootstrapping well, because it is the hard part.

Problem framing: state, models, and the belief

Strip away the implementation and SLAM is a Bayesian state-estimation problem. There are four objects you must define before any algorithm means anything.

The state

The state x is everything you are estimating. At minimum it is the robot's pose: in 2D that is (x, y, θ) — three numbers; in 3D it is position plus orientation, six degrees of freedom (often carried as a 7-vector with a unit quaternion, or on the SE(3) manifold). In full SLAM the state also includes the map — landmark positions, or a whole pose history, depending on the formulation. In visual-inertial systems the state grows to include velocity, accelerometer bias, and gyroscope bias, because you cannot estimate pose from an IMU without estimating its biases too.

The motion model (prediction)

The motion model p(xₜ | xₜ₋₁, uₜ) says how the state evolves given a control or proprioceptive input uₜ — wheel encoder ticks, an IMU sample, a commanded velocity. It is your prediction step. It is also where odometry drift is born: the model is never exact, and the uncertainty it injects grows every timestep with no observation to correct it.

The observation model (correction)

The observation model p(zₜ | xₜ, map) says what measurement zₜ you expect to see from a given state and map — what a lidar beam should return, where a visual feature should project. When a real measurement arrives, you compare it to the prediction and use the mismatch (the innovation, or residual) to correct the state. This is the step that fights drift.

The belief

The belief bel(xₜ) = p(xₜ | z₁:ₜ, u₁:ₜ) is the full posterior — the probability distribution over the state given everything you have ever sensed and commanded. The entire field is different ways to represent and update this belief:

A Gaussian (mean + covariance) → Kalman-family filters.
A set of weighted samples (particles) → particle filters.
A maximum-a-posteriori point estimate from a graph of constraints → factor-graph SLAM.

Recursive Bayes filter (the skeleton under everything):

  predict:   bel⁻(xₜ) = ∫ p(xₜ | xₜ₋₁, uₜ) · bel(xₜ₋₁) dxₜ₋₁
  correct:   bel(xₜ)  = η · p(zₜ | xₜ) · bel⁻(xₜ)

  η = normalizer.  Predict grows uncertainty; correct shrinks it.

Rule of thumb: the noise models matter as much as the algorithm. If you feed an EKF a wheel-odometry covariance that is 10× too optimistic, it will trust odometry over good lidar corrections and drift confidently into a wall. Tuning the Q (process) and R (measurement) noise is not a detail — it is the job.

Everything below is a commitment to one belief representation and one way to run predict/correct cheaply enough for a real robot.

Odometry sources and drift

Odometry is dead reckoning: integrating motion to estimate pose. Every source drifts; understanding how each drifts tells you which to fuse and which to trust.

Wheel odometry. Integrate encoder ticks through a kinematic model. Cheap, high-rate (often 100–1000 Hz), and smooth — but it believes the wheels. Slip, skid, uneven tire diameter, and the dreaded kidnapped push corrupt it instantly, and the heading error integrates into unbounded position error. On a flat floor it is excellent for the short term and useless for the long term. See mobile robots for the drive geometries behind it.

Inertial (IMU). A gyro measures angular rate; an accelerometer measures specific force. Integrate the gyro once for orientation, the accelerometer twice for position. The double integration is brutal: a constant accel bias of just 0.01 m/s² grows into a position error of ½·0.01·t² ≈ 0.5 m after 10 s and 2 m after 20 s. Orientation drifts more slowly (single integration of gyro bias), and the gravity vector gives you an absolute roll/pitch reference, but heading (yaw) drifts freely without a magnetometer or external fix. IMUs are unbeatable for the very short term and high-frequency motion; they are why VIO and LIO work.

Visual odometry (VO). Track features (or pixel intensities) across frames and solve for the camera motion that explains the apparent motion. Drifts far slower than wheels or raw IMU, but accumulates scale drift (a single camera cannot observe absolute scale) and breaks in low texture, motion blur, and bad lighting. Fuse it with an IMU and the scale becomes observable — that is visual-inertial odometry.

Lidar odometry (LO). Register consecutive scans (ICP/NDT) to estimate motion. Geometrically accurate and metric (lidar measures real distance), robust to lighting, but degenerate where geometry is ambiguous — a long featureless corridor, a flat field — and heavy on compute. Fuse with an IMU → lidar-inertial odometry, the basis of LIO-SAM and FAST-LIO2.

Why double integration is the IMU's curse:

  accel bias b = 0.01 m/s²  (a good MEMS IMU, uncorrected)

  velocity error  = b · t
  position error  = ½ · b · t²

  t = 1 s  → 0.005 m      (fine)
  t = 10 s → 0.5 m        (clipping shelves)
  t = 60 s → 18 m         (lost)

  → The IMU MUST be corrected by an exteroceptive sensor.

Rule of thumb: no odometry source is good at everything. The IMU is great at high-frequency, short-term motion and terrible at low-frequency drift; lidar/vision are the reverse. Fusing them — fast IMU prediction, slower exteroceptive correction — is why modern inertial-aided stacks dominate. SLAM then bounds even the fused drift with loop closure.

Filtering vs optimization: the great split

There are two grand strategies for maintaining the belief. The history of SLAM is largely the migration from the first to the second.

EKF-SLAM

The original. Represent the belief as one big Gaussian over [robot pose, all landmark positions], and run an Extended Kalman Filter: linearize the nonlinear motion and observation models around the current estimate, predict, then correct on each landmark observation.

EKF predict/update sketch (state x, covariance P):

  predict:
    x⁻ = f(x, u)                 # nonlinear motion model
    P⁻ = F·P·Fᵀ + Q              # F = ∂f/∂x (Jacobian), Q = process noise

  update (observe landmark j):
    y  = z − h(x⁻)               # innovation (residual)
    S  = H·P⁻·Hᵀ + R             # H = ∂h/∂x, R = measurement noise
    K  = P⁻·Hᵀ·S⁻¹               # Kalman gain
    x  = x⁻ + K·y
    P  = (I − K·H)·P⁻

EKF-SLAM works and was the field's backbone into the 2000s, but it has a fatal scaling property: the covariance P is dense (every landmark becomes correlated with every other), so the update is O(n²) in the number of landmarks n. A few hundred landmarks is fine; tens of thousands is not. It also linearizes once per step around a possibly-wrong estimate and can never undo that linearization error, which makes it brittle on large loops. The UKF (unscented) variant avoids explicit Jacobians and handles nonlinearity better, but the O(n²) scaling and the single-pass linearization remain.

Particle filters and FastSLAM

A particle filter represents the belief as a cloud of weighted samples, each a hypothesis of the full state. Predict by pushing every particle through the motion model (with noise); correct by reweighting each particle by how well it explains the measurement; periodically resample to kill low-weight particles. No Gaussian assumption — it can represent multi-modal beliefs (e.g. "I'm either in room A or the identical room B"), which is exactly what global localization needs.

FastSLAM is the clever application to mapping: it factorizes the problem so each particle carries its own map of independent EKF-tracked landmarks (Rao-Blackwellization). It scales far better than EKF-SLAM and powered GMapping, the classic 2D grid SLAM. The catch is particle depletion: on a long loop the diversity collapses, the true hypothesis gets resampled away, and the map tears.

For localization against a known map, the particle filter is still the right tool — this is Monte-Carlo Localization (MCL), and AMCL (Adaptive MCL, the ROS standard) is its production form. It handles the multi-modal "where am I globally?" question and the kidnapped-robot recovery that a Gaussian filter cannot. AMCL is mapping's retired cousin: great for localization, not for building the map.

Factor-graph / pose-graph SLAM

The modern default. Do not maintain a running filtered estimate at all. Instead, accumulate every measurement as a constraint (factor) in a graph whose nodes are the things you want to estimate (poses, landmarks) and whose edges are the constraints between them (odometry between consecutive poses, a loop closure between distant poses, a landmark observation). Then solve for the configuration of all nodes that minimizes the total weighted residual — a big nonlinear least-squares problem.

Pose-graph optimization (the cost being minimized):

  X* = argmin_X  Σ_ij  rᵢⱼ(xᵢ, xⱼ)ᵀ · Ωᵢⱼ · rᵢⱼ(xᵢ, xⱼ)

  rᵢⱼ = error of edge (i,j):
        residual between the MEASURED relative transform zᵢⱼ
        and the one PREDICTED by current poses xᵢ, xⱼ
  Ωᵢⱼ = information matrix (inverse covariance) — how much to trust edge ij

  Solved by Gauss-Newton / Levenberg-Marquardt over the manifold SE(2)/SE(3).
  The Jacobian is SPARSE → exploit it (Cholesky) → scales to 10⁵+ nodes.

Why it won: it relinearizes every iteration (so it recovers from bad initial guesses where the EKF cannot), it is sparse (an odometry-and-loop graph is nowhere near fully connected, so factorization is fast), and it estimates the whole trajectory so a single loop closure corrects everything at once. The backends are mature and battle-tested: GTSAM (factor graphs, incremental solving via iSAM2), g2o (the classic general graph optimizer), and Ceres (Google's general nonlinear least-squares, used by Cartographer and VINS). iSAM2's incremental update is what makes graph SLAM real-time: it only re-solves the part of the graph a new factor actually touches.

Property	EKF-SLAM	Particle filter / FastSLAM	Factor-graph / pose-graph SLAM
Belief representation	Single Gaussian (mean + cov)	Weighted samples (particles)	MAP point estimate from a graph
Multi-modal?	No	Yes (its main strength)	No (single estimate)
Scaling in landmarks/poses	`O(n²)` (dense covariance)	`O(particles × map)`; depletion risk	Sparse, `O(n)`-ish; `10⁵+` nodes
Linearization	Once per step, never undone	N/A (sampling)	Relinearized every iteration
Loop closure handling	Poor on large loops	Causes depletion	Excellent — corrects whole trajectory
Recovers from bad init	Weakly	Yes (resampling)	Yes (re-optimization)
Best modern use	Sensor fusion (small state)	Known-map localization (AMCL)	Default for building maps
Real systems	`robot_localization` EKF	GMapping, AMCL/MCL	Cartographer, slam_toolbox, LIO-SAM, ORB-SLAM3, VINS

Rule of thumb: in 2026, build maps with a factor graph, fuse fast proprioceptive sensors with an EKF, and localize against a known map with a particle filter. Using an EKF to build a large landmark map, or a particle filter to map a whole building, is fighting the tooling.

Front-end vs back-end

Every serious SLAM system has two halves, and confusing them is how teams misdiagnose problems.

The front-end is perception and data association. It turns raw sensor data into constraints: it extracts features or keypoints, matches them across frames, runs scan matching to estimate relative motion, and — critically — decides which measurements correspond to which landmarks (data association) and whether the current view matches a past one (loop-closure detection). The front-end is sensor-specific (a lidar front-end and a camera front-end share almost no code) and it is where the hard, brittle decisions live.

The back-end is the optimizer. It takes the constraints the front-end produced and finds the trajectory and map that best satisfy them — the factor-graph optimization above, or the filter update. The back-end is mostly sensor-agnostic linear algebra; GTSAM does not care whether an edge came from a lidar or a camera.

The reason this split matters operationally:

Rule of thumb: the back-end is rarely your problem. Almost every SLAM failure in the field is a front-end failure — a bad scan match in a degenerate corridor, a wrong data association, or a false loop closure. A single false loop closure is catastrophic: it tells the optimizer two genuinely-distant places are the same, and the back-end faithfully folds your map in half.

This is why robust back-ends added outlier-rejection machinery: switchable constraints, dynamic covariance scaling, graduated non-convexity (GNC), and Cauchy/Huber robust kernels that let the optimizer down-weight a constraint that disagrees violently with everything else. They are insurance against the front-end's worst mistakes. But the right primary defense is a front-end that does not generate garbage: good features, geometric verification of loop candidates (RANSAC on the matched points), and consistency checks before a loop closure is allowed into the graph.

Scan matching and lidar SLAM stacks

Lidar SLAM starts from one operation: given two point clouds, find the rigid transform that aligns them. That is scan matching, and it is the lidar front-end's core.

ICP and NDT

Iterative Closest Point (ICP) alternates two steps until convergence: (1) for each point in scan B, find the closest point in scan A; (2) solve for the transform that minimizes the summed distances; repeat. Point-to-point ICP minimizes point distances; point-to-plane ICP minimizes the distance from each point to the local surface tangent of its match, which converges faster and is the practical default for structured environments. ICP is accurate when the initial guess is good (feed it the IMU or odometry prior) and fragile when it is not — it falls into local minima and needs a decent prior to seed it.

Normal Distributions Transform (NDT) takes a different tack: voxelize the reference cloud and model each voxel as a Gaussian, then align the new scan by maximizing the likelihood of its points under that field of Gaussians. NDT is smoother (it optimizes a continuous, differentiable cost rather than discrete correspondences), often more robust to a poor initial guess, and a common choice for outdoor automotive lidar registration.

The 2D stacks

slam_toolbox is the 2D lidar SLAM default in ROS 2 today. It is pose-graph SLAM: scan matching for odometry, a graph back-end (Ceres) for optimization, and a scan-matching loop-closure detector. Crucially it supports lifelong mapping — load a saved graph, keep mapping, and serialize the pose graph so you can re-localize and continue later. For a flat-floor indoor AMR it is the safe, well-supported choice, and it cleanly hands off to AMCL for production localization.

Cartographer (originally Google) is the other heavyweight, available in 2D and 3D. Its architecture is distinctive: the front-end builds small submaps (each a little local occupancy grid) by scan-matching incoming scans into the current submap; the back-end runs branch-and-bound scan matching to detect loop closures against all finished submaps, then optimizes a sparse pose graph (Ceres) over submap and scan poses. The submap design makes loop closure efficient and the maps crisp. It is heavier to tune than slam_toolbox but produces excellent results, and it handles 3D backpack/handheld mapping well.

The 3D inertial stacks

For 3D, fast, or 6-DoF platforms, the modern systems couple the lidar with the IMU tightly.

LIO-SAM (lidar-inertial odometry via smoothing and mapping) is a factor-graph system built on GTSAM. It pre-integrates IMU between lidar keyframes for a strong motion prior, extracts edge and planar features (LOAM-style), scan-matches against a local map, and adds IMU pre-integration factors, lidar odometry factors, optional GPS factors, and loop-closure factors to the graph. It is accurate and a strong outdoor/ground-vehicle choice, and the GPS factor makes geo-referenced mapping straightforward.

FAST-LIO2 is the efficiency benchmark. It is a tightly-coupled iterated EKF (not a graph) that — and this is the key idea — registers raw points directly against the map with no feature extraction, using an incremental k-d tree (ikd-Tree) to keep the map queryable in real time. The math (a clever Kalman gain formulation) makes the EKF update cost scale with measurement dimension rather than state dimension, so it runs at high rate on modest compute, even on a small embedded CPU. It is odometry-grade (no built-in large-loop closure), so people pair it with a separate loop-closure/pose-graph layer when they need a globally consistent map. If you need real-time 3D state estimation on a drone or quadruped with limited compute, FAST-LIO2 is the one to beat. See legged robots for those platforms.

System	Dim	Approach	IMU coupling	Loop closure	Backend	Best for
slam_toolbox	2D	Pose-graph, scan match	None (uses odom)	Yes (scan match)	Ceres	Indoor flat-floor AMRs; lifelong mapping
Cartographer	2D/3D	Submaps + branch-and-bound	Optional	Yes (vs submaps)	Ceres	Crisp maps, handheld/backpack 3D
LIO-SAM	3D	Feature LIO, factor graph	Tight (pre-integration)	Yes	GTSAM	Outdoor ground vehicles, geo-referenced
FAST-LIO2	3D	Direct LIO, iterated EKF	Tight	No (add a layer)	iEKF + ikd-Tree	Real-time on light compute (drones, legged)

Rule of thumb: for a flat indoor robot, slam_toolbox. For 3D on real compute with loops you care about, LIO-SAM. For 3D on a weight/compute budget, FAST-LIO2 plus a separate loop-closure layer. Cartographer when you want the cleanest maps and will pay the tuning cost.

Visual SLAM and visual-inertial odometry

Cameras are cheap, light, low-power, and information-dense — and that is exactly why visual SLAM is harder than lidar SLAM. A camera gives you bearing but not range (a monocular camera cannot see scale at all), it dies in the dark and in low texture, and motion blur destroys it. The payoff is rich data for loop closure and a sensor that costs and weighs almost nothing. See machine vision for the imaging fundamentals.

Feature-based vs direct

Feature-based methods detect repeatable keypoints (ORB, SIFT-like), describe them, match them across frames, and optimize camera poses and 3D point positions to minimize reprojection error (the pixel distance between where a 3D point projects and where it was observed). They throw away most of the image and keep a sparse set of robust points. Fast, mature, and good at loop closure (the descriptors double as a place-recognition vocabulary).

Direct methods (LSD-SLAM, DSO) skip features entirely and optimize photometric error — the raw intensity difference — over (semi-)dense pixels. They use more of the image, handle low-texture scenes where features are sparse, and produce denser maps, but they are sensitive to brightness changes, rolling shutter, and need good photometric calibration. In production, feature-based has been the more robust workhorse; direct is excellent where it fits.

ORB-SLAM3 is the reference feature-based system, and it is genuinely good: monocular, stereo, and RGB-D; with or without an IMU (it is a full visual-inertial system too); a multi-map system (Atlas) that can lose tracking, start a new map, and later merge maps when it recognizes a connection; and DBoW2 bag-of-words loop closure and relocalization. If you want to understand visual SLAM, read ORB-SLAM3.

Visual-inertial odometry (VIO)

A monocular camera cannot observe scale or absolute roll/pitch; an IMU can (gravity gives roll/pitch, acceleration gives metric scale). Fuse them and you get VIO — metric, gravity-aligned, robust to the brief moments the camera fails (blur, a passing truck). VIO is the workhorse for drones, AR/VR headsets, and weight-constrained robots.

The central design choice is coupling:

Loose coupling runs the visual estimator and the IMU estimator separately and fuses their outputs (e.g. a VO pose into an EKF that also integrates IMU). Simpler, modular, but it throws away cross-information and is less robust.
Tight coupling puts raw IMU pre-integration and visual feature measurements into one estimator (one factor graph or one filter) and solves jointly. More accurate, better at recovering scale and biases, more robust to degeneracy — and the clear winner for serious systems.

VINS-Fusion (HKUST) is the standard tightly-coupled, optimization-based VIO — monocular-inertial, stereo, stereo-inertial; sliding-window nonlinear optimization (Ceres) with IMU pre-integration, plus a separate pose-graph loop-closure module (DBoW2). It is the system most VIO work is compared against.

OpenVINS is the standard tightly-coupled, filter-based VIO — a Multi-State Constraint Kalman Filter (MSCKF). It is lighter than full optimization, extremely well-documented, and a favorite research and embedded baseline. The MSCKF trick is to keep a sliding window of past poses in the state and marginalize features cleverly, getting most of optimization's accuracy at filter cost.

RTAB-Map is the pragmatic, batteries-included option: an RGB-D / stereo graph-SLAM system with strong appearance-based loop closure and built-in memory management (it pages old parts of the map out of working memory to stay real-time on big maps). It is less a research baseline and more the thing you reach for when you have an RGB-D camera and want a dense map and ROS integration without assembling a stack yourself.

System	Type	Sensors	Coupling	Loop closure	Notes
ORB-SLAM3	Feature, optimization	Mono/stereo/RGB-D (+IMU)	Tight (VI mode)	DBoW2 + multi-map merge	The reference visual SLAM
VINS-Fusion	Feature, optimization	Mono/stereo (+IMU)	Tight	DBoW2 (separate module)	The VIO optimization standard
OpenVINS	Feature, filter (MSCKF)	Mono/stereo + IMU	Tight	Limited (it's odometry)	Light, well-documented VIO baseline
RTAB-Map	Feature, graph	RGB-D / stereo (+lidar)	Loose-ish	Appearance-based, strong	Batteries-included, dense maps, memory mgmt
DSO / LSD-SLAM	Direct	Mono	—	LSD: yes; DSO: no	Dense-ish, low-texture-tolerant, calib-sensitive

Aspect	Lidar SLAM	Visual / VI SLAM
Range info	Direct, metric	Bearing only (mono); metric with stereo/RGB-D/IMU
Lighting	Indifferent (active)	Fails in dark / strong texture changes
Texture dependence	Needs geometry, not texture	Needs texture, not geometry
Degenerate case	Featureless corridor, open field	Blank wall, low light, motion blur
Loop closure	Geometric (scan/submap match)	Appearance (bag-of-words) — very strong
Cost / weight / power	Higher (esp. 3D lidar)	Low (camera + IMU is cheap and light)
Map richness	Geometry, sparse semantics	Dense texture, semantics-friendly
Typical platform	AMRs, AGVs, AVs, large robots	Drones, AR/VR, humanoids, cost-sensitive

Rule of thumb: if you can afford the lidar and weight, lidar-inertial is the more robust map-builder. If weight, cost, or power rule out lidar — drones, headsets, consumer robots — go visual-inertial and couple the IMU tightly. The best fielded systems on big robots fuse both, so each covers the other's degenerate case. See LiDAR & depth cameras for the fusion argument.

Loop closure and place recognition

This is the single feature that separates SLAM from "odometry that draws a map." Without it, your trajectory drifts steadily and the map smears; you can drive a perfect square and have the start and end points 3 m apart. Loop closure recognizes that you have returned to a previously-visited place and adds a constraint tying the current pose to the old one. The back-end then redistributes the accumulated error across the whole loop, snapping the map into consistency.

The hard part is recognizing the place — place recognition — fast and without false positives.

Bag-of-words (visual). The classic approach (DBoW2, used by ORB-SLAM3 and VINS) quantizes feature descriptors into a precomputed visual "vocabulary," so each image becomes a sparse histogram of visual words. Comparing two images is then a fast vector comparison, and you can index thousands of past keyframes and query them in milliseconds. A candidate match is then geometrically verified (match the actual features, run RANSAC, require enough consistent inliers) before it is allowed to become a loop-closure constraint. The verification step is non-negotiable — bag-of-words alone produces perceptual-aliasing false matches.

Geometric loop detection (lidar). Lidar stacks detect loops by scan/submap matching against past poses (Cartographer's branch-and-bound, LIO-SAM's radius search + ICP) or with global descriptors like Scan Context that summarize a 3D scan into a rotation-invariant signature for fast candidate retrieval. Same pattern: cheap candidate retrieval, then expensive geometric verification.

Rule of thumb: be conservative with loop closures. A missed loop closure costs you some drift you can fix on the next pass; a false loop closure corrupts the entire map irreversibly. Require strong geometric verification and use robust back-end kernels (GNC, switchable constraints) as a second line of defense.

The deep enemy of place recognition is perceptual aliasing — different places that look identical (every aisle in a warehouse, every floor of a parking garage, a row of identical office doors). This is exactly where appearance-based recognition produces confident false matches, and it is why symmetric and repetitive environments are so hard. Learned global descriptors (NetVLAD and successors) are more robust to viewpoint and lighting change than classic bag-of-words and are increasingly common in 2026 stacks, but they do not eliminate aliasing — geometric verification still has the final say.

Map representations

The map is not an afterthought; its representation determines what your planner can do and what your robot can afford to store. Choose it for the consumer — the planner, the localizer, the human — not for the sensor.

Occupancy grid (2D / 3D). Discretize space into cells, each holding the probability it is occupied. The standard for 2D navigation and the input AMCL and most 2D planners expect. Simple, supports ray-casting for localization, and directly answers "is this cell free?" The cost is memory, and in 3D it explodes. OctoMap mitigates the 3D cost with an octree that stores free/occupied space at adaptive resolution (large empty regions collapse to one node).

Point cloud. The raw-ish output of lidar/depth SLAM — a set of 3D points, optionally with intensity or color. Dense, accurate, great for 3D registration and visualization, but unstructured (no explicit free space, no connectivity) and heavy. Most 3D lidar SLAM maps are point clouds; you down-sample (voxel grid) hard before storage.

Mesh / surfel. Reconstruct surfaces as triangles or oriented disks (surfels). Compact for surfaces, great for rendering, manipulation, and human consumption — and the natural output of dense RGB-D fusion (TSDF-based methods). More processing to build and maintain.

Topological / semantic. A graph of places and connections ("kitchen → hallway → lab") rather than metric geometry. Tiny, robust to metric error, ideal for high-level task planning and very large environments — but you cannot servo to a millimetre with it. The strong systems are hybrid: metric maps locally, topological structure globally.

Occupancy-grid memory math (why 3D hurts):

  2D grid, 5 cm resolution, 100 m × 100 m:
    cells = (100/0.05)² = 2000 × 2000 = 4,000,000
    @ 1 byte/cell (8-bit log-odds) ≈ 4 MB        # trivial

  3D dense grid, 5 cm resolution, 100 m × 100 m × 10 m:
    cells = 2000 × 2000 × 200 = 800,000,000
    @ 1 byte/cell ≈ 800 MB                        # painful
    @ 5 cm over a 200 m × 200 m × 20 m site ≈ 12.8 GB   # unworkable dense

  → 3D wants an octree (OctoMap): empty space collapses, so the
    real cost scales with SURFACE area, not volume — often 10–100× smaller.

Representation	Memory	Free space?	Planner fit	Best for
2D occupancy grid	Low (MBs)	Explicit	Excellent (2D)	Flat-floor indoor navigation, AMCL
3D occupancy / OctoMap	Medium (octree)	Explicit	Good (3D)	3D collision checking, aerial/legged
Point cloud	High	No	Poor directly	Registration, 3D viz, source for other maps
Mesh / surfel	Medium (surface)	Surface only	Manipulation/render	Dense reconstruction, AR, grasping
Topological / semantic	Tiny	Abstract	High-level only	Task planning, very large environments

Rule of thumb: localize against a compact map (2D grid, sparse landmarks), plan against an occupancy map, and keep the dense point cloud only if a downstream consumer (manipulation, inspection, reconstruction) actually needs it. Carrying a full dense cloud around just to navigate a flat floor is wasted memory and CPU.

The sensor and compute budget

SLAM is a real-time system competing for the same CPU as perception, planning, and control. The budget is real, and it is where elegant algorithms meet shipping deadlines.

Sensors set the ceiling. No algorithm recovers information the sensors did not capture. A good IMU (low bias instability, e.g. an industrial-grade MEMS at a few deg/hr) is worth more to a VIO/LIO system than a fancier optimizer on a cheap IMU. A 3D lidar at 1.3–2.6 M points/s, a global-shutter camera (rolling shutter wrecks VIO unless modeled), and time-synchronized, calibrated sensors are the foundation. The two most common silent killers: an uncalibrated camera-IMU extrinsic (the rigid transform between them) and unsynchronized timestamps. A few-millisecond timing offset between camera and IMU degrades VIO badly; SLAM systems include online time-offset estimation precisely because this is so common.

Rule of thumb: spend the calibration effort before you blame the algorithm. Intrinsics, extrinsics, and time synchronization account for a large fraction of "this SLAM system is bad" reports. Kalibr-style calibration for VIO and a careful extrinsic for LIO are not optional.

Compute splits front-end and back-end. The front-end (feature extraction, scan matching) runs every frame and must keep up with the sensor rate; the back-end (graph optimization, loop closure) can run slower and asynchronously. This is why systems separate them onto different threads — the odometry stays real-time while the optimizer catches up in the background. FAST-LIO2 exists largely because that front-end loop must fit on small compute; ORB-SLAM3 and VINS run the heavy optimization in a back thread so tracking never stalls.

Rough numbers (2026, order-of-magnitude, platform-dependent):

2D lidar SLAM (slam_toolbox): comfortable on a modern quad-core ARM/x86; modest RAM.
VIO (OpenVINS/VINS): real-time on an embedded x86 or a Jetson-class board; tight but feasible.
3D LIO (FAST-LIO2): designed to run on a single modern CPU core at lidar rate; LIO-SAM wants more for the graph.
Dense reconstruction (TSDF/mesh): wants a GPU.

See real-time control for how SLAM coexists with the deterministic loops it must not starve, and robot sensors for the upstream sensing.

Degeneracy and failure cases

Knowing how SLAM breaks is more useful than knowing how it works, because the breakage is where your robot ends up against a wall.

Featureless / geometrically degenerate environments. A long, straight, featureless corridor is the textbook lidar killer: scans constrain your lateral position and heading but say nothing about how far you have travelled along it — the problem is under-constrained in one direction. The scan matcher slides freely and reports false confidence. Open fields, tunnels, and large flat walls do the same. The defense is to detect degeneracy (monitor the conditioning of the optimization — small eigenvalues of the information matrix flag an unobservable direction) and lean on the IMU/wheel odometry through it.

Textureless / low-light scenes (visual). Blank walls, white-out fog, darkness, and uniform surfaces starve a feature tracker. Direct methods help a little; an IMU helps a lot (it coasts through brief outages); but a camera-only system in a dark featureless space is simply blind.

Dynamic scenes. SLAM's core assumption is a static world. People, forklifts, other robots, and opened doors violate it. Features tracked on a moving object pull your pose estimate with them, and moving objects get baked into the map as phantom obstacles. Defenses: detect and reject dynamic objects (semantic segmentation, RANSAC outlier rejection treating movers as outliers), use short map memory so transients fade, and weight the static structure. A busy warehouse aisle at shift change is a genuinely hard case.

Perceptual aliasing. Covered above — repetitive environments fool place recognition into false loop closures. The most dangerous failure because it corrupts the whole map, not just the current pose.

The kidnapped-robot problem. The robot is picked up and moved (or the localizer simply loses track). A filter that has converged to a tight Gaussian around the wrong pose cannot recover — it is too confident. This is precisely why AMCL is a particle filter with injected random particles and adaptive sampling: it keeps enough hypothesis diversity to re-converge when the world contradicts it. Pure dead-reckoning has no recovery at all.

Glass and mirrors. Lidar passes through glass (no return, or a return from beyond it) and sees a mirror as a tunnel into a false room; cameras see reflections as real geometry. Both corrupt the map. Mark known glass, or fuse a sensor that sees it (some radar, ultrasonic).

Rule of thumb: never deploy a single-modality SLAM stack in an environment that can starve that modality. The cheapest robustness upgrade is almost always a well-calibrated IMU tightly coupled to your primary sensor — it carries you through the brief degeneracies that would otherwise lose the pose.

2D vs 3D, indoor vs outdoor

The right stack depends on the dimensionality of the world your robot actually lives in.

2D, indoor, flat floor. An AMR on a warehouse or hospital floor moves in (x, y, θ). A 2D lidar at sensor height plus a 2D occupancy grid is the mature, cheap, robust answer: slam_toolbox to build the map, AMCL to localize against it in production. Do not pay for 3D you do not use. The one caveat: a 2D lidar at a fixed height is blind to overhangs and low obstacles — pair it with a depth camera for obstacle avoidance even if SLAM stays 2D. This is the bread-and-butter case for most of the robots in the AMR/AGV guide.

3D, outdoor or uneven. The moment the robot pitches and rolls — outdoor terrain, ramps, stairs, drones, legged platforms — you need full 6-DoF state, a 3D lidar or VIO, and an IMU. The ground is not a plane, gravity is not always "down" in the body frame, and a 2D assumption produces nonsense. FAST-LIO2 / LIO-SAM for lidar platforms, VINS/OpenVINS for visual ones.

Outdoor adds GPS/GNSS. Outdoors you usually have a global fix (GNSS, RTK for centimetre accuracy), which changes the problem: you no longer need loop closure to bound global drift because GPS provides absolute position directly. The modern pattern is to fuse GNSS as a factor in the graph (LIO-SAM's GPS factor) — local lidar/visual SLAM for smooth, high-rate, locally-consistent motion; GNSS for the global anchor that kills long-term drift. Indoors you have no such anchor, which is exactly why indoor SLAM leans so hard on loop closure.

Rule of thumb: match the algorithm's dimensional assumptions to the physical world. A 2D stack on a flat floor is a feature (simple, robust, cheap); a 2D stack on a robot that pitches is a bug. And if you have GNSS, use it — an absolute anchor is worth more than the cleverest loop-closure detector.

Selecting a stack and the Nav2 tie-in

Put it together as a decision procedure rather than a popularity contest.

1. Do you even need SLAM, or just localization? If the environment is stable and you can map it once, map it once (online or from a recorded bag), freeze the map, and run localization in production. This is the common production architecture and it is far more robust than mapping forever.

2. 2D or 3D? Flat floor, planar motion → 2D. Pitch/roll, terrain, flight, stairs → 3D and an IMU.

3. What's your primary exteroceptive sensor and budget?

2D lidar, indoor, cost-conscious → slam_toolbox (build) + AMCL (localize).
3D lidar, real compute, want loops/geo-reference → LIO-SAM.
3D lidar, tight compute/weight (drone, legged) → FAST-LIO2 (+ a loop-closure layer).
Camera + IMU, weight/cost dominate → VINS-Fusion or OpenVINS; ORB-SLAM3 if you want maps + relocalization.
RGB-D, want a dense map with minimal assembly → RTAB-Map.
Cleanest 2D/3D maps, willing to tune → Cartographer.

4. Always add the IMU. Across every modern stack, tightly coupling a calibrated IMU is the highest-ROI robustness improvement. Budget for the calibration.

The Nav2 / ROS 2 tie-in

In a ROS 2 navigation system the pieces have clean, standardized seams, and SLAM slots into a well-defined place:

Mapping mode: run slam_toolbox (or Cartographer) → it publishes the map → odom transform and a nav_msgs/OccupancyGrid, and your wheel/inertial odometry publishes odom → base_link via an EKF (robot_localization). Save the map.
Localization mode: load the saved map, run AMCL → it corrects the map → odom transform by matching live scans to the frozen map. Your odometry source still provides the smooth odom → base_link.
The TF tree is the contract. map → odom → base_link → sensors. SLAM/AMCL own map → odom (the drift correction); odometry owns odom → base_link (smooth, high-rate, drifting); the URDF owns the rest. Nav2's costmaps, planners, and controllers consume the result. See the ROS 2 guide for the TF and the motion planning guide for what the planner does with the map.

The honest bottom line: SLAM in 2026 is a solved-enough problem that you should almost never write your own. Pick the stack that matches your dimensionality, sensor, and compute; couple the IMU tightly; spend the calibration and synchronization effort up front; map once and localize in production; and treat loop closure as something to do carefully, not aggressively. Do that and you will spend your engineering on your robot's actual job, not on rediscovering why the corridor ate your pose.

Frequently asked questions

What is the difference between SLAM and localization? Localization assumes you already have a map and answers "where am I in it?" SLAM builds the map and estimates your trajectory at the same time, with no prior map. In practice you run SLAM once to build the map, freeze it, then run localization (e.g. AMCL) in production.

Is SLAM a solved problem? For common cases — indoor flat-floor 2D, well-lit visual-inertial, 3D lidar in feature-rich environments — yes, with mature open-source stacks. It is not solved for long-term operation in highly dynamic, changing, perceptually-aliased, or sensor-degenerate environments. Lifelong SLAM and robustness to change are still active problems.

EKF-SLAM, particle filter, or graph SLAM — which should I use? For building maps, graph (factor-graph/pose-graph) SLAM is the modern default. Use a particle filter for localizing against a known map (AMCL/MCL), where its multi-modal belief handles global localization and the kidnapped-robot problem. Use an EKF/UKF for fusing fast proprioceptive sensors (wheel + IMU + GPS) into smooth odometry, not for mapping.

Why does loop closure matter so much? Without it, every SLAM system is just odometry that draws a map, and odometry drifts without bound — you can return to your start and be metres off. Loop closure recognizes a revisited place and adds a constraint that lets the optimizer redistribute accumulated error across the whole loop, snapping the map into global consistency.

Do I need a lidar, or is a camera enough? A camera plus a well-calibrated IMU (visual-inertial) is enough for many robots and is far cheaper and lighter — it is the standard for drones, headsets, and cost-sensitive platforms. Lidar is more robust (active, metric, lighting-indifferent) and better in low texture, at the cost of price, weight, and power. Big robots that can afford both fuse them.

What is the difference between front-end and back-end? The front-end turns raw sensor data into constraints (feature tracking, scan matching, data association, loop detection) and is sensor-specific. The back-end optimizes the graph of those constraints (GTSAM, g2o, Ceres) and is mostly sensor-agnostic. Most real-world SLAM failures are front-end failures, especially false loop closures.

Why does my robot's pose drift even with SLAM running? Between loop closures, the back-end can only do as well as the odometry constraints, so some drift is expected on the open trajectory. Persistent or large drift usually means a degenerate environment (featureless corridor), a bad sensor calibration/extrinsic, an uncalibrated or noisy IMU, or no loop closures being detected. Check calibration and synchronization first.

What is the kidnapped-robot problem? The robot is moved without odometry registering it, or the localizer otherwise loses track. A converged Gaussian filter is too confident to recover. AMCL is a particle filter specifically because injecting random particles and adaptive sampling let it re-converge when sensor data contradicts its current belief — that hypothesis diversity is the recovery mechanism.

Why is a featureless corridor so hard for lidar SLAM? The problem becomes under-constrained: scans pin down your lateral position and heading but provide no information about distance travelled along the corridor, so the scan matcher slides freely with false confidence. Detect the degeneracy (small eigenvalues in the information matrix) and rely on IMU/wheel odometry through it.

How much memory does a SLAM map need? A 2D occupancy grid is cheap — a 100×100 m area at 5 cm is about 4 MB. A dense 3D voxel grid explodes — the same footprint with 10 m of height is hundreds of MB, and a large site is unworkable dense, which is why 3D uses octrees (OctoMap) that collapse empty space and scale with surface area, not volume.

Loose vs tight coupling in visual-inertial systems? Loose coupling fuses the outputs of separate visual and inertial estimators (simpler, less accurate). Tight coupling puts raw IMU and visual measurements into one estimator and solves jointly (more accurate, better scale/bias observability, more robust to brief outages). Serious VIO systems — VINS-Fusion, OpenVINS, ORB-SLAM3's VI mode — are all tightly coupled.

How does SLAM fit into ROS 2 and Nav2? SLAM (slam_toolbox/Cartographer) publishes the map and owns the map → odom transform during mapping; AMCL owns it during localization against a saved map. Your odometry (an EKF over wheel + IMU) owns odom → base_link. Nav2's costmaps, planners, and controllers consume the resulting map and TF tree. The standardized TF contract is what lets these pieces swap cleanly.