Machine Vision for Automation: The Ultimate Guide

Q: Why is lighting considered the most important part of a vision system?

Because lighting determines whether the feature is *visible and stable* in the image, and no algorithm — classic or neural — can reliably recover a feature the imaging never captured. The right geometry (backlight, dome, dark-field) and wavelength turn a hard problem into a trivial one, while the wrong light injects glare and variability that no software fully fixes. It is also the cheapest fix on the station, which is why experienced engineers treat it as ~80% of the job.

Half the machine-vision jobs that fail were lost before anyone opened a software package. The camera had a third too few pixels for the feature, the lens threw 2% perspective error into a measurement that needed 0.1%, or (most often) somebody bolted a ring light onto a shiny part and spent three weeks fighting glare that a backlight would have killed in an afternoon. Vision is an optics-and-lighting discipline that happens to involve a computer, and the engineers who treat it the other way around are the ones who end up rewriting tolerances at 2 a.m.

This guide is about 2D machine vision for the factory: locating, measuring, inspecting, and reading. We will go deep on the hardware — image sensors and why global shutter matters the instant your part moves, the camera interfaces (GigE Vision, USB3 Vision, Camera Link, CoaXPress) and what they actually buy you in bandwidth and cable length, optics including the telecentric lenses that make sub-0.05 mm metrology possible, and the lighting techniques that separate a robust deployment from a flaky one. Then the system: calibration and the pixels-to-millimetres chain, classic algorithms vs deep-learning inspection, robot guidance and hand-eye calibration, and the PLC handshake that makes it all part of a line.

The take: machine vision is a systems problem where lighting and optics dominate the outcome, and the single most expensive mistake is choosing a camera resolution before you have done the feature-size-to-pixels math. Get the imaging right and a 1990s blob tool will out-perform a state-of-the-art neural network fed garbage frames. Spend your design effort where the physics is — sensor, lens, light — and the algorithm choice becomes the easy part.

Companion reading: LiDAR & depth cameras, industrial robot arms, industrial automation (PLC/SCADA/fieldbus), and motion planning & kinematics.

Key takeaways
What machine vision actually is
The vision system anatomy
Image sensors: CMOS, shutters, format
Industrial cameras and interfaces
Optics and lenses
Lighting: the most neglected 80%
The measurement chain and calibration
Vision processing: classic vs deep learning
Robot and vision integration
Triggering, timing, and the PLC interface
Applications and accuracy expectations
Designing and selecting a vision system
Frequently asked questions

Key takeaways

Machine vision = automated imaging + a decision. It does four jobs: locate (where is the part), measure (gauging), inspect (defect/presence), and identify/read (OCR, barcodes, DataMatrix). Everything else is plumbing around those four.
Lighting is 80% of the battle. The cheapest reliability win on any vision job is the right light geometry and wavelength. A backlight turns a hard edge-measurement into a trivial silhouette; a ring light turns a shiny part into a glare nightmare. Spend here first.
Global shutter for anything that moves. Rolling shutter skews moving objects and smears at high line speed; for parts on a conveyor or robot end-of-arm cameras, global shutter (Sony Pregius IMX2xx/IMX5xx family) is non-negotiable.
The resolution math comes before the camera. Pixels-per-millimetre is set by PPM = sensor_px / FoV_mm, and you need roughly 3–5 px across the smallest feature you must detect. Pick the feature, do the math, then buy the sensor — never the reverse.
Telecentric lenses kill perspective error for metrology. A standard (entocentric) lens magnifies near objects more than far ones, so a part that shifts in working distance changes apparent size by a percent or more. A telecentric lens holds magnification constant across its depth of field, which is essential for sub-0.05 mm gauging.
Interface follows bandwidth and cable length. GigE Vision (~115 MB/s, 100 m) for distance and multi-camera; USB3 Vision (~350 MB/s, ~3–5 m) for cheap single cameras; CoaXPress (up to 12.5 Gbit/s/lane, ~40 m) and Camera Link for high-speed line scan. Compute the data rate first: bytes/s = W × H × bytes/px × fps.
Smart camera vs PC-based is a complexity decision. A Cognex In-Sight or Keyence smart camera is fastest to deploy for one or two checks; a PC-based VisionPro/Halcon system wins on many cameras, heavy compute, and deep learning.
Classic vision still beats deep learning when the rules are clean. A measured dimension, a present/absent check, a read barcode — use blob, edge, pattern-match, OCR. Save CNNs for cosmetic/surface defects where "defect" is hard to define in pixels.
Sub-pixel is real and it is the difference between pass and scrap. Good edge and pattern tools resolve to ~1/10–1/40 of a pixel, so 5 µm/px imaging can support roughly 0.1–0.5 µm repeatability, provided your optics, lighting, and calibration cooperate.
Calibration converts pixels to the world. A grid/dot-target calibration removes lens distortion and gives a pixels-to-mm map and, for guidance, the camera-to-robot transform (hand-eye). Skip it and your "measurement" is a number with no units.
Robot guidance needs hand-eye calibration. Eye-in-hand (camera on the wrist) or eye-to-hand (fixed overhead). Either way you solve for the rigid transform between camera frame and robot frame — see motion planning & kinematics and industrial robot arms.
The line owns the timing. A hardware trigger fires the camera, exposure freezes the part, the result handshakes to the PLC over digital I/O or fieldbus. Throughput is gated by exposure + readout + processing, not by your CPU's marketing number — see industrial automation and real-time control.
3D is a different guide. This is 2D inspection/guidance/measurement. For depth, point clouds, and 3D bin-picking, see LiDAR & depth cameras.

What machine vision actually is

Machine vision is automated imaging followed by an automated decision. A camera forms an image, software extracts a measurement or classification from it, and the system acts on that result — accept, reject, locate, report — without a human in the loop. The "decision" part is what separates it from photography and, frankly, from a lot of what gets called computer vision in research papers.

It is worth being precise about the boundary. General computer vision is the broad academic field — recognize a cat, caption a scene, drive a car. Machine vision is the industrial subset: constrained scene, controlled lighting, known part, a yes/no or a number, and a hard cycle-time budget. The constraints are a gift. You own the lighting, you own the part presentation, you usually know roughly where the feature is — and that lets a well-built machine-vision system hit 99.9%+ reliability where an unconstrained CV model would struggle to clear 90%.

The four tasks

Nearly every job decomposes into one or more of four primitives:

Locate — find a part or feature and report its position and angle. This feeds robot guidance, alignment, and downstream tools that need a frame of reference. The workhorse here is pattern matching (geometric, rotation/scale invariant).
Measure (gauge) — extract a dimension: a diameter, a gap, an angle, a position-to-tolerance. This is metrology, and it is where optics and calibration matter most. Accuracy targets of ±0.01–0.1 mm are common.
Inspect — judge condition: present/absent, defect/no-defect, correct/incorrect assembly, scratch, contamination, fill level. Ranges from trivial (is the cap there?) to genuinely hard cosmetic defect detection.
Identify / read — decode a 1D barcode, a 2D DataMatrix or QR, read printed or dot-peen text (OCR/OCV), verify a label.

Rule of thumb: write down which of the four tasks each station performs before you touch hardware. A "locate" job and a "gauge" job on the same part can demand wildly different cameras, lenses, and lighting.

Versus human inspection

Humans are spectacular general inspectors and terrible repeatable ones. A person catches the weird defect nobody specified — and then misses the obvious one after lunch on hour six. Machine vision is the inverse: it will check the same 12 features identically 10 million times, at 60+ parts/minute, with an audit trail and a saved image of every reject. Where it loses is novelty and judgment. The honest division of labour on most lines is: machine vision does the high-volume, well-defined checks; humans handle exceptions, setup, and the genuinely ambiguous calls.

The vision system anatomy

Every 2D vision station is the same five blocks. Get any one wrong and the others cannot save you.

Lighting — controls what the camera sees. Geometry (where the light comes from) and wavelength (what colour) make features appear or vanish. This is where robustness lives.
Optics (lens) — projects the scene onto the sensor at the right magnification, working distance, and depth of field, with enough resolving power (MTF) to support the pixel count.
Camera (sensor) — converts photons to pixels. Sensor size, resolution, shutter type, and frame rate set the imaging envelope.
Processing — runs the algorithms. Either inside a smart camera (embedded) or on a PC/industrial controller.
Communication — gets the result out: digital I/O, fieldbus (EtherNet/IP, PROFINET, EtherCAT), or a serial/TCP string to a PLC, robot, or MES.

The classic mistake is to spend the budget on the camera and the software and treat lighting as an afterthought — a desk lamp and hope. Reverse it.

The 80% rule: lighting and optics determine whether the feature is visible and stable in the image. If it is, almost any decent algorithm finds it. If it is not, no algorithm — classic or neural — will reliably recover it. Most vision engineers reckon lighting alone is 80% of the battle.

A concrete example: measuring the diameter of a turned steel pin. With a ring light you fight specular hotspots that move with every part and blow out the edge; your edge tool jitters by several pixels. Swap to a collimated backlight and the pin becomes a black silhouette on a bright field — the edge is now a high-contrast, sub-pixel-clean transition, repeatable to a fraction of a pixel. Same camera, same software, 10× better result, because the lighting did the work.

Image sensors: CMOS, shutters, format

The sensor is the photon-to-pixel converter, and four properties dominate selection: technology (CMOS vs CCD), shutter (global vs rolling), resolution/pixel size, and colour vs mono.

CMOS has won — but know why CCD lasted

CCD dominated industrial imaging into the 2010s for its uniformity and low noise. CMOS has since taken over almost completely because modern sensors — especially Sony's Pregius global-shutter line (IMX250/253/255, IMX540s) and the Pregius S stacked BSI family — match or beat CCD on noise and dynamic range while adding speed, lower power, and on-chip features. In 2026, unless you have a legacy reason, you are buying CMOS. Sony Pregius is the de-facto standard sensor family behind cameras from Basler, FLIR/Teledyne, Allied Vision, IDS, and others.

Global vs rolling shutter — the one that bites people

A rolling shutter exposes the sensor row by row, so different rows capture slightly different instants. On a static, well-lit scene that is fine. On a moving part it produces skew (a vertical line photographed during motion leans) and, with pulsed light, banding. A global shutter exposes every pixel simultaneously, freezing motion cleanly.

Rule: if the part moves during exposure — conveyor, indexer, robot end-of-arm, web — use global shutter. Rolling shutter is for static inspection or where you can fully stop the part.

You can sometimes rescue a rolling-shutter sensor with a short, bright strobe that ends before the rows finish reading out, but it is a workaround; a global-shutter sensor is the clean answer for motion.

Resolution, pixel size, and format

Resolution is the pixel count (e.g., 5 MP = 2448 × 2048). Pixel size (e.g., 3.45 µm, 2.74 µm, 4.5 µm) sets how much light each pixel gathers — bigger pixels mean better low-light/SNR but a larger, costlier sensor for the same count. Sensor format (1/2.9", 2/3", 1.1", APS-C-ish) must be matched by the lens image circle — a lens that only covers 2/3" will vignette badly on a 1.1" sensor.

Rule: the lens image circle must be ≥ the sensor diagonal, or you get dark, blurred corners. Always check the lens spec against the sensor format.

Mono vs colour, and NIR

Prefer mono unless colour carries information you need. A mono sensor has no Bayer filter, so it is more sensitive, sharper (no demosaic interpolation), and resolves finer detail at the same pixel count — better for measurement, defect, and code reading. Use colour only when the inspection genuinely depends on hue (sorting by colour, verifying a coloured wire, print colour QC). Many "colour" problems are better solved with a mono camera and a coloured light or filter. NIR-enhanced mono sensors (sensitive past 800–1000 nm) shine for seeing through certain inks/plastics, reducing glare, and IR-illuminated scenes.

Property	CCD	CMOS rolling shutter	CMOS global shutter
Status in 2026	Legacy	Common (cheap)	Industrial default
Motion handling	Good (global)	Poor — skew/smear	Excellent — freezes motion
Typical use	Legacy lines	Static scenes, microscopy	Conveyors, robots, line work
Read noise / DR	Very good	Good	Very good (Pregius S)
Example sensors	Sony ICX	Sony IMX2xx (RS variants)	Sony IMX250/253/Pregius S
Cost per pixel	High	Low	Moderate
Power	High	Low	Low–moderate

Industrial cameras and interfaces

The camera packages the sensor with readout electronics, a lens mount (C-mount up to ~16 MP/1.1"; larger M42/F-mount/M58 for big sensors), and a digital interface. Two architectural splits matter: area vs line scan, and the interface standard.

Area scan vs line scan

Area scan cameras capture a 2D frame at once — the default for discrete parts. Line scan cameras image a single line (e.g., 2k–16k pixels wide) thousands of times per second and build the image as the part moves under them. Line scan is the right tool for continuous web (paper, film, textile, metal coil), cylindrical parts rotated under the camera, and very high-resolution flat inspection where a single area sensor would be impractical. Line scan demands precise motion (usually an encoder driving line triggers) and serious lighting, but it delivers enormous effective resolution and no seams.

The interfaces

Pick the interface from your data rate, cable length, and camera count:

Data rate (bytes/s) = Width_px × Height_px × bytes_per_pixel × frame_rate

Example: 5 MP (2448 × 2048) mono 8-bit at 30 fps
  = 2448 × 2048 × 1 × 30
  ≈ 150 MB/s  →  exceeds GigE (~115 MB/s), fits USB3 / 5GigE / CoaXPress

GigE Vision — Gigabit Ethernet, ~115 MB/s usable, up to 100 m on Cat-5e/6, PoE option, easy multi-camera via switches. The workhorse for distributed and multi-camera systems. 5GigE and 10GigE extend the bandwidth on the same cabling philosophy.
USB3 Vision — ~350–400 MB/s usable, cheap, simple, but cable length limited to ~3–5 m (active cables further). Great for a single camera near the PC.
Camera Link: deterministic, low-latency parallel-ish interface, up to ~6.8 Gbit/s (Deca; Full is ~5.4 Gbit/s), needs a frame grabber and short (~10 m) cables. Long the high-speed line-scan standard; being displaced by CoaXPress.
CoaXPress (CXP) — coax cable, up to 12.5 Gbit/s per lane (CXP-12), aggregate >50 Gbit/s with multiple lanes, ~40 m reach, power-over-coax, needs a frame grabber. The modern choice for high-speed, high-res, and demanding line scan.

Rule: compute the data rate first, then add headroom. A camera that can run 100 fps does not have to: the interface caps the sustained frame rate, and buying a faster interface than the job needs costs money and adds cabling complexity.
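The data-rate check above can be sketched in a few lines of Python. The bandwidth table reuses the approximate usable figures quoted in this guide; the 20% headroom default and the function names are illustrative assumptions, not a standard.

```python
# Approximate usable bandwidth per interface in MB/s, as quoted in this guide.
USABLE_MB_PER_S = {
    "GigE Vision": 115,
    "USB3 Vision": 350,
    "5GigE": 575,
    "10GigE": 1100,
}


def data_rate_mb_per_s(width_px, height_px, bytes_per_px, fps):
    """Raw stream rate in MB/s (1 MB = 1e6 bytes)."""
    return width_px * height_px * bytes_per_px * fps / 1e6


def interfaces_that_fit(rate_mb_per_s, headroom=0.2):
    """Interfaces whose usable bandwidth covers the rate plus headroom.

    The 20% default headroom is an assumption for illustration.
    """
    needed = rate_mb_per_s * (1 + headroom)
    return [name for name, bw in USABLE_MB_PER_S.items() if bw >= needed]


# The 5 MP mono 8-bit, 30 fps example from the text.
rate = data_rate_mb_per_s(2448, 2048, 1, 30)
print(f"{rate:.1f} MB/s")          # 150.4 MB/s
print(interfaces_that_fit(rate))   # ['USB3 Vision', '5GigE', '10GigE']
```

Frame-grabber interfaces (Camera Link, CoaXPress) are left out of the table only to keep the sketch short; they can be added the same way.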

Smart camera vs PC-based

A smart camera (Cognex In-Sight, Keyence CV/IV/XG series, Datalogic, Omron) integrates sensor, optics mount, lighting drive, processor, and I/O in one IP67 housing, programmed through a guided environment. It deploys fast, survives the factory, and is ideal for one to a few well-defined checks per station. The ceiling is compute and flexibility.

A PC-based system (cameras + frame grabber/NIC into an industrial PC running Cognex VisionPro, MVTec Halcon, or OpenCV/custom) wins when you have many cameras, heavy computation, deep learning, or algorithms the smart camera's library cannot express. You pay in integration effort and a box that needs to survive the cabinet.

Interface / type	Bandwidth	Max cable	Frame grabber?	Best for
GigE Vision	~115 MB/s (1 GbE)	100 m (Cat-5e/6)	No (NIC)	Distance, multi-camera, PoE
5/10GigE	~575 MB/s / ~1.1 GB/s	~100 m / shorter	No (NIC)	Higher-res over Ethernet
USB3 Vision	~350–400 MB/s	~3–5 m	No	Single camera near PC
Camera Link	up to ~850 MB/s (Deca)	~10 m	Yes	Legacy high-speed line scan
CoaXPress (CXP-12)	12.5 Gbit/s/lane (×N)	~40 m	Yes	High-speed area & line scan
Smart camera	n/a (onboard)	n/a	No	Fast deploy, 1–few checks

Optics and lenses

The lens decides field of view, working distance, depth of field, and whether the sensor's pixels actually resolve anything. A great sensor behind a soft lens is wasted money.

The FoV / working-distance / sensor-size relationship

The governing relationship for a standard (entocentric) lens is similar triangles between the sensor and the scene:

Magnification  m = sensor_dimension / FoV   (also = focal_length / working_distance, approx.)

FoV ≈ (sensor_dimension × working_distance) / focal_length

Rearranged to pick a focal length:
focal_length ≈ (sensor_dimension × working_distance) / FoV

Worked example — you need a 100 mm wide FoV, the part sits 300 mm from the lens, and you are using a 2/3" sensor (8.45 mm wide):

focal_length ≈ (8.45 mm × 300 mm) / 100 mm ≈ 25 mm

So a 25 mm lens gets you close; you trim with working distance. Note the levers: longer focal length → narrower FoV (more zoom); longer working distance → wider FoV. You cannot freely change all three — pick two and the third follows.
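The three-way trade can be played with numerically. This is a minimal sketch of the similar-triangles approximation above; the function names are made up for illustration, and the thin-lens simplification ignores the lens's own principal-plane offsets.

```python
def focal_length_for(sensor_mm, working_distance_mm, fov_mm):
    """Focal length (mm) that gives the requested FoV at this working distance."""
    return sensor_mm * working_distance_mm / fov_mm


def fov_for(sensor_mm, working_distance_mm, focal_mm):
    """Field of view (mm) produced by a given lens at this working distance."""
    return sensor_mm * working_distance_mm / focal_mm


def working_distance_for(sensor_mm, focal_mm, fov_mm):
    """Working distance (mm) at which a given lens yields the requested FoV."""
    return fov_mm * focal_mm / sensor_mm


# Worked example from the text: 8.45 mm wide sensor, 300 mm away, 100 mm FoV.
print(focal_length_for(8.45, 300, 100))     # about 25.35 mm, so pick a 25 mm lens
print(fov_for(8.45, 300, 25))               # about 101.4 mm with the stock 25 mm lens
print(working_distance_for(8.45, 25, 100))  # about 295.9 mm to trim back to 100 mm
```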

Depth of field and aperture

Depth of field (DoF) is the range of working distance over which the image stays acceptably sharp. It grows with a smaller aperture (higher f-number, e.g., f/8 vs f/2.8) and shrinks with magnification. But stopping down costs light (you compensate with brighter lighting or longer exposure) and, past a point, diffraction softens the image — the diffraction-limited spot grows with f-number, so f/16 on a small-pixel sensor can be blurrier than f/5.6. There is a sweet spot, usually around f/4–f/8 for industrial lenses.

Rule: open the aperture for light and resolution, close it for depth of field — and stop before diffraction eats your sharpness. For most industrial work, f/4–f/8 is the productive band.
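To put a number on the diffraction limit, the Airy disk diameter (to the first dark ring) is roughly 2.44 × wavelength × f-number. The sketch below assumes green light at 0.55 µm and treats the marked f-number as the working f-number, which is only reasonable at low magnification.

```python
def airy_disk_diameter_um(f_number, wavelength_um=0.55):
    """Diffraction-limited spot diameter in µm (Airy disk, first minimum).

    wavelength_um defaults to 0.55 µm (green), an assumption for illustration.
    """
    return 2.44 * wavelength_um * f_number


for n in (2.8, 5.6, 8, 16):
    print(f"f/{n}: {airy_disk_diameter_um(n):.1f} um")
```

At f/16 the spot is about 21.5 µm across, several times the pitch of a 2.74 µm or 3.45 µm pixel, which is why stopping down that far softens the image on small-pixel sensors.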

Resolution and MTF

A lens's resolving power is described by its MTF (modulation transfer function): how much contrast it preserves at a given spatial frequency (line pairs/mm). The lens MTF must support your pixel pitch — a lens that resolves 100 lp/mm pairs poorly with a 2.74 µm-pixel sensor (which wants ~180 lp/mm). Buy lenses rated for your sensor's resolution class; a "5 MP lens" on a 12 MP sensor throws away pixels. For high-res sensors, the lens is often the limiting factor, not the camera.
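The "~180 lp/mm" figure comes from the sensor's Nyquist frequency: one line pair needs two pixels. A small sketch (the function name is illustrative):

```python
def nyquist_lp_per_mm(pixel_pitch_um):
    """Sensor Nyquist frequency in line pairs per mm: one line pair spans 2 pixels."""
    return 1000.0 / (2.0 * pixel_pitch_um)


print(round(nyquist_lp_per_mm(2.74)))  # 182
print(round(nyquist_lp_per_mm(3.45)))  # 145
```

Compare this number against the lens's MTF chart at the same spatial frequency, not just the "megapixel" label on the box.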

Telecentric lenses — why metrology demands them

A standard lens has perspective: closer objects look bigger. So if your part's height varies, or it shifts in working distance, its apparent size changes — a fatal error for gauging, often 1–5% per few mm of position change. A telecentric lens has its entrance pupil at infinity, so within its (limited) telecentric range, magnification is constant regardless of object distance. A part that moves toward or away from the lens does not change size in the image, and there is no perspective distortion of edges.

The price: a telecentric lens's front element must be at least as large as the FoV (so a telecentric lens covering a 50 mm FoV is physically big and expensive), and the working range is limited. But for precision dimensional measurement (gear teeth, connector pins, machined parts), telecentric is the only honest choice. Pair it with a collimated telecentric backlight and you get the cleanest possible measurement geometry.

Rule: for measurement to better than ~1%, use a telecentric lens. For locate/inspect/read where a few percent perspective is harmless, a standard fixed-focal lens is fine and far cheaper.
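The size of the perspective error is easy to estimate. For a standard lens, magnification is approximately focal_length / working_distance, so apparent size scales as 1 / working_distance. The sketch below uses that approximation; the example distances and the 20 mm feature are assumed values.

```python
def apparent_size_change(working_distance_mm, shift_mm):
    """Fractional change in apparent size when the part moves by shift_mm.

    Negative shift_mm means the part moved toward the lens.
    Uses the approximation: apparent size is proportional to 1 / working distance.
    """
    return working_distance_mm / (working_distance_mm + shift_mm) - 1.0


change = apparent_size_change(300, -3)    # part sits 3 mm closer than nominal
print(f"{change * 100:.2f} %")            # about 1.01 %
print(f"{20.0 * change:.3f} mm")          # about 0.202 mm error on a 20 mm feature
```

A 0.2 mm error from a 3 mm height shift swamps a ±0.05 mm tolerance, which is the case for telecentric optics in gauging.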

Lighting: the most neglected 80%

If you remember one thing from this guide: the right light makes the feature obvious; the wrong light makes it impossible. Lighting controls contrast, suppresses glare, and selectively reveals texture, edges, or surface defects. It is the highest-leverage, lowest-cost decision on the whole station, and it is the one most often skipped.

Two axes: geometry (where the light comes from relative to the camera and part) and spectrum (wavelength/colour). Plus the temporal choice: strobe vs continuous.

Geometry — the techniques

Ring light — LEDs around the lens, frontal, general-purpose illumination. Easy and bright, but it creates specular hotspots on shiny/curved parts. Fine for matte, flat features; trouble for reflective ones.
Bar / linear light — directional grazing or floodlight; angled low it casts shadows that emphasize embossing, scratches, and surface relief.
Dome ("cloudy day") light — diffuse light from a hemispherical dome, near-shadowless and glare-free. The answer for shiny, curved, or specular parts (foil seals, polished metal) where you want even illumination without hotspots.
Backlight — light behind the part, camera sees a silhouette. The single best choice for measurement and presence of edges/holes: maximum contrast, sub-pixel-clean edges, immune to surface texture and colour. Use a collimated/telecentric backlight with a telecentric lens for the cleanest gauging.
Coaxial (on-axis) light — light injected through a beamsplitter so it travels along the optical axis; flat specular surfaces (wafers, polished metal, glass) reflect straight back and look bright, while tilted/textured features go dark. Excellent for flat reflective surfaces and reading marks on them.
Dark-field — light at a very low angle so a flat surface looks dark and only edges, scratches, and raised defects scatter light back to the camera. Superb for surface scratch detection and engraved/laser marks.

Spectrum and strobe

Colour matters. Red (~625 nm) is cheap, bright, and gives sharp images (less chromatic blur, good with mono sensors); blue (~470 nm) gives finer detail (shorter wavelength) and good contrast on red/metallic parts; IR (850–940 nm) reduces glare, sees through some plastics/inks, and ignores ambient colour; UV (~365–405 nm) excites fluorescence for invisible-mark and adhesive verification. A classic trick: use a coloured light and a mono camera to make a feature pop. For example, a red mark on a white background vanishes under red light (both bright) but stands out under blue (the red mark goes dark while the white stays bright).

Strobe vs continuous. A strobed (pulsed) light fired in sync with a short exposure freezes fast motion and lets you over-drive LEDs far above continuous rating for a brief, bright flash — essential for high-speed lines. Continuous light is simpler and fine for slow or static inspection. Strobing also fights ambient light: a bright, short pulse swamps room lighting during the exposure window.

Rule: enclose the station and control ambient light. The most repeatable vision systems are in shrouds or enclosures; the flakiest are open to a window, a forklift's headlights, and the seasonal sun.

Technique	Geometry	Reveals	Best for	Watch out for
Ring	Frontal, around lens	General surface	Matte, flat features	Glare on shiny/curved parts
Bar / linear	Angled / grazing	Relief, texture, scratches	Embossing, weld, surface	Uneven field if mis-aimed
Dome	Diffuse, hemispherical	Even, shadowless	Shiny/curved, foil, metal	Bulky; lower intensity
Backlight	Behind part	Silhouette / edges	Measurement, holes, presence	Only outlines, not surface
Coaxial (on-axis)	Along optical axis	Flat specular detail	Wafers, polished metal, marks	Needs flat, normal surface
Dark-field	Very low angle	Edges, scratches, marks	Surface defects, engraving	Dark overall; tight geometry

The measurement chain and calibration

A vision measurement is only as good as its weakest link: feature → photons → optics → pixels → algorithm → millimetres. Calibration is what makes the last step legitimate.

Pixels per millimetre and spatial resolution

Spatial resolution — pixels per millimetre (PPM), the inverse of which is millimetres per pixel — is the bridge between image and world:

PPM = sensor_resolution_px / FoV_mm
mm_per_pixel = FoV_mm / sensor_resolution_px

Example: 2448 px across a 100 mm FoV
  PPM = 2448 / 100 ≈ 24.5 px/mm
  mm/px = 100 / 2448 ≈ 0.041 mm/px (41 µm/px)

The Nyquist rule for feature detection

To reliably detect (not just measure) a feature, you need enough pixels across it. The sampling theorem says you need at least 2 pixels across the smallest feature to register it at all, but in practice noise and reliability push you to 3–5 pixels minimum across the smallest defect or feature you must catch:

Required PPM = (pixels_across_feature) / smallest_feature_mm

Example: must detect a 0.2 mm scratch, want 4 px across it
  Required PPM = 4 / 0.2 = 20 px/mm
  → at that PPM, a 100 mm FoV needs 100 × 20 = 2000 px → a ≥3 MP sensor

Rule: detection needs ~3–5 px across the feature; measurement to a tolerance needs that plus sub-pixel edge fitting and calibration. If you only have 2 px on the feature, you are gambling.
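The feature-size-to-pixels math is worth scripting so it gets done before anyone opens a camera catalogue. A minimal sketch with illustrative function names:

```python
import math


def required_ppm(pixels_across_feature, smallest_feature_mm):
    """Pixels per millimetre needed to put N pixels across the smallest feature."""
    return pixels_across_feature / smallest_feature_mm


def required_sensor_px(fov_mm, ppm):
    """Pixels needed along one axis to cover the FoV at the given PPM."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(fov_mm * ppm, 6))


ppm = required_ppm(4, 0.2)            # 4 px across a 0.2 mm scratch
print(ppm)                            # 20.0 px/mm
print(required_sensor_px(100, ppm))   # 2000 px across a 100 mm FoV
```

Run it for both image axes: the sensor has to satisfy the tighter of the two.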

Sub-pixel and accuracy

Good edge and pattern tools fit the intensity profile to find an edge to a fraction of a pixel — typically 1/10 to 1/40 of a pixel under clean, high-contrast conditions. So at 41 µm/px, a 1/20-pixel edge tool can repeat to ~2 µm. But sub-pixel is a precision claim, not an accuracy claim: accuracy also requires removing lens distortion and establishing the true scale, which is what calibration does. Distinguish repeatability (same part, same number) from accuracy (number matches a traceable standard) — quote both.
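To make "fitting the intensity profile" concrete, here is one simple way to do it: take the gradient along a search line, find its peak, and refine the peak with a three-point parabola. This is a textbook sketch, not how any particular commercial caliper tool is implemented, and the sample profiles are synthetic.

```python
def subpixel_edge(profile):
    """Locate the strongest edge in a 1D intensity profile, in pixel units.

    Needs at least 5 samples. Returns a float index into the profile.
    """
    # grad[i] is the central difference centred on sample i + 1
    grad = [abs(profile[i + 2] - profile[i]) / 2 for i in range(len(profile) - 2)]
    # strongest interior gradient sample (needs a neighbour on each side)
    k = max(range(1, len(grad) - 1), key=lambda i: grad[i])
    g_l, g_c, g_r = grad[k - 1], grad[k], grad[k + 1]
    # vertex of the parabola through the three gradient samples
    denom = g_l - 2 * g_c + g_r
    offset = 0.0 if denom == 0 else 0.5 * (g_l - g_r) / denom
    return (k + 1) + offset


print(subpixel_edge([10, 10, 10, 10, 60, 110, 110, 110]))   # 4.0
print(subpixel_edge([10, 10, 10, 10, 110, 110, 110, 110]))  # 3.5
```

The second profile has its step between two samples, and the fit reports the half-pixel position. Real images add noise and blur, which is what limits the achievable fraction of a pixel.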

What calibration actually does

You image a precision target (a dot grid or checkerboard of known spacing) at the working distance. The software then:

builds the pixel-to-mm map (the real scale, not the nominal one),
removes lens distortion (barrel/pincushion) so straight edges measure straight,
corrects for perspective if the camera is not perfectly perpendicular,
and, for guidance, ties the image frame to the robot or stage coordinate frame.

Skip calibration and your measurements have arbitrary units and uncorrected distortion that grows toward the image edges. For metrology, verify against a traceable artifact (gauge block, calibrated ring) and track gauge R&R.
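The first step, building the real pixel-to-mm scale, can be illustrated with a least-squares fit over dot centres of known pitch. This sketch covers only the scale along one axis and assumes the dot centres have already been found; it does not model distortion or perspective, which a full calibration also corrects. The pixel positions are invented for the example.

```python
def fit_mm_per_pixel(pixel_positions, pitch_mm):
    """Least-squares mm-per-pixel from evenly spaced dot centres along one row."""
    n = len(pixel_positions)
    world = [i * pitch_mm for i in range(n)]   # known positions on the target
    mean_px = sum(pixel_positions) / n
    mean_w = sum(world) / n
    # slope of world position (mm) against image position (px)
    num = sum((p - mean_px) * (w - mean_w) for p, w in zip(pixel_positions, world))
    den = sum((p - mean_px) ** 2 for p in pixel_positions)
    return num / den


# Hypothetical dot centres (px) on a target with 10 mm pitch
centres = [100.0, 344.8, 589.6, 834.4]
print(f"{fit_mm_per_pixel(centres, 10.0) * 1000:.2f} um/px")  # about 40.85 um/px
```

Fitting across many dots averages out the localization noise of any single dot, which is why a grid beats a two-point scale check.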

Vision processing: classic vs deep learning

The algorithm runs after the imaging is right. In 2026 you choose between mature classic (rules-based) tools and deep-learning models — and the engineering skill is knowing which fits which problem.

Classic / rules-based tools

These are deterministic, fast, explainable, and need no training data:

Blob analysis — segment by threshold, count/measure connected regions (presence, area, count, centroid).
Edge / caliper tools — find edges along a search line to sub-pixel, measure distances, widths, diameters. The backbone of gauging.
Template / pattern matching — find a learned shape. Modern geometric pattern matching (Cognex PatMax, Halcon shape-based matching) is rotation-, scale-, and partial-occlusion-tolerant and is the standard for locate/guidance.
OCR/OCV — read or verify printed/marked characters against a font library.
Barcode/2D code reading — decode 1D, QR, DataMatrix, including damaged codes.

When the spec is crisp — a dimension, a known shape, a code, a present/absent — classic tools are faster, cheaper, fully explainable, validate cleanly for regulated industries, and do not drift. They are the right default for locate, measure, and read.
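As an illustration of how little machinery a classic tool needs, here is a minimal blob analysis in plain Python: threshold, flood-fill the 4-connected regions, and report area and centroid. It is a teaching sketch under simple assumptions (bright blobs on a dark background, a fixed threshold), not a substitute for a vendor tool, and the tiny test image is synthetic.

```python
from collections import deque


def find_blobs(image, threshold):
    """Label 4-connected regions of pixels brighter than threshold.

    image is a list of rows of grey values. Returns a list of dicts with
    'area' (pixel count) and 'centroid' as (row, col).
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # breadth-first flood fill from this unvisited bright pixel
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            area = len(pixels)
            centroid = (sum(p[0] for p in pixels) / area,
                        sum(p[1] for p in pixels) / area)
            blobs.append({"area": area, "centroid": centroid})
    return blobs


image = [
    [0,   0,   0, 0,   0],
    [0, 200, 200, 0,   0],
    [0, 200, 200, 0, 180],
    [0,   0,   0, 0, 180],
]
for blob in find_blobs(image, threshold=128):
    print(blob)
```

A presence check then reduces to a one-sentence rule with numbers, for example "exactly one blob with area between A and B", which is the kind of spec classic tools handle well.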

Deep learning

CNN-based tools (Cognex ViDi/Deep Learning, Halcon Deep Learning, Keyence, plus open frameworks) shine where "defect" is hard to define in pixels and easy to show by example:

Defect detection / anomaly — scratches, stains, weave irregularities on variable surfaces (textiles, castings, food) where appearance varies part-to-part.
Classification — sort into categories that resist explicit rules.
OCR on hard text — deformed, low-contrast, varied fonts where classic OCR fails.
Segmentation — pixel-level defect mapping.

The cost: training data (hundreds to thousands of labelled images, including enough defects), GPU or NPU inference, less explainability, and the risk of drift when the process changes. For surface/cosmetic inspection, deep learning often wins decisively. For a measurement, it is the wrong tool.

Rule: if you can write the pass/fail rule in one sentence with numbers, use classic vision. If you can only define it by pointing at examples, reach for deep learning — and budget for the labelled dataset.

Edge inference

Inference increasingly runs at the edge — on the smart camera, an industrial PC, or an NPU/GPU accelerator near the line — to keep cycle time deterministic and avoid shipping every frame to a server. A modern smart camera or vision controller running an optimized CNN can classify in a few milliseconds, well inside a typical sort budget. This dovetails with the determinism concerns in real-time control.

Robot and vision integration

Vision-guided robotics is where 2D vision earns its keep on a line: the camera finds the part, the robot picks it. The hard part is geometry — getting the camera and robot to agree on where "there" is.

Eye-in-hand vs eye-to-hand

Eye-in-hand — the camera is mounted on the robot wrist/flange. It moves with the arm, so it can look closely and from multiple poses, and one camera can serve a large workspace. The transform you solve for is camera-to-flange (constant). Great for inspection-while-moving and adaptive picking; the trade is the camera rides the arm's vibration and cabling.
Eye-to-hand — a fixed camera looks at the workspace (e.g., overhead a conveyor). Simpler mechanically, stable, and ideal when the parts come to a known region. The transform is camera-to-robot-base (constant). The trade is fixed FoV and possible occlusion by the arm.

Hand-eye calibration

Either way, you must find the rigid transform between the camera's coordinate frame and the robot's. Hand-eye calibration solves the classic AX = XB problem: you move the robot to several known poses while imaging a calibration target, and from the corresponding robot poses and image observations you solve for the unknown transform X. Done well, it lets the robot convert a pixel location into a pick pose in its own base frame. The math lives in the motion planning & kinematics world; the robot side is in industrial robot arms.
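Once calibration has produced the transform, using it is simple. The sketch below shows the planar, eye-to-hand case only: a calibrated pixel location is scaled to millimetres, rotated, and translated into the robot base frame. It does not solve AX = XB; the scale, angle, and offsets are hypothetical values standing in for what a calibration routine would return, and lens distortion is assumed to be corrected already.

```python
import math


def pixel_to_robot(u_px, v_px, mm_per_px, theta_deg, tx_mm, ty_mm):
    """Map an image point to robot base XY on a flat plane at known height.

    mm_per_px, theta_deg, tx_mm, ty_mm are assumed outputs of calibration.
    """
    # image point in camera-plane millimetres
    x_cam, y_cam = u_px * mm_per_px, v_px * mm_per_px
    th = math.radians(theta_deg)
    # 2D rigid transform: rotate by theta, then translate
    x = math.cos(th) * x_cam - math.sin(th) * y_cam + tx_mm
    y = math.sin(th) * x_cam + math.cos(th) * y_cam + ty_mm
    return x, y


x, y = pixel_to_robot(1000, 500, 0.04, 90.0, 250.0, -100.0)
print(f"pick at x={x:.1f} mm, y={y:.1f} mm")  # pick at x=230.0 mm, y=-60.0 mm
```

The part's rotation reported by the locate tool is offset by the same theta to get the gripper angle.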

2D, 2.5D, and the line to 3D

Pure 2D guidance gives x, y, and rotation on a known, flat plane — perfect for picking flat parts off a conveyor at a fixed height. 2.5D adds a coarse height (e.g., from focus or a known part). When parts are stacked, jumbled in a bin, or vary in pose in all six degrees of freedom, you have crossed into 3D vision — point clouds, structured light, depth cameras — which is a separate discipline covered in the LiDAR & depth cameras guide. Know the boundary: do not try to solve a random-bin-pick with a single 2D camera.

Picking from a moving belt

A common pattern is conveyor tracking: an encoder reports belt position, the camera (eye-to-hand, overhead) locates parts as they enter the FoV, and the robot — running a motion planning layer with the belt encoder — picks them on the fly. This is the canonical machine-vision-plus-robot job and it leans on every part of this guide: global shutter to freeze the part, a strobe synced to the trigger, and a clean image-to-robot transform. For collaborative cells where the robot shares space with people, see collaborative robots (cobots).
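The core of conveyor tracking is bookkeeping: latch the encoder count when the image is triggered, then advance the part's position by the belt travel since then. A minimal sketch, assuming the belt runs along the robot's +X axis, the encoder counter does not wrap, and the names and numbers are illustrative:

```python
def part_x_now(x_at_trigger_mm, counts_at_trigger, counts_now, mm_per_count):
    """Part position along the belt (robot X, mm) after the belt has moved."""
    travel_mm = (counts_now - counts_at_trigger) * mm_per_count
    return x_at_trigger_mm + travel_mm


# Part located at x = 120 mm when the camera fired; belt has since moved 4000 counts.
print(part_x_now(120.0, 10_000, 14_000, 0.05))  # about 320.0 mm
```

The encoder count must be latched by the same hardware trigger that fires the camera; reading it later in software adds a position error equal to belt speed times the delay.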

Triggering, timing, and the PLC interface

On a line, the vision system is a slave to the machine's timing. Getting the trigger, exposure, and result handshake right is what makes it production-grade rather than a demo.

Hardware trigger and exposure

A hardware trigger — a digital pulse from a photoeye, proximity sensor, PLC output, or encoder — tells the camera exactly when to grab. Software triggering over the bus has jitter you cannot tolerate at speed; hardware triggering is deterministic to microseconds. On the trigger, the camera opens the exposure for a set time (often very short, e.g., 50–500 µs, with a synchronized strobe) to freeze the part. Exposure choice trades motion blur against light: shorter exposure freezes motion but needs more light (brighter strobe, larger aperture).

Max exposure to keep blur under 1 px:
  exposure_max ≈ (mm_per_pixel) / part_speed_mm_per_s

Example: 0.04 mm/px, belt at 500 mm/s
  exposure_max ≈ 0.04 / 500 = 80 µs  → strobe a bright pulse within 80 µs
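The same rule as a small helper, useful when comparing several belt speeds or resolutions. This is a sketch of the formula above; the max_blur_px parameter is an added convenience for stations that can tolerate more or less than one pixel of blur.

```python
def max_exposure_s(mm_per_pixel, part_speed_mm_per_s, max_blur_px=1.0):
    """Longest exposure (seconds) that keeps motion blur within max_blur_px."""
    if part_speed_mm_per_s <= 0:
        raise ValueError("part speed must be positive")
    return max_blur_px * mm_per_pixel / part_speed_mm_per_s

# The example above: 0.04 mm/px with the belt at 500 mm/s.
exposure_us = max_exposure_s(0.04, 500.0) * 1e6
print(f"{exposure_us:.0f} us")
```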

The result handshake

After processing, the system must report a result before the next part arrives. Common paths:

Digital I/O — a pass/fail line plus a strobe/ready handshake. Simple, fast, deterministic.
Fieldbus — EtherNet/IP, PROFINET, or EtherCAT carrying richer data (which feature failed, measured value, part ID) to the PLC. The norm for modern lines; see industrial automation.
TCP/serial string — a result string to a robot controller or MES.

The handshake matters: the PLC must know the result corresponds to this part, not the previous one. Use a clear request/response or buffered FIFO scheme so a slow inspection cannot mis-associate a reject with the wrong part — at speed, an off-by-one rejects good parts and passes bad ones.
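One way to picture the buffered FIFO scheme is a queue of part IDs filled at trigger time and drained as results arrive. The sketch below is illustrative only: the class and method names are hypothetical, and a real station would implement the same idea in the PLC or over the fieldbus rather than in Python.

```python
from collections import deque

class ResultTracker:
    """Pairs each inspection result with the part that triggered it."""

    def __init__(self):
        self._pending = deque()   # part IDs in trigger order, oldest first
        self._next_id = 0

    def on_trigger(self):
        """Call when the hardware trigger fires; returns the new part's ID."""
        part_id = self._next_id
        self._next_id += 1
        self._pending.append(part_id)
        return part_id

    def on_result(self, part_id, passed):
        """Accept a result only if it belongs to the oldest pending part."""
        if not self._pending:
            raise RuntimeError("result received with no part pending")
        expected = self._pending[0]
        if part_id != expected:
            raise RuntimeError(f"result for part {part_id}, expected part {expected}")
        self._pending.popleft()
        return part_id, passed

    def pending_count(self):
        return len(self._pending)

tracker = ResultTracker()
first = tracker.on_trigger()
second = tracker.on_trigger()            # second part arrives before the first result
print(tracker.on_result(first, True))    # result is matched to the first part
print(tracker.pending_count())           # one part still awaiting its result
```

Raising a fault on a mismatch, rather than guessing, is the design choice that matters: a stopped line is cheaper than shipping parts whose results were shifted by one.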

Throughput budget

Cycle time is the sum of the chain, not just "the CPU":

cycle_time ≈ trigger_delay + exposure + sensor_readout + transfer + processing + result_out

Example (5 MP mono, 8-bit, GigE): trigger ~0 + exposure 0.2 ms + readout ~5 ms
  + transfer ~43 ms (5 MB frame at ~115 MB/s) + processing 10 ms + I/O 1 ms ≈ 59 ms
  → ~17 parts/s ceiling for ONE camera on a 1 GbE link

Rule: budget the whole chain. The interface transfer time and sensor readout are often bigger than the algorithm — which is exactly why interface choice (USB3/CXP) can buy more throughput than a faster PC. Real-time determinism here connects to real-time control.
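A minimal sketch of the budget as code, so the transfer term is computed from frame size and link rate rather than guessed. It assumes a 2448 × 2048 sensor (a common 5 MP format), 8-bit mono pixels, roughly 115 MB/s usable on 1 GbE, and strictly sequential stages with no overlap between exposure, transfer, and processing. The stage names mirror the formula above; the other timings are the example's round numbers.

```python
def transfer_ms(width_px, height_px, bytes_per_px, link_mb_per_s):
    """Time to move one frame over the interface, in milliseconds."""
    frame_mb = width_px * height_px * bytes_per_px / 1e6
    return frame_mb / link_mb_per_s * 1e3

def cycle_time_ms(stages):
    """Sum of all stages, assuming they run one after another."""
    return sum(stages.values())

stages = {
    "trigger_delay": 0.0,
    "exposure": 0.2,
    "sensor_readout": 5.0,
    "transfer": transfer_ms(2448, 2048, 1, 115.0),   # assumed sensor and link rate
    "processing": 10.0,
    "result_out": 1.0,
}

total = cycle_time_ms(stages)
print(f"transfer {stages['transfer']:.1f} ms, cycle {total:.1f} ms, "
      f"ceiling {1000.0 / total:.1f} parts/s")
```

Swapping link_mb_per_s for a faster interface shows how much of the cycle is the cable rather than the algorithm.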

Applications and accuracy expectations

Mapping the four tasks onto real jobs, with the accuracy you can honestly expect.

Presence / absence and assembly verification

The bread-and-butter check: is the cap on, the gasket seated, the connector latched, all the screws present? Usually solved with blob or pattern tools and good lighting. Reliability is excellent (often >99.9%) when the feature is well-lit and contrasty. Assembly verification extends this — counting components, checking orientation, confirming the right variant is built.

Gauging / dimensional measurement

Measuring diameters, gaps, widths, angles, position-to-tolerance. With a telecentric lens + backlight + calibration, repeatability of ±0.5–5 µm is achievable on the right setup; field accuracy of ±0.01–0.05 mm is realistic on a well-built station. Without telecentric optics, expect worse and beware perspective error. This is the most demanding task and the one where cutting optical corners shows up immediately as gauge R&R failure.
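The micron-level repeatability follows from pixel scale times sub-pixel edge resolution. As a rough sanity check, assuming edge tools that resolve about 1/10 to 1/40 of a pixel and a hypothetical station at 0.02 mm/px:

```python
def edge_resolution_um(mm_per_pixel, subpixel_fraction):
    """Smallest edge shift the tool can resolve, in micrometres."""
    return mm_per_pixel * subpixel_fraction * 1000.0

mm_per_pixel = 0.02   # hypothetical station scale
coarse = edge_resolution_um(mm_per_pixel, 1 / 10)
fine = edge_resolution_um(mm_per_pixel, 1 / 40)
print(f"{fine:.1f} to {coarse:.1f} um")
```

This is a best-case figure for the edge tool alone; lighting stability, lens quality, vibration, and temperature all add to what a gauge R&R study will actually report.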

Surface defect inspection

Scratches, dents, contamination, stains, porosity, print quality. Lighting is everything here — dark-field for scratches, dome for shiny surfaces, grazing bar light for relief. For well-defined defects, classic tools; for variable cosmetic defects, deep learning. Catch rates depend entirely on whether the defect is visible under the chosen light; the algorithm is secondary.

Code reading and OCR/OCV

Reading 1D barcodes, QR, and DataMatrix (the dominant 2D code for direct part marking, or DPM, by laser or dot-peen), and reading or verifying printed text. Dedicated reader algorithms from vendors such as Cognex and Keyence can often decode degraded codes that look unreadable to the eye. Verification (OCV) confirms the right text is present and legible. Expect near-100% read rates on quality codes with proper lighting (often coaxial or dome for DPM on metal); poor marks drag rates down fast.

Web / continuous inspection

Line-scan inspection of paper, film, foil, textile, glass, metal coil at high speed, flagging defects per metre. High resolution, encoder-synced, heavy lighting and bandwidth (CoaXPress territory).

Designing and selecting a vision system

This is the workflow that prevents the expensive mistakes. Work outward from the feature, never inward from a camera you already own.

The spec-out sequence

1. Define the task(s): locate / measure / inspect / read — per station.
2. Identify the smallest critical feature (mm) and the tolerance (mm).
3. Set pixels-across-feature: 3–5 px for detect; more + sub-pixel for measure.
4. Compute required PPM:    PPM = pixels_across_feature / feature_mm
5. Compute required sensor: sensor_px = PPM × FoV_mm   (do per axis)
6. Choose sensor (round UP to a real resolution; add margin) + shutter type.
7. Choose lens: focal_length ≈ (sensor_dim × working_distance) / FoV
   — telecentric if measuring to <~1%.
8. Choose lighting geometry + wavelength for the feature/surface.
9. Choose interface from data rate + cable length + camera count.
10. Choose architecture: smart camera vs PC-based (Halcon/VisionPro/OpenCV).
11. Define trigger, exposure/strobe, and the PLC/robot handshake.
12. Calibrate, validate against a standard, measure gauge R&R.
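Steps 4 to 7 are plain arithmetic and can be sketched as a few helpers. The sensor catalogue, the 10% margin, and the example numbers (including the 3.45 µm pixel pitch used to get the sensor width) are assumptions for illustration; substitute the sensors and lenses you can actually buy. The focal-length formula is the thin-lens approximation from step 7 and does not apply to telecentric lenses.

```python
import math

# Illustrative catalogue of common industrial sensor formats (width, height) in pixels.
SENSOR_FORMATS = [(1440, 1080), (1920, 1200), (2448, 2048), (4096, 3000)]

def required_ppm(pixels_across_feature, feature_mm):
    """Step 4: pixels per millimetre needed at the object."""
    return pixels_across_feature / feature_mm

def required_sensor_px(ppm, fov_x_mm, fov_y_mm):
    """Step 5: pixels needed per axis (rounded up; round() absorbs float noise)."""
    return (math.ceil(round(ppm * fov_x_mm, 6)),
            math.ceil(round(ppm * fov_y_mm, 6)))

def pick_sensor(need_x, need_y, margin=1.1):
    """Step 6: smallest catalogue format that covers the need plus margin."""
    for width, height in SENSOR_FORMATS:
        if width >= need_x * margin and height >= need_y * margin:
            return (width, height)
    return None   # nothing in the catalogue is large enough

def focal_length_mm(sensor_dim_mm, working_distance_mm, fov_mm):
    """Step 7: approximate focal length for a standard (non-telecentric) lens."""
    return sensor_dim_mm * working_distance_mm / fov_mm

# Hypothetical detection job: 4 px across a 0.5 mm feature, 100 mm x 80 mm FoV.
ppm = required_ppm(4, 0.5)
need = required_sensor_px(ppm, 100.0, 80.0)
sensor = pick_sensor(*need)
sensor_width_mm = sensor[0] * 3.45e-3     # assumed 3.45 um pixel pitch
print(ppm, need, sensor, round(focal_length_mm(sensor_width_mm, 300.0, 100.0), 1))
```

The computed focal length is a starting point; in practice you pick the nearest stock focal length and adjust the working distance to hit the field of view.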

Worked sizing example

You must hold a 0.10 mm tolerance on a connector pin across a 40 mm × 30 mm field. The smallest critical feature is 0.3 mm, and because this is a measurement task you want 5 px across it plus sub-pixel edge fitting:

Detect/measure target: 5 px across 0.3 mm → required PPM = 5 / 0.3 ≈ 16.7 px/mm
Sensor needed: X = 16.7 × 40 ≈ 668 px ; Y = 16.7 × 30 ≈ 500 px
  → tiny by detection rules, BUT tolerance is 0.10 mm:
mm/px must be << 0.10; aim for sub-pixel margin → target ~0.02 mm/px (≈50 px/mm)
  X = 50 × 40 = 2000 px ; Y = 50 × 30 = 1500 px → ≥3 MP sensor
Optics: telecentric (0.10 mm tolerance) sized for ≥40 mm FoV.
Lighting: collimated/telecentric backlight for clean silhouette edges.

Notice that the measurement tolerance, not the detection rule, sets the resolution. That is the usual outcome for gauging: tolerance dominates.

Rule: when measuring, let the tolerance drive PPM (aim for the tolerance to span many pixels so sub-pixel fitting has room); when only detecting, let the feature size drive it. Confusing the two is the most common sizing error.
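The rule can be expressed as taking the larger of two PPM requirements. In this sketch the default of 5 pixels per tolerance band is an assumption inferred from the worked example (0.10 mm tolerance at a 0.02 mm/px target); choose your own value based on how much sub-pixel margin your edge tool needs.

```python
def detection_ppm(pixels_across, feature_mm):
    """PPM needed so the smallest feature spans enough pixels to be detected."""
    return pixels_across / feature_mm

def measurement_ppm(tolerance_mm, pixels_per_tolerance=5.0):
    """PPM needed so the tolerance band spans several pixels (assumed default: 5)."""
    return pixels_per_tolerance / tolerance_mm

def station_ppm(feature_mm, pixels_across, tolerance_mm=None,
                pixels_per_tolerance=5.0):
    """Feature size drives PPM when detecting; tolerance can raise it when measuring."""
    ppm = detection_ppm(pixels_across, feature_mm)
    if tolerance_mm is not None:
        ppm = max(ppm, measurement_ppm(tolerance_mm, pixels_per_tolerance))
    return ppm

# The worked example: 0.3 mm feature, 5 px across, 0.10 mm tolerance.
detect_only = station_ppm(0.3, 5)
measuring = station_ppm(0.3, 5, tolerance_mm=0.10)
print(round(detect_only, 1), round(measuring, 1))
```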

Choosing the architecture and vendor

For one or two well-defined checks per station with modest compute, a smart camera (Cognex In-Sight, Keyence) is the fastest path to a working, ruggedized station. For many cameras, heavy or deep-learning compute, or algorithms outside the smart-camera library, go PC-based with VisionPro, Halcon, or OpenCV on an industrial PC, with Basler/FLIR (Teledyne)/Allied Vision cameras on the appropriate interface. Match the camera's sensor (Sony Pregius/Pregius S) and the lens MTF to your resolution class, and budget real engineering time for lighting and calibration — that is where the project succeeds or fails.

Final rule: a vision system is matched to its feature, its tolerance, its surface, and its line speed — there is no universal "best" camera. Do the FoV/PPM math first, spend on lighting and optics, and the algorithm becomes the easy part.

Frequently asked questions

Why is lighting considered the most important part of a vision system? Because lighting determines whether the feature is visible and stable in the image, and no algorithm — classic or neural — can reliably recover a feature the imaging never captured. The right geometry (backlight, dome, dark-field) and wavelength turn a hard problem into a trivial one, while the wrong light injects glare and variability that no software fully fixes. It is also the cheapest fix on the station, which is why experienced engineers treat it as ~80% of the job.

Global shutter or rolling shutter — how do I decide? If the part moves during exposure (conveyor, indexer, robot end-of-arm, web inspection), use global shutter; it freezes every pixel at the same instant. Rolling shutter exposes row by row and skews/smears moving objects. Rolling shutter is acceptable only for fully static scenes. In 2026 the Sony Pregius global-shutter family is the industrial default.

Mono or colour camera? Default to mono. Without a Bayer filter, mono is more sensitive and sharper at the same pixel count — better for measurement, defect detection, and code reading. Use colour only when hue genuinely carries the information you need (colour sorting, print colour QC). Many apparent colour problems are solved better with a mono sensor plus a coloured light or filter.

What is a telecentric lens and when do I need one? A telecentric lens holds magnification constant across its depth of field, so an object that moves toward or away from the lens does not change apparent size and there is no perspective distortion. You need it for dimensional measurement to better than about 1% — gear teeth, pins, machined parts. For locate/inspect/read, where a few percent perspective is harmless, a standard fixed-focal lens is fine and far cheaper. Telecentric lenses are physically large (front element ≥ FoV) and costly.

How do I calculate what camera resolution I need? Work from the feature, not the camera. Decide pixels-across-feature (3–5 px to detect; more, plus sub-pixel, to measure), compute required PPM = pixels / feature_mm, then sensor pixels = PPM × FoV_mm per axis, and round up to a real sensor with margin. For measurement, let the tolerance drive PPM — aim for the tolerance to span many pixels so sub-pixel edge fitting has room.

GigE Vision, USB3 Vision, or CoaXPress: which interface? Compute your data rate (W × H × bytes/px × fps), then pick by bandwidth, cable length, and camera count. GigE (~115 MB/s, 100 m, easy multi-camera) for distance and distributed systems; USB3 (~350 MB/s, ~3–5 m) for a cheap single camera near the PC; CoaXPress (12.5 Gbit/s/lane, ~40 m, needs a frame grabber) for high-speed and high-res, including demanding line scan. Camera Link is the legacy high-speed option being displaced by CXP.
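A rough sketch of that selection logic, using the approximate bandwidth and cable figures quoted in this answer. The 20% headroom, the conservative 3 m USB3 cable limit, and the use of the raw 12.5 Gbit/s line rate for one CoaXPress lane (usable payload is lower) are assumptions; treat the output as a shortlist to check against vendor data, not a final choice.

```python
def data_rate_mb_s(width_px, height_px, bytes_per_px, fps):
    """Sustained camera data rate in MB/s."""
    return width_px * height_px * bytes_per_px * fps / 1e6

# (name, approximate bandwidth in MB/s, approximate max cable length in m)
INTERFACES = [
    ("GigE Vision", 115.0, 100.0),
    ("USB3 Vision", 350.0, 3.0),
    ("CoaXPress (1 lane)", 12.5e9 / 8 / 1e6, 40.0),   # raw line rate, not payload
]

def candidate_interfaces(rate_mb_s, cable_m, headroom=1.2):
    """Interfaces that cover the data rate (with headroom) and the cable run."""
    return [name for name, bandwidth, max_cable in INTERFACES
            if bandwidth >= rate_mb_s * headroom and max_cable >= cable_m]

rate = data_rate_mb_s(2448, 2048, 1, 20)   # hypothetical 5 MP mono camera at 20 fps
print(round(rate, 1), candidate_interfaces(rate, cable_m=2.0))
```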

Smart camera or PC-based vision? Smart cameras (Cognex In-Sight, Keyence) integrate everything in a rugged housing and deploy fastest for one or a few well-defined checks per station. PC-based systems (VisionPro, Halcon, OpenCV on an industrial PC with Basler/FLIR cameras) win on many cameras, heavy or deep-learning compute, and custom algorithms — at the cost of more integration effort.

When should I use deep learning instead of classic vision? Use classic, rules-based tools (blob, edge/caliper, geometric pattern match, OCR, code reading) when you can state the pass/fail rule with numbers: locate, measure, read. Use deep learning when "defect" is hard to define in pixels but easy to show by example: variable cosmetic/surface defects, hard OCR, classification. Deep learning needs a labelled dataset (including enough defect examples) and inference hardware, and you must accept less explainability and possible drift.

What accuracy can I realistically expect from a 2D gauging system? With a telecentric lens, backlight, and proper calibration, repeatability of about ±0.5–5 µm and field accuracy of ±0.01–0.05 mm are achievable on a well-built station, because good edge tools resolve ~1/10–1/40 of a pixel. Quote repeatability and accuracy separately and validate against a traceable standard with gauge R&R. Without telecentric optics, perspective error degrades these quickly.

What is hand-eye calibration? It is finding the rigid transform between the camera's coordinate frame and the robot's, so a pixel location converts into a pick pose in the robot's base frame. You image a calibration target at several known robot poses and solve the AX = XB problem. It applies to both eye-in-hand (camera on the wrist) and eye-to-hand (fixed overhead) setups. See the motion planning and industrial robot arm guides for the kinematics.

Can I do 3D bin-picking with a single 2D camera? No. A single 2D camera gives you x, y, and rotation on a known plane (and at best a coarse 2.5D height). Random parts jumbled in a bin vary in all six degrees of freedom and require 3D vision — structured light, stereo, or ToF point clouds. That is a separate discipline; see the LiDAR & depth cameras guide.

Why does my measurement drift between morning and afternoon? Almost always ambient light or thermal effects. Uncontrolled room light, sunlight through a window, or a nearby machine's lamp changes the scene between shifts; enclose the station and use a strobe to swamp ambient. Thermal expansion of the part, fixture, or lens mount also shifts measurements — let the system warm up, control temperature where you can, and re-validate against a standard periodically.

Table of contents