
DICOM File Format

A DICOM file is a binary stream with a strict layout: a 128-byte preamble, a four-byte magic marker, a File Meta Information Header encoded in Explicit VR Little Endian, and a Data Set encoded according to whichever transfer syntax that header declares. Get the header wrong and the file is unreadable; get the transfer syntax wrong and the pixel data decodes to garbage.

File layout at a glance

Offset      Length      Contents
0           128 bytes   Preamble: application-defined, usually all zeros
128         4 bytes     Magic: DICM (ASCII)
132         variable    File Meta Information Header (group 0002), always Explicit VR Little Endian
after FMI   variable    Data Set, encoding defined by (0002,0010) Transfer Syntax UID

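A Data Set without the 128-byte preamble, the DICM marker, and the File Meta Information is exactly what travels in a DIMSE message on the network, but it is not a valid Part-10 file. The reverse also holds: the Part-10 wrapper is not sent on the wire. If a viewer can open a file from disk but the same bytes on the wire don't decode, this is usually the reason.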
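The layout above makes the first check on any candidate file very cheap. A minimal sketch, assuming the whole file (or at least its first 132 bytes) is already in memory; the function name `has_part10_magic` is illustrative only:

```python
PREAMBLE_LEN = 128
MAGIC = b"DICM"


def has_part10_magic(data: bytes) -> bool:
    """Return True if bytes 128-131 spell DICM, i.e. the Part-10 prefix is present."""
    return data[PREAMBLE_LEN:PREAMBLE_LEN + len(MAGIC)] == MAGIC
```

The preamble itself is deliberately not inspected: it is application-defined, so only the four magic bytes are a reliable signal.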

The File Meta Information Header

Group 0002 — the File Meta Information — is always encoded in Explicit VR Little Endian regardless of what encoding the Data Set uses. It's the "how to read the rest of this file" metadata. The core elements are:

  • (0002,0000) File Meta Information Group Length (UL) — tells you where the Data Set starts
  • (0002,0001) File Meta Information Version (OB) — constant 00 01
  • (0002,0002) Media Storage SOP Class UID (UI) — what kind of object (CT Image, MR Image, SR, …)
  • (0002,0003) Media Storage SOP Instance UID (UI) — unique ID for this object
  • (0002,0010) Transfer Syntax UID (UI) — critical; dictates the Data Set encoding
  • (0002,0012) Implementation Class UID (UI) — what wrote this file
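Because group 0002 is always Explicit VR Little Endian, the Group Length element can be read at a fixed position and used to locate the Data Set. The sketch below assumes a conformant file in which (0002,0000) is the first element after the magic marker; `data_set_offset` is a hypothetical helper name, and files that omit the group length would need a different approach.

```python
import struct


def data_set_offset(data: bytes) -> int:
    """Return the byte offset at which the Data Set begins in a Part-10 file."""
    if data[128:132] != b"DICM":
        raise ValueError("missing DICM marker at offset 128")
    # (0002,0000) in Explicit VR LE: tag (4) + VR "UL" (2) + length (2) + value (4).
    group, element, vr, value_length = struct.unpack_from("<HH2sH", data, 132)
    if (group, element, vr, value_length) != (0x0002, 0x0000, b"UL", 4):
        raise ValueError("expected (0002,0000) UL with a 4-byte value")
    (group_length,) = struct.unpack_from("<I", data, 140)
    # The group length counts the bytes that follow this 12-byte element.
    return 132 + 12 + group_length
```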

Data Set encoding — VR explicit vs implicit

The Data Set is where the actual study metadata and pixel data live. DICOM allows two top-level encodings:

  • Implicit VR Little Endian (UID 1.2.840.10008.1.2): the default. Each element is tag (4 bytes) + length (4 bytes) + value. The reader must look up each tag in the Data Dictionary to know its VR. Smallest on the wire, worst for unknown tags.
  • Explicit VR Little Endian (UID 1.2.840.10008.1.2.1): each element is tag (4 bytes) + VR (2 bytes) + length + value. The length is 2 bytes for most VRs; for OB, OW, SQ, UN, UT and the other long VRs it is 2 reserved bytes followed by a 4-byte length. Self-describing: a reader can process unknown tags without the dictionary.

Most modern PACS negotiate Explicit VR Little Endian for interoperability. Older modalities often default to Implicit.
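The difference between the two encodings shows up in how an element header is read. The following is a sketch rather than a full parser: `read_header` and `LONG_VRS` are names made up for this example, and the set lists the VRs that use the long header form (two reserved bytes plus a 4-byte length).

```python
import struct

# VRs whose Explicit VR header is VR + 2 reserved bytes + 4-byte length.
LONG_VRS = {"OB", "OD", "OF", "OL", "OV", "OW", "SQ",
            "SV", "UC", "UN", "UR", "UT", "UV"}


def read_header(buf: bytes, pos: int, explicit_vr: bool):
    """Return (tag, vr, length, value_offset); vr is None for Implicit VR."""
    tag = struct.unpack_from("<HH", buf, pos)
    if not explicit_vr:
        # Implicit VR: the VR must come from the Data Dictionary, not the stream.
        (length,) = struct.unpack_from("<I", buf, pos + 4)
        return tag, None, length, pos + 8
    vr = buf[pos + 4:pos + 6].decode("ascii")
    if vr in LONG_VRS:
        (length,) = struct.unpack_from("<I", buf, pos + 8)
        return tag, vr, length, pos + 12
    (length,) = struct.unpack_from("<H", buf, pos + 6)
    return tag, vr, length, pos + 8
```

Reading an Implicit VR stream as Explicit (or the reverse) usually fails immediately, because the two bytes after the tag are interpreted as the wrong field.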

Transfer syntaxes and pixel data

For pixel data specifically, the transfer syntax also defines compression. DICOM supports a range:

  • Uncompressed: Implicit VR Little Endian, Explicit VR Little Endian, and the retired Explicit VR Big Endian (there is no Implicit VR Big Endian)
  • JPEG Baseline and Extended: lossy, legacy
  • JPEG Lossless and JPEG 2000 Lossless: the common lossless choices
  • RLE Lossless: run-length encoding, fast but modest compression
  • JPEG-LS Lossless: often the best ratio for 12/16-bit radiographic images
  • MPEG-4 AVC/H.264 and HEVC/H.265: video codecs, for multi-frame ultrasound and endoscopy

Transfer syntax mismatch is a leading cause of "the image won't open" tickets. A typical case: a device accepts only Implicit VR Little Endian (the default transfer syntax that every SCP must accept) but is handed JPEG 2000 data. Association negotiation should catch this, but often doesn't, because the upstream modality sends whatever it prefers.
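When triaging such a ticket, it helps to turn the UID into something readable and to know whether the pixel data is encapsulated (compressed). The table below is a small, deliberately incomplete sketch of well-known UIDs from the DICOM registry; the names `TRANSFER_SYNTAXES` and `describe_transfer_syntax` are assumptions for this example.

```python
# UID -> (name, pixel data is encapsulated/compressed). Not exhaustive.
TRANSFER_SYNTAXES = {
    "1.2.840.10008.1.2": ("Implicit VR Little Endian", False),
    "1.2.840.10008.1.2.1": ("Explicit VR Little Endian", False),
    "1.2.840.10008.1.2.2": ("Explicit VR Big Endian (retired)", False),
    "1.2.840.10008.1.2.4.50": ("JPEG Baseline", True),
    "1.2.840.10008.1.2.4.51": ("JPEG Extended", True),
    "1.2.840.10008.1.2.4.70": ("JPEG Lossless (SV1)", True),
    "1.2.840.10008.1.2.4.80": ("JPEG-LS Lossless", True),
    "1.2.840.10008.1.2.4.90": ("JPEG 2000 Lossless", True),
    "1.2.840.10008.1.2.4.91": ("JPEG 2000", True),
    "1.2.840.10008.1.2.5": ("RLE Lossless", True),
}


def describe_transfer_syntax(uid: str):
    """Return (name, encapsulated) for a UID; (None, None) if not in the table."""
    # UI values are padded to even length with a trailing NUL byte.
    return TRANSFER_SYNTAXES.get(uid.rstrip("\x00 "), (None, None))
```

A `(None, None)` result only means the UID is missing from this short table, not that it is invalid.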

Sequence encoding

Some DICOM attributes are sequences — lists of nested datasets. Think Referenced Image Sequence, Source Image Sequence, Procedure Code Sequence. These use VR SQ and carry either:

  • Defined length sequences — fixed byte count, items delimited by item tags (FFFE,E000)
  • Undefined length sequences — length field is 0xFFFFFFFF, sequence terminated by (FFFE,E0DD)

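Undefined-length sequences are the ones that trip up naive parsers, because the end of the sequence cannot be computed from a length and has to be found by parsing. An item inside the sequence may itself have undefined length, in which case it is closed by an Item Delimitation Item (FFFE,E00D). A robust parser treats that delimiter as the end of one item only and keeps reading items until it reaches the Sequence Delimitation Item (FFFE,E0DD).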
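The delimiter handling can be sketched as a small recursive walker. This example assumes Explicit VR Little Endian and well-formed input (truncated data raises `struct.error`); `walk_items` and `skip_element` are hypothetical names, not part of any library.

```python
import struct

ITEM = (0xFFFE, 0xE000)
ITEM_DELIM = (0xFFFE, 0xE00D)
SEQ_DELIM = (0xFFFE, 0xE0DD)
UNDEFINED = 0xFFFFFFFF
LONG_VRS = {b"OB", b"OD", b"OF", b"OL", b"OV", b"OW", b"SQ",
            b"SV", b"UC", b"UN", b"UR", b"UT", b"UV"}


def skip_element(buf: bytes, pos: int) -> int:
    """Return the offset just past the Explicit VR LE element at pos."""
    vr = buf[pos + 4:pos + 6]
    if vr in LONG_VRS:
        (length,) = struct.unpack_from("<I", buf, pos + 8)
        pos += 12
    else:
        (length,) = struct.unpack_from("<H", buf, pos + 6)
        pos += 8
    if length == UNDEFINED:
        # Nested undefined-length sequence: recurse to find where it ends.
        _, end = walk_items(buf, pos)
        return end
    return pos + length


def walk_items(buf: bytes, pos: int):
    """Walk an undefined-length sequence whose first item header is at pos.

    Returns (items, end): items holds (start, stop) offsets of each item's
    value, end is the offset just past the Sequence Delimitation Item.
    """
    items = []
    while True:
        tag = struct.unpack_from("<HH", buf, pos)
        (length,) = struct.unpack_from("<I", buf, pos + 4)
        pos += 8
        if tag == SEQ_DELIM:
            return items, pos
        if tag != ITEM:
            raise ValueError(f"expected an item tag, got {tag}")
        start = pos
        if length != UNDEFINED:
            pos += length  # defined-length item: jump straight over it
            items.append((start, pos))
        else:
            # Undefined-length item: skip elements until (FFFE,E00D).
            while struct.unpack_from("<HH", buf, pos) != ITEM_DELIM:
                pos = skip_element(buf, pos)
            items.append((start, pos))
            pos += 8  # step over the item delimiter and its zero length
```

Note that the walker never searches for delimiter bytes directly; it parses element by element, since the same byte pattern could legitimately occur inside a value.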

Validating a DICOM file

To sanity-check an unknown .dcm file before you debug anything else:

  1. Confirm bytes 128-131 are DICM
  2. Parse group 0002, read the Transfer Syntax UID
  3. Verify the Transfer Syntax UID is registered — see the TS browser
  4. Try decoding with a known library (DCMTK's dcmdump, pydicom's dcmread, dicomParser)

If step 1 fails, the file was likely written as a raw Data Set (as carried in a DIMSE message) without Part-10 wrapping, or the extension is lying. If step 3 fails on a vendor-specific UID, the device is using a private transfer syntax; look it up in the device's conformance statement.
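Steps 1 to 3 and the two failure cases can be approximated with the standard library alone before reaching for a full toolkit. The sketch below is a triage aid under stated assumptions: `triage` and its status strings are invented for this example, and "registered" is approximated by checking for the DICOM UID root 1.2.840.10008., which is a heuristic rather than a lookup in the actual registry.

```python
import struct

DICOM_ROOT = "1.2.840.10008."


def triage(data: bytes):
    """Return (status, transfer_syntax_uid_or_None) for a candidate file."""
    if data[128:132] != b"DICM":
        return "not-part10", None  # raw Data Set, or not DICOM at all
    pos = 132
    while pos + 8 <= len(data):
        group, element = struct.unpack_from("<HH", data, pos)
        if group != 0x0002:
            break  # left the File Meta Information
        vr = data[pos + 4:pos + 6]
        if vr in (b"OB", b"UN"):
            # Long header form: 2 reserved bytes, then a 4-byte length.
            (length,) = struct.unpack_from("<I", data, pos + 8)
            pos += 12
        else:
            (length,) = struct.unpack_from("<H", data, pos + 6)
            pos += 8
        if element == 0x0010:
            uid = data[pos:pos + length].decode("ascii").rstrip("\x00 ")
            if uid.startswith(DICOM_ROOT):
                return "standard-ts", uid
            return "private-ts", uid  # check the conformance statement
        pos += length
    return "no-transfer-syntax", None
```

A "standard-ts" result is the point at which to hand the file to one of the libraries from step 4 for a full decode.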
