How to Parse FHIR JSON Bundles in Python: Production Clinical ETL Implementation

The hardest part of ingesting a FHIR Bundle is rarely throughput — it is deterministic resource extraction from a heterogeneous JSON envelope without leaking protected health information (PHI) or silently dropping clinical facts. This page solves one narrow, recurring problem: given a paginated searchset bundle that mixes match, include, and outcome entries, how do you stream it in Python, resolve cross-resource references, and project clean rows into an analytical schema? It is the runnable companion to the FHIR resource hierarchy guide, which explains the containment-and-reference graph these parsers must respect; here we turn that topology into working code.

Unlike the positional, pipe-delimited segments described in the HL7 v2 message structure breakdown, a FHIR bundle is a typed JSON tree: every entry carries an explicit resourceType, references form a graph rather than a flat list, and numeric values hide behind value[x] polymorphism. A parser that ignores any of those three facts will compile, run, and quietly corrupt your warehouse.

Bundle.type and entry.search.mode Quick Reference

Two fields decide how every entry is handled: Bundle.type (what kind of bundle this is) and entry.search.mode (why a given entry is present in a searchset). Branch on both before you extract a single value.

`Bundle.type`	Meaning	Parser strategy
`searchset`	Query results, paginated	Follow `link[relation=next]`; track a cursor; expect mixed `search.mode`
`transaction` / `batch`	Write operations	Resolve `fullUrl` placeholders before commit; honor `If-None-Exist`
`document`	Clinical document (Composition first)	Flatten hierarchically from the root `Composition`
`collection`	Loosely grouped resources	No ordering or pagination guarantees; treat as a flat set
`history`	Versioned resource stream	De-duplicate on `resourceType` + `id`, keeping the latest `meta.versionId`
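
The history rule in the last row can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names `_version_key` and `latest_versions` are hypothetical, and the code assumes `meta.versionId` is an integer-like string. FHIR does not require that, so `meta.lastUpdated` is used as a fallback and tie-breaker (compared as plain text, which is only reliable when timestamps share one timezone format). The key combines resource type and id because ids are only unique within a resource type.

```python
from typing import Any, Dict, Iterable, List, Tuple


def _version_key(resource: Dict[str, Any]) -> Tuple[int, str]:
    """Sort key: numeric versionId when possible, then meta.lastUpdated.

    Assumption: versionId looks like an integer. Non-numeric versions
    sort as -1 and fall back to the lastUpdated string.
    """
    meta = resource.get("meta") or {}
    version = str(meta.get("versionId", ""))
    number = int(version) if version.isdigit() else -1
    return (number, str(meta.get("lastUpdated", "")))


def latest_versions(resources: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the newest version of each (resourceType, id) pair."""
    latest: Dict[Tuple[str, str], Dict[str, Any]] = {}
    for res in resources:
        key = (res.get("resourceType", ""), res.get("id", ""))
        current = latest.get(key)
        if current is None or _version_key(res) > _version_key(current):
            latest[key] = res
    return list(latest.values())
```

Comparing versions numerically matters: as strings, "10" would sort before "2" and the older version would win.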

`entry.search.mode`	What it is	What the parser should do
`match`	A primary hit for the query	Project to analytical rows
`include`	Pulled in via `_include` / `_revinclude`	Index for reference resolution; do not count as a result
`outcome`	An `OperationOutcome` (warning/error)	Log severity, route to quarantine, never ingest as clinical data

A common defect in clinical FHIR pipelines is counting include entries as query results, which inflates row counts and double-loads referenced Patient records.
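
A quick way to see the inflation is to tally entries by search.mode and compare the match count with the total entry count. The sketch below works on an already-parsed bundle dictionary; `count_by_search_mode` and the sample data are hypothetical names introduced only for illustration.

```python
from collections import Counter
from typing import Any, Dict


def count_by_search_mode(bundle: Dict[str, Any]) -> Counter:
    """Tally entries per search.mode; entries without one count as 'unspecified'."""
    modes: Counter = Counter()
    for entry in bundle.get("entry", []):
        mode = (entry.get("search") or {}).get("mode", "unspecified")
        modes[mode] += 1
    return modes


sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"search": {"mode": "match"}, "resource": {"resourceType": "Observation", "id": "o1"}},
        {"search": {"mode": "include"}, "resource": {"resourceType": "Patient", "id": "p1"}},
        {"search": {"mode": "outcome"}, "resource": {"resourceType": "OperationOutcome"}},
    ],
}

counts = count_by_search_mode(sample_bundle)
# One true result, three entries: len(entry) overstates the result count.
print(counts["match"], sum(counts.values()))
```

Only the match tally is comparable to the server's reported total; the length of the entry array is not.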

Implementation Pattern: Streaming Parser End to End

The example below validates the bundle root, streams a searchset bundle entry by entry, classifies each entry by search.mode, builds a reference index, and projects PHI-safe Observation rows. It uses ijson so the raw JSON document is never loaded in one piece, and pydantic for a lightweight structural contract. Note that parse_searchset still collects the parsed entries into a list to build the reference index, so peak memory grows with the number of entries. For bundles larger than a few hundred megabytes, prefer a bulk NDJSON export, where each line is an independent resource and no enclosing array must be buffered.

import json
import logging
import hashlib
import decimal
from typing import Any, Dict, Iterator, Optional

import ijson
from pydantic import BaseModel, ValidationError

logger = logging.getLogger("fhir_etl_parser")
logging.basicConfig(format="%(asctime)s | %(levelname)s | %(message)s", level=logging.INFO)

SUPPORTED_BUNDLE_TYPES = {"searchset", "transaction", "batch", "collection", "document", "history"}


class FHIRResourceStub(BaseModel):
    """Minimal structural contract enforced on every entry resource."""
    resourceType: str
    id: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None


def validate_bundle_root(bundle_path: str) -> Optional[str]:
    """Confirm the document is a Bundle and return its declared type (None if absent)."""
    with open(bundle_path, "rb") as f:
        root_type = next(ijson.items(f, "resourceType"), None)
    if root_type != "Bundle":
        raise ValueError(f"Invalid root resourceType: expected 'Bundle', got {root_type!r}")
    with open(bundle_path, "rb") as f:
        bundle_type = next(ijson.items(f, "type"), None)
    if bundle_type not in SUPPORTED_BUNDLE_TYPES:
        logger.warning("Unsupported Bundle.type: %s", bundle_type)
    return bundle_type


def stream_entries(bundle_path: str) -> Iterator[Dict[str, Any]]:
    """Yield {"mode", "fullUrl", "resource"} dicts without buffering the whole file.

    decimal.Decimal preserves clinical measurement precision; float would
    silently round Observation.valueQuantity values.
    """
    with open(bundle_path, "rb") as f:
        for entry in ijson.items(f, "entry.item", use_float=False):
            resource = entry.get("resource")
            if resource is None:
                logger.warning("Skipping entry with no 'resource' block")
                continue
            try:
                FHIRResourceStub.model_validate(resource)
            except ValidationError as exc:
                logger.error("Entry failed structural validation: %s", exc)
                continue
            search_mode = (entry.get("search") or {}).get("mode", "match")
            full_url = entry.get("fullUrl")
            yield {"mode": search_mode, "fullUrl": full_url, "resource": resource}


def build_reference_index(entries: list[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Index every resource by fullUrl and by Type/id for O(1) resolution."""
    index: Dict[str, Dict[str, Any]] = {}
    for item in entries:
        res = item["resource"]
        res_id = res.get("id")
        full_url = item.get("fullUrl") or (f"urn:uuid:{res_id}" if res_id else None)
        if full_url:
            index[full_url] = res
        if res_id:
            index[f"{res['resourceType']}/{res_id}"] = res
    return index


def phi_safe_token(value: str) -> str:
    """Deterministic join token. Unkeyed SHA-256 of a short id can be brute-forced; use a keyed HMAC in production."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


def project_observation(index: Dict[str, Dict[str, Any]], obs: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Flatten one Observation into an analytical row, resolving its subject."""
    subject_ref = (obs.get("subject") or {}).get("reference")
    patient = index.get(subject_ref) if subject_ref else None
    patient_token = phi_safe_token(patient["id"]) if patient and patient.get("id") else None

    # value[x] polymorphism: the numeric value lives under a typed key.
    value, unit = None, None
    if "valueQuantity" in obs:
        value = obs["valueQuantity"].get("value")
        unit = obs["valueQuantity"].get("unit")
    elif "valueString" in obs:
        value = obs["valueString"]

    coding = (obs.get("code") or {}).get("coding") or [{}]
    return {
        "observation_id": obs.get("id"),
        "patient_token": patient_token,
        "code_system": coding[0].get("system"),
        "code": coding[0].get("code"),
        "value": value,
        "unit": unit,
        "effective": obs.get("effectiveDateTime"),
    }


def parse_searchset(bundle_path: str) -> list[Dict[str, Any]]:
    validate_bundle_root(bundle_path)
    entries = list(stream_entries(bundle_path))           # materializes every parsed entry; memory grows with entry count
    index = build_reference_index(entries)                # includes + matches both indexed
    rows = []
    for item in entries:
        if item["mode"] == "outcome":
            logger.warning("OperationOutcome in searchset; routing to quarantine")
            continue
        if item["mode"] != "match":
            continue                                      # 'include' entries resolve refs only
        res = item["resource"]
        if res["resourceType"] == "Observation":
            row = project_observation(index, res)
            if row:
                rows.append(row)
    return rows

Three design choices carry the load. First, use_float=False (and, for json.load paths, parse_float=decimal.Decimal) keeps lab values as exact decimals instead of binary floats. Second, both match and include entries are indexed, but only match entries become rows, so a referenced Patient is resolvable without being mistaken for a query hit. Third, phi_safe_token replaces the Patient id with a stable join key, so the raw id never appears in a projected row. Because that hash is unkeyed, treat it as a placeholder for a keyed scheme rather than a de-identification guarantee.
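
One caveat on the listing above: parse_searchset calls list(stream_entries(...)), so every parsed entry stays in memory at once. If that is a problem, a two-pass variant is one option. The sketch below is an assumption-laden illustration, not the article's method: `iter_observation_rows` and the `open_entries` factory are hypothetical names, the rows are simplified to two fields, and it assumes reference targets arrive as include entries. With the parser above, the factory could be `lambda: stream_entries(path)`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Entry = Dict[str, Any]


def iter_observation_rows(open_entries: Callable[[], Iterable[Entry]]) -> Iterator[Entry]:
    """Two-pass projection: index include entries first, then stream match rows.

    open_entries is a hypothetical factory that returns a fresh iterator of
    {"mode", "fullUrl", "resource"} dicts each time it is called.
    """
    # Pass 1: keep only 'include' resources, which are the reference targets.
    index: Dict[str, Entry] = {}
    for item in open_entries():
        if item["mode"] != "include":
            continue
        res = item["resource"]
        if item.get("fullUrl"):
            index[item["fullUrl"]] = res
        if res.get("id"):
            index[f"{res['resourceType']}/{res['id']}"] = res

    # Pass 2: yield one row per matching Observation; nothing is accumulated.
    for item in open_entries():
        res = item["resource"]
        if item["mode"] != "match" or res.get("resourceType") != "Observation":
            continue
        ref = (res.get("subject") or {}).get("reference")
        yield {"observation_id": res.get("id"), "subject_resolved": ref in index}
```

The trade-off is reading the file twice in exchange for holding only the include set in memory, which is usually far smaller than the match set.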

For converting the resolved rows into typed model objects rather than dictionaries, see using FHIR resource libraries for Python ETL; for normalizing value[x], dates, and quantities into a consistent warehouse schema, see type coercion for clinical data types.

Following Pagination

A searchset is rarely a single file. Drive the cursor off the next link and stop only when it is absent:

import decimal
from typing import Any, Dict, Iterator

import requests

def iterate_pages(start_url: str, headers: Dict[str, str]) -> Iterator[Dict[str, Any]]:
    url = start_url
    while url:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        bundle = resp.json(parse_float=decimal.Decimal)
        yield bundle
        url = next(
            (l["url"] for l in bundle.get("link", []) if l.get("relation") == "next"),
            None,
        )

Validation and Testing

Parsers fail silently, so assert on shape, not just absence of exceptions. Build a small golden bundle and pin the expected output:

def test_searchset_drops_includes_and_outcomes(tmp_path):
    golden = {
        "resourceType": "Bundle",
        "type": "searchset",
        "entry": [
            {"fullUrl": "urn:uuid:p1", "search": {"mode": "include"},
             "resource": {"resourceType": "Patient", "id": "p1"}},
            {"search": {"mode": "match"},
             "resource": {"resourceType": "Observation", "id": "o1",
                          "subject": {"reference": "urn:uuid:p1"},
                          "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
                          "valueQuantity": {"value": 72, "unit": "/min"}}},
            {"search": {"mode": "outcome"},
             "resource": {"resourceType": "OperationOutcome",
                          "issue": [{"severity": "warning", "code": "incomplete"}]}},
        ],
    }
    path = tmp_path / "bundle.json"
    path.write_text(json.dumps(golden))

    rows = parse_searchset(str(path))

    assert len(rows) == 1                      # only the 'match' Observation
    assert rows[0]["code"] == "8867-4"
    assert rows[0]["value"] == 72
    assert rows[0]["patient_token"] is not None  # include entry resolved the subject
    assert all("p1" != r["patient_token"] for r in rows)  # raw id never leaks

For a quick CLI smoke check against a real export, count resource types before trusting the parser:

python -c "import ijson,sys,collections; \
c=collections.Counter(e['resource']['resourceType'] \
for e in ijson.items(open(sys.argv[1],'rb'),'entry.item')); \
print(c)" export.json

A sudden swing in that distribution (for example, an OperationOutcome count above zero, or a missing Patient type) is often an early signal of an upstream server problem.

Gotchas and Compliance Constraints

value[x] polymorphism drops data. Reading a hard-coded valueQuantity discards every Observation whose value is a valueString, valueCodeableConcept, or valueBoolean. Branch on the concrete key present, and route the coercion through a single typed mapping so the rule lives in one place.
transaction/batch references are placeholders, not ids. Inside a write bundle, entries reference each other through fullUrl URNs that only become server ids after commit. Resolve against Bundle.entry.fullUrl first, and honor If-None-Exist conditional references so a retry does not create the same logical record twice. Contained resources (addressed by an internal #id fragment) vanish if you split the parent before resolving them.
PHI must never reach logs or analytics in the clear. Log only resourceType, id, and a structural hash, never the raw resource. Tokenize direct identifiers (MRN, SSN, DOB) at the ingestion boundary with a deterministic, keyed hash so HIPAA Safe Harbor or Expert Determination de-identification holds before data lands. Enforce TLS 1.2+ on every ingestion endpoint and write ingestion manifests to append-only storage.
Structurally valid codes can still be semantically wrong. Pin the system version and resolve each Coding against a FHIR terminology server so that a SNOMED CT release does not silently break historical joins to ICD-10.
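
The keyed hash mentioned in the PHI item could look like the sketch below. It uses the standard library's hmac module with SHA-256; the function name `keyed_phi_token` and the 32-character truncation are illustrative assumptions, and how the secret key is stored and rotated (a secrets manager, for instance) is deliberately left out.

```python
import hashlib
import hmac


def keyed_phi_token(value: str, key: bytes) -> str:
    """Deterministic keyed token: the same value and key always give the same token.

    Without the key, candidate identifiers cannot be hashed and compared,
    which is the weakness of a plain unkeyed SHA-256.
    """
    if not key:
        raise ValueError("a non-empty secret key is required")
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:32]  # hypothetical truncation; keep the full digest if width allows
```

Tokens produced under different keys do not join with each other, so a key rotation needs a planned re-tokenization of historical rows.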

Troubleshooting

My Observation rows lose their numeric values after parsing. Why?

FHIR uses value[x] polymorphism, so the value sits under a typed key (valueQuantity, valueString, valueCodeableConcept, …). Code that reads a fixed valueQuantity silently drops every other variant. Branch on the concrete key present and coerce per type.
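
A minimal sketch of that branching is shown below. `extract_value` is a hypothetical helper: it special-cases valueQuantity and valueCodeableConcept and passes any other value[x] variant through unchanged, so nothing is silently dropped. It does not handle Observation.component values or dataAbsentReason, which a production mapping would also need.

```python
from typing import Any, Dict, Optional, Tuple


def extract_value(obs: Dict[str, Any]) -> Tuple[Optional[str], Any, Optional[str]]:
    """Return (value_type, value, unit) for whichever value[x] key is present."""
    if "valueQuantity" in obs:
        quantity = obs["valueQuantity"] or {}
        return "Quantity", quantity.get("value"), quantity.get("unit") or quantity.get("code")
    if "valueCodeableConcept" in obs:
        concept = obs["valueCodeableConcept"] or {}
        coding = concept.get("coding") or [{}]
        return "CodeableConcept", coding[0].get("code") or concept.get("text"), None
    # Any other variant (valueString, valueBoolean, valueInteger, ...) is
    # passed through untouched so it is recorded rather than dropped.
    for key, raw in obs.items():
        if key.startswith("value"):
            return key[len("value"):], raw, None
    return None, None, None
```

Returning the type alongside the value lets the warehouse keep numeric and coded results in separate typed columns, and a False boolean is kept distinct from a missing value.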

I get dangling references when parsing a transaction bundle.

Entries reference each other through fullUrl placeholders that only become real ids after the server assigns them. Index by entry.fullUrl before resolving, and honor If-None-Exist so retries do not duplicate records.
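
To find the offending references before committing, one approach is to collect every declared fullUrl and then flag placeholder references that no entry declares. The sketch below is illustrative: `iter_references` and `find_dangling_placeholders` are hypothetical helpers, and the code assumes any string stored under a key named "reference" is a FHIR reference. References to resources that already exist on the server (such as Practitioner/77) are intentionally not flagged.

```python
from typing import Any, Dict, Iterator, List

PLACEHOLDER_PREFIXES = ("urn:uuid:", "urn:oid:")


def iter_references(node: Any) -> Iterator[str]:
    """Yield every string stored under a 'reference' key anywhere below node."""
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str):
            yield ref
        for child in node.values():
            yield from iter_references(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_references(child)


def find_dangling_placeholders(bundle: Dict[str, Any]) -> List[str]:
    """List placeholder references that no entry declares as its fullUrl."""
    entries = bundle.get("entry", [])
    declared = {e["fullUrl"] for e in entries if e.get("fullUrl")}
    dangling = []
    for entry in entries:
        for ref in iter_references(entry.get("resource") or {}):
            if ref.startswith(PLACEHOLDER_PREFIXES) and ref not in declared:
                dangling.append(ref)
    return dangling
```

Running this check before submission turns a confusing server-side rejection into a specific list of references to repair.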

My parser runs out of memory on large exports.

json.load buffers the whole document. Iterate with ijson.items(f, "entry.item") on a binary handle, ensure no downstream step accumulates the full list, and for very large feeds switch the source to a bulk NDJSON export so each line is independent.
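
For the NDJSON route, each line parses on its own, so the standard json module is enough. The sketch below is a minimal reader under that assumption; `iter_ndjson` is a hypothetical name, and the error message reports only the line number so raw content (which may contain PHI) never reaches a log or traceback message.

```python
import json
from decimal import Decimal
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one resource per non-empty line, keeping decimals exact."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line, parse_float=Decimal)
        except json.JSONDecodeError as exc:
            # Report position only; the line itself may hold PHI.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
```

An open text file handle can be passed directly as `lines`, since iterating a file yields one line at a time and memory use stays bounded by the largest single resource.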

Row counts look inflated compared to the server's total.

You are counting _include/_revinclude entries as results. Filter on entry.search.mode == "match" for projection, and use include entries only to populate the reference index.

FHIR resource hierarchy explained — the containment-and-reference topology this parser walks (parent guide).
FHIR REST vs Bulk Data Export — choosing the transport that feeds the parser.
Type coercion for clinical data types — normalizing the value[x], date, and quantity values this parser extracts.
Using FHIR resource libraries for Python ETL — typed models instead of raw dictionaries.