Scaling FHIR Batch Processing with Apache Airflow

Multi-gigabyte FHIR Bundle exports break naive Airflow DAGs in three predictable ways: the scheduler tries to push payloads through XCom and saturates the metadata database, workers json.load() an entire bundle and get OOM-killed, and retried tasks re-create resources instead of upserting them. This page sits within Async Batch Processing for Large Datasets, part of the broader Clinical Data Parsing & Transformation Workflows pipeline, and gives the Airflow-specific implementation: streaming the bundle into object storage, fanning out with dynamic task mapping, and committing idempotent conditional PUTs — all under PHI-isolation controls that keep clinical payloads out of the scheduler database, the UI, and task logs.

The precise problem this page solves: how do you orchestrate ingestion of a 10 GB+ FHIR bundle on Airflow so that memory stays flat, parallelism is bounded by pool quotas rather than by how big the file is, and a re-run produces exactly the same target state as a first run?

Airflow Configuration — Quick Reference

The single most useful artifact for this topic is the configuration map. Clinical batch failures are almost always a misconfigured executor, XCom backend, or pool — not application logic. Set these before writing DAG code.

Setting	Recommended value	Why it matters for clinical batch
`executor`	`CeleryExecutor` or `KubernetesExecutor`	The `LocalExecutor` cannot isolate PHI workloads onto dedicated, network-restricted nodes.
`xcom_backend`	Custom S3/GCS backend (`OPT_*` adapter)	Default XCom serializes return values into the metadata DB; bundle manifests must live in object storage, not Postgres.
Pool (`clinical_etl_pool`)	Slots sized to PHI worker count	Caps concurrent PHI tasks regardless of how many chunks the bundle produces — this is the backpressure knob.
`worker_concurrency` (Celery)	`8` with `worker_prefetch_multiplier=1`	Prefetch > 1 lets one worker reserve many large chunks and bloat memory.
`max_active_tis_per_dag` (mapped task)	`16`	Bounds fan-out of `expand()` so a 20k-chunk bundle does not stampede the FHIR server.
`AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG`	`50`	Prevents a single ingestion DAG from starving the scheduler.
`retries` + `retry_exponential_backoff`	`5`, `True`, `max_retry_delay=10m`	Survives transient `429`/`5xx` from the FHIR server without a retry storm.
`execution_timeout` (per task)	`2h`	Bounds a stuck chunk; pair with checkpointing so retries resume, not restart.
`hide_sensitive_var_conn_fields`	`True`	Stops connection/variable PHI-adjacent values from rendering in the UI.
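
To put the retry row in wall-clock terms, the following is a simplified sketch of a doubling backoff capped at `max_retry_delay`. It is an approximation only: Airflow also applies jitter to each computed delay, so real waits will differ, and `backoff_schedule` is a hypothetical helper rather than an Airflow API.

```python
from datetime import timedelta


def backoff_schedule(retries: int, retry_delay: timedelta,
                     max_retry_delay: timedelta) -> list[timedelta]:
    """Approximate wait before each retry: base delay doubled per attempt, then capped.

    Assumption: plain doubling without the jitter Airflow adds in practice.
    """
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)
        delays.append(min(delay, max_retry_delay))
    return delays


# Values taken from the table: 5 retries, 2-minute base, 10-minute cap.
schedule = backoff_schedule(5, timedelta(minutes=2), timedelta(minutes=10))
total = sum(schedule, timedelta())
print([int(d.total_seconds() // 60) for d in schedule], total)
```

Under these assumptions a chunk that fails every attempt waits roughly half an hour in total before the task is marked failed, which is the window the FHIR server has to recover from a `429`/`5xx` episode.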

The architectural rule behind the table: Airflow is an orchestration layer, never a data-processing engine. Only references (S3 keys, counts, hashes) may cross XCom; payloads stay in encrypted object storage. The DAG below reads a single Bundle JSON document; NDJSON produced by a bulk data export fits the same pattern, but is read one resource per line rather than through Bundle.entry.

Implementation Pattern — End-to-End Ingestion DAG

The DAG below targets Airflow 2.7+; the FHIR client call inside upsert is left as a commented stub to fill in for your server. It streams the bundle with ijson (so peak memory is bounded by chunk size, not file size), writes fixed-size chunk manifests to S3, returns only the keys through XCom, then dynamically maps the transform and upsert stages. Resource IDs are derived deterministically so a re-run upserts in place.

from __future__ import annotations

import hashlib
from datetime import datetime, timedelta

import ijson
import orjson
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "clinical-etl-staging"
CHUNK_SIZE = 500
VOLATILE = {"id", "meta"}  # non-deterministic fields excluded from the id hash


def deterministic_id(resource_type: str, resource: dict) -> str:
    """Stable logical id: identical clinical content always hashes the same,
    so re-exports and task retries upsert in place instead of duplicating.
    The 64-char SHA-256 hex digest fits FHIR's 64-char id limit exactly."""
    canonical = orjson.dumps(
        {k: v for k, v in resource.items() if k not in VOLATILE},
        option=orjson.OPT_SORT_KEYS,
    )
    return hashlib.sha256(resource_type.encode() + b"|" + canonical).hexdigest()


@dag(
    dag_id="fhir_bundle_ingestion_v2",
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=3,
    default_args={
        "retries": 5,
        "retry_delay": timedelta(minutes=2),
        "retry_exponential_backoff": True,
        "max_retry_delay": timedelta(minutes=10),
        "execution_timeout": timedelta(hours=2),
        "pool": "clinical_etl_pool",
    },
    access_control={
        "Data Engineering": {"can_read", "can_edit"},
        "Compliance": {"can_read"},
    },
    tags=["fhir", "phi", "compliance:hipaa"],
)
def fhir_etl_pipeline():

    @task
    def split_bundle(bundle_path: str) -> list[str]:
        """Stream a multi-GB Bundle into fixed-size chunk manifests in S3.

        ijson emits one entry at a time (SAX-style), so peak memory is one
        chunk, not the whole bundle. Only the list of keys crosses XCom.
        """
        s3 = S3Hook(aws_conn_id="aws_default")
        keys: list[str] = []
        buf: list[dict] = []
        idx = 0

        def flush() -> None:
            nonlocal idx
            idx += 1
            key = f"chunks/manifest_{idx:05d}.json"
            s3.load_bytes(orjson.dumps(buf), key, BUCKET, replace=True)
            keys.append(key)
            buf.clear()

        with open(bundle_path, "rb") as fh:
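            # use_float=True: ijson yields Decimal by default, which orjson cannot
            # serialize. Floats do not preserve trailing zeros (5.40 becomes 5.4).
            for entry in ijson.items(fh, "entry.item", use_float=True):  # top-level Bundle.entry[]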
                buf.append(entry)
                if len(buf) >= CHUNK_SIZE:
                    flush()
        if buf:
            flush()
        return keys

    @task(max_active_tis_per_dag=16)
    def process_chunk(chunk_key: str) -> dict:
        """Transform one chunk into deterministic conditional-PUT operations.

        The operations are written back to S3; only the key and a count
        cross XCom, so no clinical payload reaches the metadata database.
        """
        s3 = S3Hook(aws_conn_id="aws_default")
        raw = s3.get_key(chunk_key, bucket_name=BUCKET).get()["Body"].read()
        ops = []
        for entry in orjson.loads(raw):
            resource = entry.get("resource", {})
            rtype = resource.get("resourceType")
            if not rtype:
                continue  # incomplete entry: caught by validation, not retried
            logical_id = resource.get("id") or deterministic_id(rtype, resource)
            ops.append({
                "method": "PUT",
                "url": f"{rtype}/{logical_id}",
                "resource": {**resource, "id": logical_id},
            })
        # Persist the operations next to the chunk instead of returning them.
        ops_key = chunk_key.replace("chunks/", "ops/", 1)
        s3.load_bytes(orjson.dumps(ops), ops_key, BUCKET, replace=True)
        return {"chunk_key": chunk_key, "ops_key": ops_key, "count": len(ops)}

    @task
    def upsert(batch: dict) -> None:
        """Conditional PUT each op with an If-Match ETag. Task success is the
        commit point: a retried task safely re-applies the identical writes."""
        s3 = S3Hook(aws_conn_id="aws_default")
        raw = s3.get_key(batch["ops_key"], bucket_name=BUCKET).get()["Body"].read()
        ops = orjson.loads(raw)
        # FhirClient is a placeholder for your FHIR client; enforce TLS 1.3.
        # client = FhirClient(base_url=..., verify=True)
        # for op in ops:
        #     client.put(op["url"], json=op["resource"])
        ...

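    # TaskFlow arguments are templated, so the path comes from the run's conf
    # (for example: --conf '{"bundle_path": "/fixtures/sample_bundle.json"}').
    keys = split_bundle("{{ dag_run.conf['bundle_path'] }}")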
    upsert.expand(batch=process_chunk.expand(chunk_key=keys))


fhir_etl_pipeline()

The HL7 v2 path is the same shape with a different transform: extract OBX-3 → Observation.code and OBX-5 → Observation.value[x] before building the conditional PUT, using the field positions in the HL7 v2 message structure breakdown. Whichever the source, coded values should be confirmed against active value sets via a FHIR terminology server before upsert, never against a hardcoded map.
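
As a rough illustration of that HL7 v2 transform, the sketch below maps one OBX segment to an Observation skeleton. It makes several simplifying assumptions that are not in the original pipeline: default `|` and `^` delimiters, no escape-sequence or repetition handling, only LOINC (`LN`) mapped to a system URI, and a hardcoded `status`. The function name `obx_to_observation` is hypothetical.

```python
# Assumption: only the LOINC coding-system code is mapped in this sketch.
SYSTEM_URIS = {"LN": "http://loinc.org"}


def obx_to_observation(segment: str) -> dict:
    """Map OBX-3 to Observation.code and OBX-5 to Observation.value[x]."""
    fields = segment.split("|")
    if fields[0] != "OBX":
        raise ValueError("not an OBX segment")
    fields += [""] * (7 - len(fields))  # pad so optional fields index safely
    # After splitting, fields[n] holds OBX-n (fields[0] is the segment name).
    value_type, identifier, value, units = fields[2], fields[3], fields[5], fields[6]

    code, display, system = (identifier.split("^") + ["", "", ""])[:3]
    coding = {"code": code}
    if display:
        coding["display"] = display
    if system in SYSTEM_URIS:
        coding["system"] = SYSTEM_URIS[system]

    observation = {
        "resourceType": "Observation",
        "status": "final",  # assumption: a real mapping would derive this from OBX-11
        "code": {"coding": [coding]},
    }
    if value_type == "NM":  # numeric value: emit valueQuantity
        quantity = {"value": float(value)}
        unit = units.split("^")[0]
        if unit:
            quantity["unit"] = unit
        observation["valueQuantity"] = quantity
    else:  # everything else falls back to a string in this sketch
        observation["valueString"] = value
    return observation


obs = obx_to_observation("OBX|1|NM|2345-7^Glucose^LN||5.4|mmol/L|||||F")
print(obs["code"]["coding"][0]["code"], obs["valueQuantity"])
```

The resulting dict can be passed through the same deterministic-ID and PUT construction as a Bundle entry; the terminology check described above would still run before the upsert.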

Validation & Testing

Correctness here is verifiable, not aspirational. Two properties matter: chunking must be lossless, and IDs must be replay-invariant.

Use Airflow’s built-in dags test to run the DAG end-to-end against a fixture bundle without a scheduler, then assert that the per-chunk counts sum to the bundle's entry count and that the number of chunks equals ceil(entries / CHUNK_SIZE):

airflow dags test fhir_bundle_ingestion_v2 2026-06-26 \
  --conf '{"bundle_path": "/fixtures/sample_bundle.json"}'
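
The lossless-chunking property can also be pinned without Airflow or S3. The sketch below reimplements the buffering logic of split_bundle as a plain generator (`chunked` is a hypothetical stand-in, not part of the DAG) and checks the invariants against synthetic entries.

```python
import math

CHUNK_SIZE = 500


def chunked(entries, size):
    """Yield lists of at most `size` entries, preserving order."""
    buf = []
    for entry in entries:
        buf.append(entry)
        if len(buf) >= size:
            yield buf
            buf = []
    if buf:  # final partial chunk
        yield buf


# Synthetic fixture: 1234 entries is deliberately not a multiple of CHUNK_SIZE.
entries = [{"resource": {"resourceType": "Observation", "id": str(i)}}
           for i in range(1234)]
chunks = list(chunked(entries, CHUNK_SIZE))

assert len(chunks) == math.ceil(len(entries) / CHUNK_SIZE)   # chunk count
assert sum(len(c) for c in chunks) == len(entries)           # nothing dropped
assert [e for c in chunks for e in c] == entries             # order preserved
```

Using an entry count that is not a multiple of the chunk size exercises the final partial flush, which is the branch most likely to lose records.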

Pin the deterministic-ID contract with a unit test — the same resource, with differing volatile metadata, must produce one stable ID:

def test_id_is_replay_invariant():
    obs = {
        "resourceType": "Observation",
        "identifier": [{"system": "urn:lab:acme", "value": "OBS-1"}],
        "valueQuantity": {"value": 5.4, "unit": "mmol/L"},
    }
    a = deterministic_id("Observation", obs)
    b = deterministic_id("Observation", {**obs, "meta": {"lastUpdated": "now"}})
    assert a == b              # volatile metadata must not change the id
    assert len(a) == 64        # full digest fits FHIR's id length limit

For production reconciliation, run a golden-dataset check on every run: compare source Bundle.entry counts against target record counts grouped by resourceType, and flag any logical ID written more than once with differing content — that is a pipeline defect, not a duplicate input. The deeper key-construction rules live in implementing idempotent clinical data loads.
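
A minimal sketch of that reconciliation check, assuming the source resources and the written `(logical_id, resource)` pairs are available as in-memory iterables; in production they would come from the chunk manifests and the target store. `reconcile` and `content_hash` are hypothetical helpers.

```python
import hashlib
import json
from collections import Counter, defaultdict


def content_hash(resource: dict) -> str:
    """Digest of a resource's content, ignoring id and meta."""
    body = {k: v for k, v in resource.items() if k not in {"id", "meta"}}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def reconcile(source_resources, written):
    """Return (count mismatches by resourceType, ids written with differing content)."""
    source_counts = Counter(r["resourceType"] for r in source_resources)

    hashes_by_id = defaultdict(set)
    for logical_id, resource in written:
        hashes_by_id[(resource["resourceType"], logical_id)].add(content_hash(resource))

    target_counts = Counter(rtype for rtype, _ in hashes_by_id)
    mismatches = {
        rtype: (source_counts[rtype], target_counts[rtype])
        for rtype in source_counts.keys() | target_counts.keys()
        if source_counts[rtype] != target_counts[rtype]
    }
    conflicts = sorted(key for key, hashes in hashes_by_id.items() if len(hashes) > 1)
    return mismatches, conflicts


source = [
    {"resourceType": "Observation", "id": "a", "status": "final"},
    {"resourceType": "Observation", "id": "b", "status": "final"},
    {"resourceType": "Patient", "id": "p"},
]
written = [
    ("a", source[0]),
    ("a", {**source[0], "status": "amended"}),  # same id, different content
    ("p", source[2]),
]
mismatches, conflicts = reconcile(source, written)
print(mismatches, conflicts)
```

Here the missing Observation shows up as a count mismatch, and the id written twice with different content is reported separately, since the two findings point to different defects.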

Gotchas & Compliance Constraints

XCom is not a data bus. The most common failure is returning bundle payloads (or intermediate dicts) from a task. Airflow serializes XCom into the metadata database by default, so large returns cause DB write timeouts and worker ephemeral-storage exhaustion long before the FHIR server is touched. Return only keys and counts; persist payloads in object storage and configure a remote xcom_backend. Symptom to watch for: workers killed with SIGKILL (exit code 137) plus metadata-DB latency spikes during split_bundle.
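
One way to enforce the rule is a guard that each task calls on its return value. The sketch below is an assumption-laden illustration: `assert_reference_only` and the 48 KiB budget are hypothetical, not Airflow features, and the check for `resourceType` is only a heuristic for spotting embedded FHIR resources.

```python
import json

MAX_XCOM_BYTES = 48 * 1024  # hypothetical budget; tune to your metadata DB


def contains_fhir_resource(value) -> bool:
    """Heuristic: any nested dict carrying a resourceType key is a payload."""
    if isinstance(value, dict):
        return "resourceType" in value or any(
            contains_fhir_resource(v) for v in value.values()
        )
    if isinstance(value, (list, tuple)):
        return any(contains_fhir_resource(v) for v in value)
    return False


def assert_reference_only(value, max_bytes: int = MAX_XCOM_BYTES):
    """Raise if a task return value looks like clinical payload rather than a reference."""
    if contains_fhir_resource(value):
        raise ValueError("XCom value embeds a FHIR resource; return its storage key instead")
    size = len(json.dumps(value).encode())
    if size > max_bytes:
        raise ValueError(f"XCom value is {size} bytes; limit is {max_bytes}")
    return value


ok = assert_reference_only({"chunk_key": "chunks/manifest_00001.json", "count": 500})
print(ok)
```

Failing fast inside the task turns a slow metadata-DB degradation into an immediate, attributable task failure.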

hash() is unusable for idempotency. Python’s built-in hash() is salted per process (PYTHONHASHSEED), so the same resource produces different values across workers and across retries — generating a new resource every run. Always derive logical IDs from a stable digest (hashlib.sha256 over canonicalized bytes), and keep the full 64-character digest: truncating to save index space reintroduces birthday-bound collision risk, and a collision here means one patient’s observation overwrites another’s. Canonicalization must be byte-stable, which depends on consistent type coercion for clinical data types — 5.40 vs 5.4 or an offset-less timestamp hashes two ways and bypasses the upsert.
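
The points above can be checked with a dependency-free sketch. It uses the standard library's `json` for canonicalization, so its digests are not interchangeable with those of the orjson-based deterministic_id; `stable_id` and `collision_probability` are hypothetical names, and the collision figure is the usual birthday-bound approximation, not an exact probability.

```python
import hashlib
import json


def stable_id(resource_type: str, resource: dict) -> str:
    """SHA-256 over sorted-key, compact JSON, excluding id and meta."""
    body = {k: v for k, v in resource.items() if k not in {"id", "meta"}}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(resource_type.encode() + b"|" + canonical).hexdigest()


def collision_probability(n_ids: int, hex_chars: int) -> float:
    """Birthday-bound approximation: p is about n^2 / 2^(bits + 1)."""
    bits = hex_chars * 4
    return n_ids * n_ids / 2 ** (bits + 1)


a = {"resourceType": "Observation", "valueQuantity": {"value": 5.4, "unit": "mmol/L"}}
b = {"valueQuantity": {"unit": "mmol/L", "value": 5.4}, "resourceType": "Observation"}
assert stable_id("Observation", a) == stable_id("Observation", b)  # key order is irrelevant

# The same measurement carried as a string hashes differently: coercion must happen first.
c = {"resourceType": "Observation", "valueQuantity": {"value": "5.40", "unit": "mmol/L"}}
assert stable_id("Observation", a) != stable_id("Observation", c)

# 100 million ids: a digest truncated to 16 hex chars versus the full 64.
print(collision_probability(100_000_000, 16), collision_probability(100_000_000, 64))
```

At 100 million resources, a 16-character truncation gives a collision chance on the order of one in a few thousand, while the full digest keeps it negligible.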

The whole DAG is in HIPAA scope, including logs and quarantine. Never log raw payloads, and never inline PHI into a dead-letter record — publish a hashed reference and store the raw chunk in encrypted (AES-256), WORM-protected object storage with least-privilege IAM. Emit structured audit logs keyed on dag_id, task_id, run_id, resource_type, and operation_hash, explicitly excluding Patient.identifier, Practitioner.name, and Encounter.location. Enforce TLS 1.3 on every FHIR call, route the PHI pool to availability zones that match jurisdictional residency requirements, and set hide_sensitive_var_conn_fields = True so connection metadata never renders in the UI.
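
A minimal sketch of such an audit record, built from an allow-list so that no resource field can leak into the log by accident. `audit_record` is a hypothetical helper, and the sample values are made up for illustration.

```python
import hashlib
import json

# Only these keys are ever emitted; nothing is copied from the resource body.
AUDIT_FIELDS = ("dag_id", "task_id", "run_id", "resource_type", "operation_hash")


def audit_record(dag_id: str, task_id: str, run_id: str, op: dict) -> str:
    """Return one JSON log line describing an operation without its payload."""
    canonical = json.dumps(op, sort_keys=True, separators=(",", ":")).encode()
    record = {
        "dag_id": dag_id,
        "task_id": task_id,
        "run_id": run_id,
        "resource_type": op["resource"]["resourceType"],
        "operation_hash": hashlib.sha256(canonical).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


op = {
    "method": "PUT",
    "url": "Patient/abc",
    "resource": {"resourceType": "Patient", "id": "abc", "name": [{"family": "Doe"}]},
}
line = audit_record("fhir_bundle_ingestion_v2", "upsert", "manual__1", op)
print(line)
```

An allow-list is used here rather than stripping known PHI fields, because a deny-list silently passes any field nobody thought to exclude.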

For dynamic-mapping semantics consult the official Dynamic Task Mapping documentation, and align payloads with the HL7 FHIR Bundle Specification for downstream interoperability.

Related

Async Batch Processing for Large Datasets — parent topic: the bounded-concurrency worker model this DAG orchestrates.
Implementing Idempotent Clinical Data Loads — sibling guide on deterministic key construction and replay invariance.
Using fhir.resources for Python ETL — R4 model validation for the transform step inside process_chunk.
FHIR REST vs Bulk Data Export — where the NDJSON/bundle input this DAG consumes comes from.
