I spent a few weeks building a small but runnable reference for ingestion-plus-transformation on
Databricks, and the most useful thing I learned wasn’t architectural — it was that two of the tools
involved have almost the same name, and that collision is not just a marketing annoyance. It shows
up in your sys.path.
So before anything else, the disambiguation, because the whole post depends on it:
- dlt (lowercase) is dlthub, an open-source Python library that extracts from sources and loads into destinations. pip install dlt.
- DLT / Delta Live Tables is a Databricks product for declarative pipelines, renamed in 2025 to Lakeflow Spark Declarative Pipelines.
This post is about the first one. I deliberately did not use the second one. The transformation
layer here is dbt, not Delta Live Tables. The shape I ended up liking is
boring and decoupled: dlt lands raw tables into a Unity Catalog schema, dbt reads that schema and
builds marts on top, and the only thing connecting them is a schema name. The full code is on GitHub:
AndreaBozzo/dlt-dbt-databricks.
The architecture in one line
dlt extracts from a source (a REST API or Postgres here) and loads raw tables into a Unity
Catalog schema → dbt reads that schema as a source and builds staging → intermediate → marts, all on a Databricks SQL warehouse.

There is no in-memory handoff, no shared file, no orchestration glue that has to be alive for the contract to hold. The boundary is a Unity Catalog schema. That single decision is what makes the two tools independent.
Outer layer: dlt as the ingestion fabric
dlt owns one schema — <catalog>.raw by default, plus a managed staging Volume in the same catalog for bulk COPY INTO — and writes nothing else. The headline
example is a declarative REST source with a parent→child relationship (posts, then each post’s
comments), loaded with merge so reruns upsert instead of duplicating.
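A minimal sketch of what a declarative parent→child source of this kind might look like, assuming dlt's built-in rest_api source. This is not the repo's listing: the JSONPlaceholder base URL, the rest_comments resource name, the pipeline name, and the build_rest_config / run helpers are all illustrative assumptions.

```python
import os


def build_rest_config(base_url: str = "https://jsonplaceholder.typicode.com/") -> dict:
    """Declarative config: posts, then each post's comments, both merged on id."""
    return {
        "client": {"base_url": base_url},  # assumed demo API, not from the repo
        # merge + primary_key means a rerun upserts rows instead of appending them.
        "resource_defaults": {"primary_key": "id", "write_disposition": "merge"},
        "resources": [
            {"name": "rest_posts", "endpoint": {"path": "posts"}},
            {
                "name": "rest_comments",  # hypothetical child table name
                "endpoint": {
                    "path": "posts/{post_id}/comments",
                    "params": {
                        # Fill post_id from the id of every row the parent yields.
                        "post_id": {
                            "type": "resolve",
                            "resource": "rest_posts",
                            "field": "id",
                        },
                    },
                },
            },
        ],
    }


def run() -> None:
    """Load the source into Databricks; needs dlt[databricks] and credentials."""
    # Imported lazily so the config above can be inspected without dlt installed.
    import dlt
    from dlt.sources.rest_api import rest_api_source

    pipeline = dlt.pipeline(
        pipeline_name="rest_to_raw",  # hypothetical name
        destination="databricks",
        dataset_name=os.environ.get("DLT_DATASET_NAME", "raw"),
    )
    print(pipeline.run(rest_api_source(build_rest_config())))
```

Keeping the config a plain dictionary means the parent→child wiring can be checked in a unit test before anything touches a warehouse.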
The things that make dlt a good outer layer are the unglamorous ones:
- Python-first. It reads like normal client code with decorators, not like fighting a distributed system to fetch JSON. The long tail of niche SaaS, internal REST endpoints, and “the vendor gave us this one weird API” is exactly where it shines and where managed connectors don’t.
- Stateful but boring. Incremental cursors, schema inference, and merge semantics are declarative. The advanced examples in the repo show write_disposition="merge" + dlt.sources.incremental(...) compiling down to an idempotent MERGE INTO, and schema contracts (evolve/freeze) that emit real PRIMARY KEY/FOREIGN KEY constraints on the Databricks tables.
- Polyglot. The same pipeline runs locally, in CI, or inside a Databricks job. There’s also a Postgres source (sql_database) and an Iceberg table-format variant; Databricks is one destination among several, not the center of gravity.
By the time dlt is done, you have Delta tables in a governed Unity Catalog schema — not a Hive
metastore, not an ungoverned bucket. That’s the precondition the whole rest of the design leans on.
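To make "idempotent" concrete, here is a toy in-memory model of a merge load with an incremental cursor. It is not dlt's implementation and runs no SQL. The load_incremental name, the updated_at cursor field, and the dict-as-table are assumptions chosen for illustration.

```python
def load_incremental(target, source_rows, state,
                     cursor_field="updated_at", primary_key="id"):
    """Toy merge load: filter by the stored cursor, upsert by key, advance the cursor."""
    last_value = state.get("last_value")
    # The boundary is inclusive, so rows sitting exactly at the cursor are re-read.
    batch = [
        row for row in source_rows
        if last_value is None or row[cursor_field] >= last_value
    ]
    for row in batch:
        # MERGE semantics: matched key -> overwrite, unmatched key -> insert.
        target[row[primary_key]] = dict(row)
    if batch:
        state["last_value"] = max(row[cursor_field] for row in batch)
    return len(batch)
```

Because re-read rows overwrite themselves by key, running the same load twice leaves the table unchanged. That is the property a real MERGE INTO on a primary key provides.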
The contract: a Unity Catalog schema, nothing more
This is the part worth being precise about, because it’s the entire point of pairing the two tools.
- dlt owns raw. Each run loads into <catalog>.<DLT_DATASET_NAME> (default raw), alongside its own bookkeeping tables (_dlt_loads, _dlt_pipeline_state, _dlt_version). Leave those alone.
- dbt reads raw as a source, never as a hardcoded table. A sources.yml declares the schema, and staging models select from {{ source('raw', 'rest_posts') }}. The clever bit: the source schema resolves from {{ env_var('DLT_DATASET_NAME', 'raw') }}, the same env var the dlt pipelines read, so renaming the landing schema can’t desync the two sides.
- dbt owns analytics. Models materialize into <catalog>.<DATABRICKS_SCHEMA>. Nothing ever writes back into raw.
Because the boundary is a schema and not a process, the two tools stay fully decoupled. You can run them on different schedules, from different machines, or one without the other, and the contract still holds. dbt doesn’t know or care that a Python project feeds its sources; dlt doesn’t know anything is reading them.
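On the Python side, the same resolution rule could be sketched as below. The landing_schema and qualified helpers are hypothetical names; only the DLT_DATASET_NAME variable and its raw default come from the setup described above.

```python
import os

DEFAULT_DATASET = "raw"


def landing_schema(env=None) -> str:
    """Mirror dbt's env_var('DLT_DATASET_NAME', 'raw'): fall back only when unset."""
    env = os.environ if env is None else env
    return env.get("DLT_DATASET_NAME", DEFAULT_DATASET)


def qualified(catalog: str, table: str, env=None) -> str:
    """Three-part Unity Catalog name of a landed table."""
    return f"{catalog}.{landing_schema(env)}.{table}"
```

For example, qualified("main", "rest_posts") gives main.raw.rest_posts when the variable is unset (main being a placeholder catalog), and follows the variable as soon as it is set.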

Inner layer: dbt, staging → marts
The transformation side is ordinary dbt, and the repo carries two distinct datasets through it.
The dlt-fed path. Staging is a 1:1 cleanup of the landed tables — rename and cast, no business
logic, since dlt already normalized camelCase → snake_case on the way in.
A real, messy analytics layer. The more interesting models sit on top of
samples.healthverity.claims_sample_synthetic — a synthetic healthcare-claims dataset (~410k rows,
one row per claim service line) that ships in every Databricks workspace. It’s a good stress test
because the raw data is all strings, ~57% of line_charge is NULL, and a few thousand rows are
negative (claim reversals). Staging casts and flags that; an intermediate model rolls line-detail up
to one row per claim; the mart answers an actual business question — charged vs. allowed amounts by
payer segment.
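The real models are dbt SQL. As a language-neutral illustration of the roll-up logic only, here is a small Python sketch. Apart from line_charge, every column name (claim_id, payer_segment, line_allowed) and every helper name is an assumption, not taken from the dataset or the repo.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def to_amount(raw):
    """Cast a raw string to Decimal; NULL, blank, or unparseable stays None."""
    if raw is None or not raw.strip():
        return None
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None


def rollup_claims(lines):
    """Intermediate step: service lines -> one row per claim."""
    claims = {}
    for line in lines:
        charge = to_amount(line.get("line_charge"))
        allowed = to_amount(line.get("line_allowed"))
        claim = claims.setdefault(line["claim_id"], {
            "payer_segment": line["payer_segment"],
            "charged": Decimal(0),
            "allowed": Decimal(0),
            "has_reversal": False,
        })
        # NULL amounts contribute nothing to the totals.
        claim["charged"] += charge or 0
        claim["allowed"] += allowed or 0
        # A negative charge marks a claim reversal.
        claim["has_reversal"] = claim["has_reversal"] or (
            charge is not None and charge < 0
        )
    return claims


def mart_by_segment(claims):
    """Mart step: charged vs. allowed totals per payer segment."""
    out = defaultdict(lambda: {"claims": 0, "charged": Decimal(0), "allowed": Decimal(0)})
    for claim in claims.values():
        segment = out[claim["payer_segment"]]
        segment["claims"] += 1
        segment["charged"] += claim["charged"]
        segment["allowed"] += claim["allowed"]
    return dict(out)
```

The point of the sketch is the ordering: cast and flag first, aggregate second, so NULLs and reversals are handled once in staging rather than in every mart.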
This is the honest division of labor: dlt is close to where HTTP and JSON live; dbt is close to where SQL, tests, surrogate keys, and incremental merge strategies live. Neither tool is asked to do the other’s job.

What actually broke
The intro promised this, so here are the real ones — none of them architectural, all of them the kind of thing you only find by running it.
The two dlts collide on sys.path. Databricks serverless ships a built-in dlt module — the
Lakeflow/DLT one — whose import hook shadows dlthub’s dlt. import dlt in a notebook can resolve to
the wrong package entirely. The fix in the repo is an ugly little shim that temporarily strips the
first sys.meta_path finder so the real package wins.
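A sketch of the idea, not the repo's actual shim: a context manager that drops the first meta-path finder for the duration of one import and then restores the list. The helper names are hypothetical. It assumes the shadowing hook really sits at sys.meta_path[0]; if it does not, this removes the wrong finder.

```python
import contextlib
import importlib
import sys


@contextlib.contextmanager
def first_finder_removed():
    """Temporarily drop sys.meta_path[0], restoring the original list on exit."""
    saved = list(sys.meta_path)
    try:
        if sys.meta_path:
            del sys.meta_path[0]
        yield
    finally:
        sys.meta_path[:] = saved


def import_past_shadow(name: str):
    """Import `name` while the first meta-path finder is out of the way."""
    # Forget any copy the shadowing finder already loaded, or the cache wins.
    stale = [key for key in sys.modules if key == name or key.startswith(name + ".")]
    for key in stale:
        del sys.modules[key]
    with first_finder_removed():
        return importlib.import_module(name)
```

In a notebook this would be used as dlt = import_past_shadow("dlt"). Restoring the list in a finally block matters: the platform's own hook is back in place for every later import, even if this one fails.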
If you ever doubted that the naming collision is a real problem and not just a pedantic footnote: it’s
six lines of meta_path surgery in production code.
Serverless doesn’t always have a SQL warehouse for dlt’s destination. dlt’s Databricks destination
wants a warehouse and a staging volume. Inside a serverless job that path isn’t always available, so
the REST pipeline has a --load-mode spark fallback that just writes a Spark DataFrame with df.write.saveAsTable. The catch:
that fallback must emit the exact same snake_case columns plus _dlt_load_id that dlt would,
or dbt’s staging models break downstream. Two code paths, one schema contract to honor.
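One way to keep the fallback honest is to shape every record through a single function before it reaches Spark. The sketch below is an approximation under stated assumptions: the regex covers simple camelCase only (dlt's own naming convention handles more cases), and the function names and sample field names are hypothetical.

```python
import re

# Boundary between a lowercase letter or digit and the uppercase letter after it.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Approximate camelCase -> snake_case for simple identifiers."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


def shape_for_contract(record: dict, load_id: str) -> dict:
    """Rename keys to snake_case and stamp the row with _dlt_load_id."""
    row = {to_snake_case(key): value for key, value in record.items()}
    row["_dlt_load_id"] = load_id
    return row


def missing_columns(row: dict, expected) -> list:
    """Columns the staging models expect that this row does not carry."""
    return sorted(set(expected) - set(row))
```

Checking missing_columns against the column list the staging models select, before calling saveAsTable, turns a downstream dbt failure into an ingestion-time one.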
Keeping dbt’s source schema in sync with dlt’s dataset name is the silent failure mode — point them at different schemas and dbt cheerfully builds nothing. Resolving both from one env var is what makes that a non-issue.
Orchestration: local first, then a Bundle
For demos there’s a small local runner (~50 lines) that calls dlt, then dbt build, in one process — enough to
prove the handoff. For production the repo ships a Databricks Asset Bundle: one Job, two tasks,
dbt_build gated on dlt_ingest succeeding, with versioned, deployable infrastructure instead of a
script.
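The gating logic is the same in both places: the transform step runs only if ingestion succeeded. A minimal sketch of a local runner along those lines, assuming dbt-core's programmatic dbtRunner entry point (dbt-core 1.5 or later). The run_local and dbt_build names are hypothetical, and this is not the repo's runner.

```python
import sys
from typing import Callable


def dbt_build() -> int:
    """Run `dbt build` in-process and return a shell-style exit code."""
    # Imported lazily so the gating logic can be tested without dbt installed.
    from dbt.cli.main import dbtRunner

    result = dbtRunner().invoke(["build"])
    return 0 if result.success else 1


def run_local(ingest: Callable[[], None], transform: Callable[[], int]) -> int:
    """Run ingestion, then the transform step only if ingestion did not raise."""
    try:
        ingest()
    except Exception as exc:
        print(f"ingestion failed, skipping dbt: {exc}", file=sys.stderr)
        return 1
    return transform()
```

Passing the two steps in as callables keeps the runner trivial to test and mirrors the Job's two-task shape, with dbt_build depending on dlt_ingest.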

There’s also an offline doctor.py preflight that checks env vars, parses the dbt project, and runs
bundle validate without touching the warehouse — so a new contributor can tell whether the repo
is wired correctly before spending a cent of compute. Every example is validated against a real
workspace; no dead demos.
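The env-var part of such a preflight can be very small. The sketch below is an assumption about what a check like this might do, not a copy of doctor.py. The preflight name and the choice of required variables are illustrative; only DLT_DATASET_NAME and DATABRICKS_SCHEMA come from the setup described earlier.

```python
import os


def preflight(env=None, required=("DATABRICKS_SCHEMA",)) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = [f"missing env var: {name}" for name in required if not env.get(name)]
    landing = env.get("DLT_DATASET_NAME", "raw")
    # dbt must never materialize into the schema dlt owns.
    if env.get("DATABRICKS_SCHEMA") == landing:
        problems.append(f"dbt would materialize into the landing schema '{landing}'")
    return problems
```

Returning problems as data rather than exiting on the first one lets the caller print the whole list in a single pass, and none of it needs a network call.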
Why one Unity Catalog still matters
Even though the inner layer is dbt and not a Databricks-native pipeline, terminating everything into Unity Catalog is what makes the split pay off:
- Access control is on catalogs and schemas, not random buckets — analysts, ML engineers, and service principals all see the same permissions.
- Lineage traverses from dbt marts back through staging to the dlt-landed raw tables, and on into whatever reads them: one graph, both tools.
- One governance surface. Tables, views, functions, and any models you register all live under the same catalog.schema namespace: one place for permissions, audit, and lineage, instead of separate stories for data, code, and ML.

That’s the actual reason to send dlt’s output straight into UC instead of staging it somewhere and re-importing later: it turns an ingestion library into a first-class citizen of a governed lakehouse, without writing a line of Spark.
A checklist, mapped to the repo
- Set up Unity Catalog. A catalog with a raw schema (dlt) and an analytics schema (dbt), and a running SQL warehouse. (make doctor checks the rest.)
- Configure dlt. Install dlt[databricks], point the destination at your catalog, and model each source as a rest_api_source / @dlt.resource with merge + incremental cursors where it makes sense.
- Make the schema the contract. Resolve dbt’s source schema and dlt’s DLT_DATASET_NAME from one env var. Don’t let dlt write outside raw.
- Build dbt on top. staging → intermediate → marts reading raw as a source, materializing into analytics, with tests on the keys.
- Promote orchestration. Start with the local runner; ship the Asset Bundle when you want it in production.
At the end you have one catalog boundary, two tools doing the jobs they’re actually good at, and a clear place to hang governance. It’s not the only way to use Databricks — but it stopped feeling like overlapping products and started feeling like a lattice where each piece does exactly the amount of work it’s good at. And no, none of it is “DLT.”
