Ingesting semi-structured data without losing your sanity
JSON drift is the second most common production bug after timezone bugs. The defenses every staging layer needs.
The day Mongo added a nested object
A team I knew was syncing a MongoDB collection into Postgres. The pipeline pulled user.email as a top-level field for a year. One Tuesday, the backend team "improved" the schema: email became contact.email, nested inside a new object.
The pipeline didn't error — it just started writing NULL for everyone's email. Marketing's campaigns that week went out to empty recipient lists, and nobody noticed for four days.
The defense is JSONB-first ingestion plus typed extraction. Land the whole document as jsonb so nothing is lost. Extract the fields you currently rely on into typed columns as a separate, testable step. New fields appear in the JSONB; existing pipelines keep using the typed columns; rename events show up in the type-extraction step's tests, not in production.
JSONB-first ingestion, drawn
Mongo doc { _id, name, email }              ┐
Mongo doc { _id, name, contact: { email } } ┤ ◄ rename
                    ↓
staging.mongo_users_raw
  doc jsonb (lossless), loaded_at, source_id
                    ↓  typed-extraction view
staging.mongo_users_typed
  email = COALESCE(doc->>'email', doc#>>'{contact,email}')
                    ↓
marts.dim_user: alarm fires when email null-rate spikes
Bronze stays lossless — every Mongo field, every nested object, every future column you didn't anticipate. The rename surfaces in the typed view's tests (the COALESCE catches both shapes today, but a third shape tomorrow shows up as NULLs in the typed column), not in production at 3 AM.
-- Land the whole document, lossless:
CREATE TABLE staging.mongo_users_raw (
doc jsonb NOT NULL,
source_id text GENERATED ALWAYS AS (doc->>'_id') STORED PRIMARY KEY, -- insert fails loudly if _id is missing
loaded_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX mongo_users_raw_doc_gin ON staging.mongo_users_raw USING gin (doc);
-- Extract typed columns as a *separate* model:
CREATE OR REPLACE VIEW staging.mongo_users_typed AS
SELECT
source_id AS user_id,
doc->>'name' AS name,
COALESCE(doc->>'email', doc#>>'{contact,email}') AS email, -- handles both shapes
(doc->>'created_at')::timestamptz AS created_at,
loaded_at
FROM staging.mongo_users_raw;
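The alarm from the diagram can be a scheduled null-rate check against the typed view. A sketch (the 5% threshold, the one-day window, and the scheduling mechanism are assumptions, not part of the pipeline above):

```sql
-- Returns a row only when the alarm should fire:
-- email null-rate over the last day's loads exceeds 5%.
SELECT
  count(*) FILTER (WHERE email IS NULL)::numeric
    / NULLIF(count(*), 0) AS email_null_rate
FROM staging.mongo_users_typed
WHERE loaded_at > now() - interval '1 day'
HAVING count(*) FILTER (WHERE email IS NULL)::numeric
         / NULLIF(count(*), 0) > 0.05;
```

Wire "query returned a row" to whatever alerting you already have; the point is that the check runs against the typed view, where a third shape of the email field shows up as NULLs.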
-- Test the typed view (in dbt or Great Expectations):
-- - email IS NOT NULL for X% of rows
-- - created_at is parseable
-- - schema diff alert if doc->>'name' becomes NULL for >5% of rows
pgvector for retrieval-augmented generation (RAG)
Store embeddings (typically 1536-d for OpenAI, 768 for many open models) in a vector column; index with HNSW for sub-second nearest-neighbor lookup over millions of rows.
CREATE EXTENSION vector;
CREATE TABLE knowledge (id bigserial PRIMARY KEY, text text, embedding vector(1536));
CREATE INDEX ON knowledge USING hnsw (embedding vector_cosine_ops);
SELECT id, text, embedding <=> $query_vec AS distance -- <=> is cosine distance
FROM knowledge
ORDER BY distance
LIMIT 5;
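Two details worth knowing when trying this, as a sketch ($embedding_vec is a placeholder in the same spirit as $query_vec above): pgvector accepts embeddings in its bracketed text format, and HNSW recall at query time is tunable per session.

```sql
-- Embeddings arrive as pgvector's text format: '[0.1, 0.2, ...]'
INSERT INTO knowledge (text, embedding) VALUES ('some chunk', $embedding_vec);

-- Trade speed for recall at query time (pgvector's default ef_search is 40)
SET hnsw.ef_search = 100;
```

Raising `hnsw.ef_search` widens the candidate search during traversal, which improves recall at the cost of latency; tune it against a labeled sample of queries rather than guessing.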
For small-scale RAG (≤10M docs) pgvector is enough. Beyond that, dedicated vector DBs (Qdrant, Pinecone, Weaviate) earn their keep with sharding and managed scaling.
Practice the JSONB-extraction pattern. The `events` fixture has a `props` JSONB column. Return the top 3 SKUs viewed, with their `view_count`, sorted by count desc then sku asc. Use `props->>'sku'` to extract.
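One possible answer, as a sketch — the fixture's exact shape is an assumption, and if `events` logs more than product views you would add an event-type filter to the WHERE clause:

```sql
SELECT props->>'sku' AS sku,
       count(*)      AS view_count
FROM events
WHERE props->>'sku' IS NOT NULL
GROUP BY 1
ORDER BY view_count DESC, sku ASC
LIMIT 3;
```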
Takeaway: land semi-structured data as jsonb (lossless), extract typed columns as a separate, testable view. JSON drift becomes a tracked event in the typed-extraction tests, not a four-day silent outage. pgvector for small-scale RAG; dedicated vector DBs when you outgrow Postgres.