Codabra

Ingesting semi-structured data without losing your sanity

JSON drift is the second most common production bug after timezone bugs. The defenses every staging layer needs.

The day Mongo added a nested object

A team I knew was syncing a MongoDB collection into Postgres. The pipeline pulled user.email as a top-level field for a year. One Tuesday, the backend team "improved" the schema: email became contact.email, nested inside a new object.

The pipeline didn't error — it just started writing NULL for everyone's email. Marketing's email campaigns that week landed in nobody's inbox. Nobody noticed for four days.

The defense is JSONB-first ingestion plus typed extraction. Land the whole document as jsonb so nothing is lost. Extract the fields you currently rely on into typed columns as a separate, testable step. New fields appear in the JSONB; existing pipelines keep using the typed columns; rename events show up in the type-extraction step's tests, not in production.

JSONB-first ingestion, drawn

  Mongo doc   { _id, name, email }                      ┐
  Mongo doc   { _id, name, contact: { email } } ◄ rename┤
                                                         ↓
              staging.mongo_users_raw
                  doc jsonb (lossless), loaded_at, source_id
                                                         ↓  typed-extraction view
              staging.mongo_users_typed
                  email = COALESCE(doc->>'email', doc#>>'{contact,email}')
                                                         ↓  alarm fires when
              marts.dim_user                                email null-rate spikes

Bronze stays lossless — every Mongo field, every nested object, every future column you didn't anticipate. The rename surfaces in the typed view's tests (the COALESCE catches both shapes today, but a third shape tomorrow shows up as NULLs in the typed column), not in production at 3 AM.

JSONB-first staging — the defense
-- Land the whole document, lossless:
CREATE TABLE staging.mongo_users_raw (
  doc          jsonb       NOT NULL,
  source_id    text GENERATED ALWAYS AS (doc->>'_id') STORED PRIMARY KEY,
  loaded_at    timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX mongo_users_raw_doc_gin ON staging.mongo_users_raw USING gin (doc);
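
-- Illustrative ad-hoc probe (not part of the pipeline): the GIN index makes
-- jsonb containment queries cheap, e.g. counting how many docs already carry
-- the new nested shape:
SELECT count(*) AS docs_with_new_shape
FROM staging.mongo_users_raw
WHERE doc @> '{"contact": {}}';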

-- Extract typed columns as a *separate* model:
CREATE OR REPLACE VIEW staging.mongo_users_typed AS
SELECT
  source_id                                         AS user_id,
  doc->>'name'                                      AS name,
  COALESCE(doc->>'email', doc#>>'{contact,email}')  AS email,  -- handles both shapes
  (doc->>'created_at')::timestamptz                 AS created_at,
  loaded_at
FROM staging.mongo_users_raw;

-- Test the typed view (in dbt or Great Expectations):
--  - email IS NOT NULL for X% of rows
--  - created_at is parseable
--  - schema diff alert if doc->>'name' becomes NULL for >5% of rows
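
The null-rate check itself is one query. A sketch in plain SQL, where the 5% threshold and one-day window are illustrative (a dbt or Great Expectations test would wrap the same logic):

-- Drift alarm: what share of the last day's loads have a NULL email?
SELECT
  count(*) FILTER (WHERE email IS NULL)::numeric
    / NULLIF(count(*), 0) AS email_null_rate
FROM staging.mongo_users_typed
WHERE loaded_at > now() - interval '1 day';
-- Page someone when email_null_rate > 0.05.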

pgvector for retrieval-augmented generation (RAG)

Store embeddings (typically 1536-d for OpenAI models, 768-d for many open models) in a vector column; index with HNSW for sub-second nearest-neighbor lookup over millions of rows.

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE knowledge (id bigserial PRIMARY KEY, content text, embedding vector(1536));
CREATE INDEX ON knowledge USING hnsw (embedding vector_cosine_ops);

-- $1 is the query embedding, bound as a vector(1536) parameter;
-- <=> is cosine distance under vector_cosine_ops.
SELECT id, content, embedding <=> $1 AS distance
FROM knowledge
ORDER BY distance
LIMIT 5;
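
If recall or latency needs tuning, HNSW exposes a few knobs: m and ef_construction are fixed at build time, hnsw.ef_search at query time. The build values below are pgvector's defaults; the ef_search value is illustrative:

-- Build-time: m = links per node, ef_construction = candidate-list size.
-- Raise them for better recall at the cost of build time and index size.
CREATE INDEX ON knowledge USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time: candidates examined per search (pgvector default is 40).
SET hnsw.ef_search = 100;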

For small-scale RAG (≤10M docs), pgvector is enough. Beyond that, dedicated vector DBs (Qdrant, Pinecone, Weaviate) earn their keep with sharding and managed scaling.

Practice the JSONB-extraction pattern. The `events` fixture has a `props` JSONB column. Return the top 3 SKUs viewed, with their `view_count`, sorted by count desc then sku asc. Use `props->>'sku'` to extract.
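
One possible shape of the answer, as a sketch. It assumes the fixture flags page views with an event_type column; that column name and its 'view' value are assumptions, so adjust to the fixture:

SELECT
  props->>'sku' AS sku,
  count(*)      AS view_count
FROM events
WHERE event_type = 'view'   -- assumed column and value; adjust to the fixture
  AND props ? 'sku'         -- ignore events without a sku
GROUP BY props->>'sku'
ORDER BY view_count DESC, sku ASC
LIMIT 3;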

Takeaway: land semi-structured data as jsonb (lossless), extract typed columns as a separate, testable view. JSON drift becomes a tracked event in the typed-extraction tests, not a four-day silent outage. pgvector for small-scale RAG; dedicated vector DBs when you outgrow Postgres.