2025 · 03 · 0411 min readdata engineering

Why I'd rather build the same Redshift query twice.

A defence of boring, idempotent ELT — why running the same SQL again on Monday morning should produce the same numbers, and what that actually costs to enforce in a real warehouse.

Every data team I've worked on eventually has the same argument. It starts on a Slack thread at 09:47 on a Tuesday, sounds like "the dashboard moved overnight again," and ends with someone — usually the most senior person — saying, almost as a confession: "yeah, just re-run the DAG and see."

That sentence has always bothered me. Not because re-running is bad, but because the only reason it's the answer is that we secretly don't trust our own pipelines to be repeatable. We don't trust that running the same SQL over the same source data will produce the same output. So we hit retry and hope the universe is in a better mood the second time.

This post is a long argument with that little voice. It is about idempotency as a first-class design goal in ELT — specifically in Amazon Redshift, because that's the warehouse I work in most days at Getir — and about the unglamorous patterns that make "just run it again" a feature rather than a prayer.

What "idempotent" actually means here

The word gets thrown around a lot, so let me pin it down. A pipeline step is idempotent when running it N times over the same input window produces the same output state as running it once. That is it. No more, no less.

It does not mean "fast". It does not mean "incremental". It does not mean "exactly once" in the distributed-systems sense — those are different properties and people conflate them constantly. A full TRUNCATE-and-rebuild can be perfectly idempotent and also incredibly slow. A clever incremental merge can be lightning fast and silently non-idempotent, drifting by a few rows each run until your finance team notices in a quarterly report.

The test I use: if I take a downstream table, drop it, then re-run the last 30 days of pipeline runs in their original order, do I get bit-for-bit the same table back? If not, the pipeline is not idempotent — it is just usually correct.

Why this matters more in Redshift than in your laptop

Redshift, like most analytical warehouses, gives you very few transactional guarantees across multiple statements in a long-running ELT job. You can wrap things in BEGIN ... COMMIT, but you are still composing dozens of separate SELECTs, INSERTs, and DELETEs against tables that are simultaneously being written to by ingestion jobs and read by hundreds of dashboards.

That means three boring failure modes show up constantly:

  1. Partial runs. A job dies at step 7 of 12. Steps 1–6 already mutated state. Re-running from the start now double-inserts.
  2. Late-arriving facts. The order table for 2025-03-01 gets a refund row on 2025-03-04. Your daily aggregate for the 1st is now stale, but your incremental job only looked at "yesterday".
  3. Retries on the orchestrator. Airflow / Step Functions / cron retries a task that already half-succeeded. The second attempt sees a different world.

Idempotency is not a purity exercise. It is the cheapest, most boring insurance policy against all three.

Pattern 1 — Window deletes, then insert

The single most useful pattern I keep reaching for is delete-then-insert over a window. It looks like this, simplified:

BEGIN;

DELETE FROM mart.fct_orders_daily
WHERE order_date BETWEEN :start_date AND :end_date;

INSERT INTO mart.fct_orders_daily
SELECT order_date,
       country,
       COUNT(*)                  AS orders,
       SUM(gross_amount_try)      AS gmv_try
FROM   raw.orders
WHERE  order_date BETWEEN :start_date AND :end_date
GROUP BY 1, 2;

COMMIT;

Three things matter here. One: the delete and the insert are over the exact same window. If you parameterize one and forget the other, you have just invented a new way to corrupt history. Two: the window is a closed interval on a deterministic column (order_date), not on ingestion_ts or _loaded_at — those move when you re-load. Three: it's all in one transaction, so partial failure rolls back cleanly.

This pattern is dumb. It is also, in my experience, the right default for ~70% of fact tables under, say, a hundred million rows per partition. The temptation to be clever about merge logic is almost always more expensive than the cost of re-aggregating a 7-day window.

Pattern 2 — Dedup with ROW_NUMBER(), deterministically

Source systems lie. CDC streams duplicate. Webhooks fire twice. The job of a staging layer is not to pretend they don't — it is to collapse them to a canonical row in a way that survives a re-run.

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC,
                    _ingested_at DESC,
                    source_file_name DESC
         ) AS rn
  FROM raw.orders_cdc
  WHERE order_date BETWEEN :start AND :end
)
SELECT * FROM ranked WHERE rn = 1;

The whole game is in the ORDER BY. Notice that it has three tiebreakers, all of which are stable across re-runs. If you sort only by updated_at and two rows share a millisecond, Redshift will pick one arbitrarily — and the choice can flip on the next run, silently changing your downstream numbers.

Rule of thumb: every ROW_NUMBER() in a production ELT job should sort by enough columns that any sane re-run produces an identical ranking. Random tie-breaking is the slow leak you'll never debug at 2 a.m.

Pattern 3 — "Soft re-run" windows for late-arriving facts

Refunds, returns, support credits, B2B reconciliations. Anything that touches money and humans will, eventually, arrive late. If your daily job only ever re-processes "yesterday", you will quietly understate refunds forever.

The cheapest fix I've found is to make every fact-table job re-process a trailing window, not a single day:

Combine this with Pattern 1 and you get the property I actually care about: the last 35 days of any fact table are always a deterministic function of the latest snapshot of the source system, regardless of when the job ran or how many times. Older partitions are frozen, both for cost and for auditability.

Pattern 4 — Don't trust CURRENT_DATE

This one looks small and it is not. If your job has a line like:

WHERE order_date = CURRENT_DATE - 1

…then it is not idempotent. It is "idempotent within a single calendar day", which is different. Run it Sunday at 23:59 and Monday at 00:01 and you process two different windows. Backfill it three weeks later and you get nothing at all.

Every job I write now takes :run_date as a parameter, passed in by the orchestrator, and uses only that. The job becomes a pure function of (run_date, source_data). That is the actual unit you want to be able to replay.

What this costs

I want to be honest about the trade-off, because "idempotent everything" is also how you set fire to your Redshift bill.

The trade I keep making is: I would rather pay 1.5× the compute for a pipeline that I can drop, re-run, and trust, than save money on a clever incremental job that I cannot reason about at 3 a.m. when a stakeholder is asking why GMV moved by 2%.

The cheapest data pipeline is the one you can throw away and rebuild from a single deterministic function of the source. Everything else is a story you tell yourself about how the data probably got that way.
— a note to my future self

The dashboard test

Here is the test I would love every data team to run, once a quarter. Pick three of your most-viewed dashboards. For each: drop the underlying mart tables, re-run the last 60 days of pipelines from scratch, and compare every number on the dashboard before and after, to four decimal places.

If they match — you have an idempotent ELT layer, congratulations, you can sleep. If they don't — and the first time I ran this, they did not — the gap between the two is a precise, dollar-denominated measure of how much your stakeholders think they trust the numbers vs. how much they actually should.

That gap is what I am trying to close, one boring SQL pattern at a time.

redshifteltsqldata engineeringidempotency