How we cut a fintech client's nightly sync from six hours to twelve minutes

The client is a Series B fintech with roughly $40M ARR, doing reconciliation between Stripe, an internal Postgres database, and Snowflake. Every night a job ran that pulled the day's charges, joined them against ledger entries, and produced a reconciled report for the ops team to review at 9 AM Eastern.

The job was finishing at 8:55 AM. On bad days, 9:10. The ops team was stuck waiting, and any failure pushed a full re-run into business hours, which doubled the AWS bill and made the engineers nervous.

They asked us for a quick audit. We told them we could probably get it under thirty minutes. We got it under twelve.

What we found in week one

The job was a single 2,200-line Python script. It had been written two years earlier by a contractor who had since left. Nobody on the current team had touched it. There were three obvious problems and a handful of subtle ones.

First, every run was a full reload. The script wiped the previous day's slice and re-pulled everything, even though 99.5% of the rows were identical. Stripe's API was being asked for 18 months of historical charges, every night, because somebody had forgotten to set a date floor.

Second, the script used pandas.read_json in a loop, parsing one Stripe object at a time into a DataFrame, then concatenating. Each concat doubled the memory footprint. By the time it got to month 14, the box was paging.

Third, the load into Snowflake was row-by-row INSERTs. We measured 600 ms per insert with a warm connection. Multiplied by 8 million rows, that math gets bad fast.

What we changed

We didn't rewrite the whole thing. Three targeted changes did 90% of the work.

The Stripe pull switched to incremental: we store the last successful run timestamp in a small Postgres table, and only ask for charges since that point. The first incremental run pulled 11,000 rows instead of 8 million. Subsequent runs pull about 9,000 a night.

Pandas came out entirely. We replaced it with a generator that streams Stripe objects, validates them with a small Pydantic model, and yields normalized dicts. Memory dropped from 14 GB peak to under 400 MB.

The Snowflake load uses COPY INTO from a staged Parquet file now. We write the normalized batch to S3, then a single COPY ingests it. That step alone went from 4 hours to 90 seconds.

The dbt models on top didn't change, except we added a couple of incremental materializations so the downstream marts also stopped rebuilding from scratch.

Results, with numbers

Wall-clock time: 6 hours, 4 minutes → 11 minutes, 47 seconds.
Snowflake compute spend: down 71% (compute hours dropped, not the per-second rate).
Reconciliation finished at 5:12 AM ET instead of 8:55. Ops team starts the day looking at finished numbers.
The script went from 2,200 lines to 480.

What we'd do differently

We spent two days early on chasing a theory that the bottleneck was Stripe's API rate limits. It wasn't. We should have profiled before theorizing. We now run a 30-minute py-spy capture as the first step on every performance engagement, before touching any code.

Most jobs that look "slow because of data volume" are actually slow because nobody has profiled them. The data is rarely the problem. The code that handles the data, almost always, is.

How we cut a fintech client's nightly sync from six hours to twelve minutes

What we found in week one

What we changed

Results, with numbers

What we'd do differently

More in this category

A two-week internal tool that saved an ops team forty hours a week

Got a topic or a build? Tell us.