Thursday, September 19, 2019

How to bulk load 1.6 billion records into AWS RDS Postgres

The nightmare scenario. We had a star schema on AWS Aurora Postgres (compatible with Postgres 9.6). I designed the dimension tables with 32-bit auto-increment keys (the Postgres SERIAL data type). Everything was fine for a few months. Then the data loads started failing: the sequences behind the 32-bit keys had hit their maximum value.

How could this have happened? The dimension tables all had record counts under 100,000. It turns out that we loaded the dimension tables using simple INSERT ... ON CONFLICT DO NOTHING statements.
The 'ON CONFLICT DO NOTHING' is a cool Postgres feature. It means that we don't have to select before inserting; we just do batch inserts and let the database ignore any record that is already there. The problem is that when an insert conflicts, THE SEQUENCE IS STILL INCREMENTED, which is how we ran out of numbers.
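The failure mode is easy to reproduce in miniature. This is a sketch, not our actual schema; the table and column names are hypothetical:

```sql
-- Miniature reproduction of the failure (names hypothetical).
CREATE TABLE dim_customer (
    customer_key  SERIAL PRIMARY KEY,      -- 32-bit: max value 2147483647
    customer_name TEXT NOT NULL UNIQUE
);

INSERT INTO dim_customer (customer_name) VALUES ('acme')
ON CONFLICT (customer_name) DO NOTHING;    -- inserts; sequence is now at 1

INSERT INTO dim_customer (customer_name) VALUES ('acme')
ON CONFLICT (customer_name) DO NOTHING;    -- row is ignored, but nextval()
                                           -- already ran: sequence is at 2
-- Reload the same batch every day and the sequence advances by the full
-- batch size each time, even though the table never grows.
```

Postgres evaluates the DEFAULT (and so calls nextval()) before the conflict check, so every attempted row burns a key value whether or not it lands.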

Our fact table had about 1.6 billion records. We looked at changing the data types in place, but it looked like it would take too long. So we made a new database with a modified version of the star schema that uses 64-bit keys (the BIGSERIAL data type) and started migrating the data from the old database to the new one. We were able to use pg_dump / pg_restore to migrate the dimension tables. But we ran into a number of problems with the big fact table.
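For the small dimension tables, a straight piped dump-and-restore is enough. A sketch of that step, with hypothetical hostnames, database names, and table pattern:

```shell
# Data-only dump of the dimension tables, restored into the new schema
# where the BIGSERIAL tables already exist (all names here are
# illustrative, not our real ones)
pg_dump -h old-db.example.com -U loader -Fc --data-only -t 'dim_*' warehouse \
  | pg_restore -h new-db.example.com -U loader --no-owner -d warehouse_v2
```

The --data-only flag matters here: the target tables were created from the modified schema with 64-bit keys, so we only want the rows, not the old 32-bit table definitions.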

We initially tried to use ETL tools to migrate the facts in hourly and daily tranches. These loads would start out fast. I would walk away optimistically, only to come back a few hours later and discover that the data loads had slowed to a crawl. The cause was RDS IOPS throttling. Unless you are willing to pay for provisioned IOPS, RDS gp2 storage gives you a baseline of 3 IOPS per provisioned GB, plus a credit balance that allows bursts up to 3,000 IOPS for roughly 30 minutes; once the credits are spent, your I/O is throttled back to the baseline. We had provisioned the DB with 100GB, intending to let it grow automatically, which meant a baseline of only 300 IOPS. The IOPS throttle means you want to provision 1000GB at the outset, so that 3,000 IOPS is your sustained baseline rather than a burst. (More info on this can be found in the RDS storage documentation under 'I/O Credits and Burst Performance'.)
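The arithmetic behind that sizing decision is simple, since the gp2 baseline scales linearly with allocated storage:

```shell
# gp2 baseline is 3 IOPS per provisioned GB; burst tops out at 3,000 IOPS
echo "$((100 * 3)) baseline IOPS at 100 GB"    # burst credits drain fast
echo "$((1000 * 3)) baseline IOPS at 1000 GB"  # sustained 3,000, no throttle
```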

Then we ran into timeout problems with pg_dump / pg_restore.  Long-running jobs were vulnerable to network disconnects.  So we had to set up .pgpass files so that we could run pg_dump with nohup (without needing to manually type a password).  We ran pg_dump using the directory format for the single fact table.  This got us a large compressed tab-delimited file with the 1.6 billion facts.  We tried to load this in the new schema using pg_restore, but it would consistently fail after running for a few hours.
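The .pgpass setup is a one-time configuration step. A sketch with placeholder credentials and hostnames (the real values obviously differ):

```shell
# ~/.pgpass format is host:port:database:user:password, one line per server
echo 'old-db.example.com:5432:warehouse:loader:s3cret' >> ~/.pgpass
chmod 600 ~/.pgpass   # libpq ignores the file unless it is user-only

# With the password on file, the dump can run unattended under nohup
# (-Fd is the directory format; the job survives a terminal disconnect):
# nohup pg_dump -h old-db.example.com -U loader -Fd -f fact_dump \
#       -t fact_table warehouse > pg_dump.log 2>&1 &
```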

So after some intensive web searches, we came up with the following solution:

1. Split the big file into smaller files with a million records each:

    gunzip -c 3942.dat.gz | split -l 1000000

    This gave us about 1600 files to load: xaa, xab, xac, etc.
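Before committing to the real 1600-file run, the split-and-count flow can be sanity-checked on a small synthetic file to confirm no records are lost (the filenames below are made up):

```shell
# Dry-run sketch on synthetic data: confirm split loses no records
seq 1 2500 > facts.txt          # 2,500 stand-in records
split -l 1000 facts.txt         # -> xaa (1000), xab (1000), xac (500)
cat x?? | wc -l                 # total across chunks should match the input
```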

2. Use psql to run a script that copies each file into a temp table, inserts the rows into the fact table, and clears the temp table before the next file:

    CREATE TEMP TABLE temp_fact AS SELECT * FROM fact_table WITH NO DATA;
    \copy temp_fact from 'xaa'
    INSERT INTO fact_table SELECT * FROM temp_fact ON CONFLICT DO NOTHING;
    TRUNCATE temp_fact;
    \copy temp_fact from 'xab'
    INSERT INTO fact_table SELECT * FROM temp_fact ON CONFLICT DO NOTHING;
    TRUNCATE temp_fact;

3. We dropped the foreign key constraints on the fact table before doing the loads, but retained a unique index that was required to detect duplicate records during the inserts. It took 8-12 hours to load all the data files using this approach, and maybe another hour to recreate the foreign key constraints once the loads had completed.
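With 1600 files, the per-file copy/insert/truncate rounds are better generated than typed. A hedged sketch of a driver script (hostname, database, and user are hypothetical):

```shell
# Hypothetical driver: emit one \copy + INSERT + TRUNCATE round per split
# file, then run the whole thing in a single psql session
{
  echo "CREATE TEMP TABLE temp_fact AS SELECT * FROM fact_table WITH NO DATA;"
  for f in $(ls x* 2>/dev/null); do
    echo "\\copy temp_fact from '$f'"
    echo "INSERT INTO fact_table SELECT * FROM temp_fact ON CONFLICT DO NOTHING;"
    echo "TRUNCATE temp_fact;"
  done
} > load_facts.sql
# nohup psql -h new-db.example.com -U loader -d warehouse -f load_facts.sql &
```

Running it under nohup, with the password in .pgpass, means the whole 8-12 hour load can survive a dropped SSH session.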