Thursday, September 19, 2019

How to bulk load 1.6 billion records into AWS RDS Postgres

The nightmare scenario.  We had a star schema on AWS Aurora Postgres (compatible with Postgres 9.6).  I designed the dimension tables with 32-bit autoincrement keys (the Postgres SERIAL data type).  Everything was fine for a few months.  Then the data loads started failing: the sequences behind the 32-bit autoincrement keys had hit their maximum values.

How could this have happened?  The dimension tables all had record counts under 100,000.  It turns out that we loaded the dimension tables using simple INSERT statements:
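The statements followed this pattern (the table and column names here are illustrative, not our actual schema):

```sql
-- A dimension table keyed by a 32-bit sequence (SERIAL tops out at 2,147,483,647)
CREATE TABLE dim_customer (
    customer_key SERIAL PRIMARY KEY,
    email        TEXT NOT NULL UNIQUE
);

-- Batch insert; rows that already exist are silently skipped
INSERT INTO dim_customer (email)
VALUES ('a@example.com'), ('b@example.com')
ON CONFLICT (email) DO NOTHING;
```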


The 'ON CONFLICT DO NOTHING' clause is a cool Postgres feature.  It means that we don't have to select before inserting; we just do batch inserts and let the database ignore any record that is already there.  The problem is that when you have an insert conflict, YOU STILL INCREMENT THE SEQUENCE (nextval() runs for every candidate row before the conflict is detected), which is how we ran out of numbers.

Our fact table had about 1.6 billion records.  We looked at changing the data types in place, but it would have taken too long.  So we made a new database with a modified version of the star schema that uses 64-bit keys (the BIGSERIAL data type) and started migrating the data from the old database to the new one.  We were able to use pg_dump / pg_restore to migrate the dimension tables.  But we ran into a number of problems with the big fact table.

We initially tried to use ETL tools to migrate the facts in hourly and daily tranches.  These loads would start out fast.  I would walk away optimistically, only to come back a few hours later to discover that the data loads had slowed to a crawl.  This was caused by RDS IOPS throttling.  Unless you are willing to pay for provisioned IOPS, RDS gives you a baseline of 3 IOPS per GB of provisioned storage, with bursts of up to 3000 IOPS for up to 30 minutes at a time, and throttles your I/O once you exhaust your burst credits.  We had provisioned the DB with 100GB (a baseline of only 300 IOPS), intending to let it automatically grow.  But the IOPS throttle means you want to provision 1000GB at the outset to get a sustained 3000 IOPS allowance.  (More info on this can be found here under 'I/O Credits and Burst Performance'.)

Then we ran into timeout problems with pg_dump / pg_restore.  Long-running jobs were vulnerable to network disconnects, so we had to set up .pgpass files so that we could run pg_dump with nohup (without needing to type a password manually).  We ran pg_dump in directory format against the single fact table.  This got us a large compressed tab-delimited file with the 1.6 billion facts.  We tried to load this into the new schema using pg_restore, but it would consistently fail after running for a few hours.
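The setup looked roughly like this (the host, database, user, password, and file names are placeholders, not our real values):

```shell
# .pgpass lets pg_dump authenticate non-interactively.
# Line format: hostname:port:database:username:password
echo 'olddb.example.com:5432:stardb:loader:s3cret' >> ~/.pgpass
chmod 600 ~/.pgpass   # libpq ignores the file unless its permissions are 0600

# Dump just the fact table in directory format (-Fd); nohup keeps the job
# running after a network disconnect, and the log captures any errors.
nohup pg_dump -Fd -h olddb.example.com -U loader -t fact_table \
    -f fact_dump stardb > pg_dump.log 2>&1 &
```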

So after some intensive web searches, we came up with the following solution:

1. Split the big file into smaller files with a million records each:

    gunzip -c 3942.dat.gz | split -l 1000000

    This gave us about 1600 files to load: xaa, xab, xac, etc.
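As a toy illustration of how split names its output (ten records here instead of 1.6 billion; the sample file name is made up):

```shell
# Miniature version of the real pipeline: compress ten records, then split
# the decompressed stream into 3-line chunks with split's default 'x' prefix.
seq 10 | gzip > sample.dat.gz
gunzip -c sample.dat.gz | split -l 3 -
ls xa?
```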

2. Use psql to run the following script:

    CREATE TEMP TABLE temp_fact AS SELECT * FROM fact_table WITH NO DATA;
    \copy temp_fact from 'xaa'
    INSERT INTO fact_table SELECT * FROM temp_fact ON CONFLICT DO NOTHING;
    TRUNCATE temp_fact;
    \copy temp_fact from 'xab'
    INSERT INTO fact_table SELECT * FROM temp_fact ON CONFLICT DO NOTHING;
    TRUNCATE temp_fact;

3.  We dropped the foreign key constraints on the fact table before doing the loads, but retained a unique index that was required to detect duplicate records during the inserts.  It took 8-12 hours to load all the data files using this approach, and maybe another hour to re-create the foreign key constraints once the loads had completed.
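The drop / re-create dance, sketched with illustrative table and constraint names:

```sql
-- Before the load: drop the FK so inserts skip per-row referential checks.
-- Keep the unique index; it is what detects duplicates during the inserts.
ALTER TABLE fact_table DROP CONSTRAINT fact_customer_fkey;

-- ... run the ~1600 tranche loads here ...

-- After the load: re-adding the FK validates all rows in a single pass,
-- which is far cheaper than checking row by row during the inserts.
ALTER TABLE fact_table
    ADD CONSTRAINT fact_customer_fkey
    FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key);
```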

Tuesday, September 19, 2017

Gradle 4: Copy War to Directory

This took me 90 minutes to figure out and the internet was no help.

I was trying to copy the output of my gradle war project to a Tomcat webapps directory so I could run it in the debugger.

I defined "tomcathome" in gradle.properties. So far so good.

Then after reading a bunch of confusing stuff about how to define gradle tasks, I came up with this:

    task deployTomcat (dependsOn: 'war') {
        doLast {
            mkdir "${tomcathome}/webapps/${project.name}"
            copy {
                from zipTree(war)
                into "${tomcathome}/webapps/${project.name}"
            }
        }
    }
This task got an error:
Execution failed for task ':deployTomcat'.
> Cannot convert the provided notation to a File or URI: task ':war'.
  The following types/formats are supported:
    - A String or CharSequence path, for example 'src/main/java' or '/usr/include'.
    - A String or CharSequence URI, for example 'file:/usr/include'.
    - A File instance.
    - A Path instance.
    - A URI or URL instance.
So it doesn't like 'zipTree(war)'.  I modified the code to print 'war' and 'war.outputs'.  'war.outputs' is an instance of org.gradle.api.internal.tasks.DefaultTaskOutputs.

So I googled this, found the source on GitHub, and discovered that DefaultTaskOutputs has a 'files' property of type FileCollection, and that FileCollection has a property called 'singleFile'.  So I added more prints and found that war.outputs.files.singleFile gives me the name of the war file.

So this is the working task:

task deployTomcat (dependsOn: 'war') {
    doLast {
        mkdir "${tomcathome}/webapps/${project.name}"
        copy {
            from zipTree(war.outputs.files.singleFile)
            into "${tomcathome}/webapps/${project.name}"
        }
    }
}

Friday, February 3, 2017

Solution for missing /vagrant file system.

If you are messing around with Vagrant and VirtualBox, you will sometimes end up with a vm where the /vagrant shared file system does not mount properly.  Vagrant complains that the vboxsf kernel module is not loaded properly and gives you a mount command to run manually.  When you try to mount it manually with
sudo mount -t vboxsf -o uid=500,gid=500 vagrant /vagrant
You get this error:
/sbin/mount.vboxsf: mounting failed with the error: No such device
According to this page, you have to reinstall the VBox guest additions with these commands:
cd /opt/VBoxGuestAdditions-*/init
sudo ./vboxadd setup
However, for a generic box, this fails because gcc and the kernel sources are not installed.  So when you run into this problem on a CentOS 6 box (specifically, a vm based on the bento/centos-6.8 vagrant box), here is the exact sequence of commands that fixes the problem:
cd /opt/VBoxGuestAdditions-*/init
sudo yum install gcc make kernel-devel
sudo ./vboxadd setup
Then rerun the special mount command to mount the /vagrant file system.

Thursday, September 1, 2016

Changing Custom Field Types in Redmine

At Brick Street, we use Redmine to manage engineering and product management. We host it ourselves on an AWS instance because we're cheap and because we want to integrate it with our other line-of-business systems. We recently upgraded from Redmine 2.1 to 3.3. In Redmine 2.1, we had some custom fields on Issue records that held text to include in our release notes. (We have a program that scans subversion logs for Issues, and then generates release notes by pulling the Redmine text for the issues fixed in the branch.)

In Redmine 2.1, these custom fields are 'text' type, which shows up in the Redmine UI as a single line (an HTML input element). In Redmine 3.3, they added a 'Long Text' type which shows up in the UI as a resizable multiline control (an HTML textarea element).

We wanted to convert the 'text' custom fields to 'long text'. So I played with our preproduction instance and reverse-engineered the schema. The custom field definitions are in the CUSTOM_FIELDS table:

MySQL> desc custom_fields;
| Field           | Type         | Null | Key | Default | Extra          |
| id              | int(11)      | NO   | PRI | NULL    | auto_increment |
| type            | varchar(30)  | NO   |     |         |                |
| name            | varchar(30)  | NO   |     |         |                |
| field_format    | varchar(30)  | NO   |     |         |                |
| possible_values | mediumtext   | YES  |     | NULL    |                |
| regexp          | varchar(255) | YES  |     |         |                |
| min_length      | int(11)      | YES  |     | NULL    |                |
| max_length      | int(11)      | YES  |     | NULL    |                |
| is_required     | tinyint(1)   | NO   |     | 0       |                |
| is_for_all      | tinyint(1)   | NO   |     | 0       |                |
| is_filter       | tinyint(1)   | NO   |     | 0       |                |
| position        | int(11)      | YES  |     | NULL    |                |
| searchable      | tinyint(1)   | YES  |     | 0       |                |
| default_value   | mediumtext   | YES  |     | NULL    |                |
| editable        | tinyint(1)   | YES  |     | 1       |                |
| visible         | tinyint(1)   | NO   |     | 1       |                |
| multiple        | tinyint(1)   | YES  |     | 0       |                |
| format_store    | text         | YES  |     | NULL    |                |
| description     | text         | YES  |     | NULL    |                |
19 rows in set (0.00 sec)

The field_format value for text fields is 'string'. The field_format value for long text fields is 'text'.

The actual values of the custom fields are stored in the CUSTOM_VALUES table:

MySQL> desc custom_values;
| Field           | Type        | Null | Key | Default | Extra          |
| id              | int(11)     | NO   | PRI | NULL    | auto_increment |
| customized_type | varchar(30) | NO   | MUL |         |                |
| customized_id   | int(11)     | NO   |     | 0       |                |
| custom_field_id | int(11)     | NO   | MUL | 0       |                |
| value           | mediumtext  | YES  |     | NULL    |                |
5 rows in set (0.01 sec)

The customized_id field is a foreign key to the record in the ISSUES table. The custom_field_id field is a foreign key to the CUSTOM_FIELDS record. The value field stores the custom field value. This query will show you all the custom field values for Issue 2001:

select * from custom_values where customized_id = 2001;

So it turns out that you can change the type of a custom field value by updating the custom_field_id field of the CUSTOM_VALUES record. In our Redmine, we changed the text fields to long text by following this procedure:
  1. Define a new long text field in the UI.
  2. Look in the CUSTOM_FIELDS table to get the id of the text field (e.g. id 5) and the new long text field (e.g. id 15).
  3. Run this update query to change types of all the custom field values:
    update custom_values set custom_field_id = 15 where custom_field_id = 5;

  4. Delete the custom field definition of the old field (id 5) in the UI.
This approach works great for Text custom fields. List custom fields are harder because the custom field definition includes a list of possible values (in the CUSTOM_FIELDS.possible_values column). We have a number of multiselect fields that contain lists of release versions. My next task is to convert these to the new Version custom field type.

Monday, June 24, 2013

Why @OntarioTCU should investigate @BitmakerLabs

BitmakerLabs is a private training school that does a 9 week, $9000 intensive course in Ruby on Rails and related technologies. In Ontario, so-called Private Career Colleges are regulated by the Ontario Ministry of Training, Colleges, and Universities (MTCU) under the Private Career Colleges Act, 2005. This law is basically a consumer protection law that says the MTCU must review and approve Private Career Colleges before they can operate in Ontario. Today, Bitmaker announced they are closing (temporarily) in response to an investigation from the MTCU, and that they will be registering as a Private Career College in order to comply with the 2005 law.

Almost everyone in Toronto with a Twitter account seems to be enraged by this blatant attempt to enforce consumer protection laws. But consider this:

BitmakerLabs claims their 9 week program will turn a non-programmer into a web developer. As someone who writes checks to companies who might hire BitmakerLabs' graduates, I am skeptical, to say the least. Apart from my skepticism, however, I wish BitmakerLabs well; if they are doing good work, they should have no problem registering with MTCU. But it's kind of silly to blame MTCU for doing their job.

Furthermore, compare this program with the Humber College Web Development Program, which is full time for three semesters for $8662 and no HST. Or the Udacity / Georgia Tech / AT&T online Computer Science Masters Degree for $6k, which promises to offer a fully accredited master's degree. Your money may be better spent on longer programs that provide more time to internalize the skill set.

The background here is that the traditional model of higher education (low volumes at extremely high costs) is ripe for disruption. So we are going to be seeing a lot of interesting experiments in the next few years; experiments like BitmakerLabs, MOOCs, and a general disaggregation of the teaching, evaluation, and certification functions that have traditionally been provided by colleges and universities. What this means for consumers is that they will have to be a lot more careful until this disruption sorts itself out.

  1. Interesting background info on rationale and prior enforcement of the Private Career Colleges Act, 2005: Ontario Ombudsman Report.
  2. The United States has similar laws; e.g. NH Regs for Career Schools

Friday, August 10, 2012

Influitive and Investing in Toronto Startups

Congratulations to Influitive for closing their $3.75 million round. I joined the Influitive board over a year ago during a convertible debt round; it's been the easiest board seat I've ever held since Mark Organ is such a great CEO. He does all the work and I come to the board meetings and eat sandwiches.

It was also my first investment in a Canadian startup, which has been an interesting experience. If you are an American reader, you may not be aware that Canada is this foreign country on our northern border and, as a foreign country, has a separate currency, banking system, tax regime, and media establishment. Toronto, in particular, is like the New York, Los Angeles, and San Francisco of the country since it is the center for the Canadian financial, media, and technology sectors. So there is a lot going on here, maybe more than in jolly old New England.

Toronto, and Canada in general, is actually a good place to start a company or for established companies to locate R&D and customer-facing business functions. NAFTA has eliminated most cross-border issues and I've found that the talent pool is great, they're in the same time zone, and labor costs are lower than in the US, primarily because employers don't pay for health care. The Canadian government also has a number of refundable R&D tax credits, and Canadian-controlled startups can usually leverage this to get an extra few months of cash at a critical time.