Nelson's log

inserting lots of data into a remote Postgres efficiently

Update 2017: psycopg2 now has fast execution helpers (execute_batch and execute_values in psycopg2.extras). Not sure why they didn’t just improve executemany.
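For reference, a minimal untested sketch of those two helpers (either one alone does the job), assuming psycopg2 >= 2.7 and the same three-column data table and myData list used in the snippets below:

import psycopg2
import psycopg2.extras

conn = psycopg2.connect('dbname=test')  # connection details are placeholders
cur = conn.cursor()

# execute_values expands the single %s into a comma-separated list of value
# tuples (much like the mogrify hack below) and sends one big INSERT.
psycopg2.extras.execute_values(
    cur, 'insert into data values %s', myData, page_size=1000)

# execute_batch instead packs many INSERT statements into each network round trip.
psycopg2.extras.execute_batch(
    cur, 'insert into data values (%s,%s,%s)', myData, page_size=1000)

conn.commit()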

I’m using a Postgres server in AWS in the us-east region. My database client is running in California, so it’s about 150ms away. I don’t expect this to be fast, but I don’t expect it to be super-slow either. Inserting 700 rows in a single transaction was taking 60 seconds. That’s absurd! (Rows are ~100 bytes each; this is not a lot of data.)

Thanks to StackExchange I found a solution. Here’s some untested code that illustrates the idea:

# Clearly slow to insert one row at a time
for row in myData:
    cur.execute('insert into data values (%s,%s,%s)', row)

# Should be much faster, but isn't
cur.executemany('insert into data values (%s,%s,%s)', myData)

# This hack is much faster
dataText = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in myData)
cur.execute('insert into data values ' + dataText)

That funky last idiom expands all of myData into one giant INSERT statement with a long list of VALUES tuples, so the whole thing goes over the wire in a single round trip. For my use case, that 60 second operation went down to 1.5 seconds. This hack works, but you’re going to bump into a query size limit at some point; a sketch of one way to batch it follows.
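One way to avoid that limit (again an untested sketch, same assumptions as the snippets above) is to send the expanded INSERT in chunks:

# Send the giant INSERT in chunks so no single statement gets too large,
# while still amortizing the network round trips.
CHUNK = 5000  # rows per statement; tune to taste
for start in range(0, len(myData), CHUNK):
    chunk = myData[start:start + CHUNK]
    dataText = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in chunk)
    cur.execute('insert into data values ' + dataText)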

Note that mogrify() isn’t doing anything clever; it just does safe parameter expansion into a SQL string to avoid Little Bobby Tables. (There’s a Unicode error lurking here; I switched to Python 2 just to make this work.) Also note that transactions don’t seem to be to blame here: psycopg2 implicitly wraps everything on a connection in one transaction, and nothing commits until you call commit().
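If you do want this on Python 3, I believe the snag is that mogrify() returns bytes there; decoding each piece before joining should do it. An untested sketch, assuming a UTF-8 connection encoding:

# On Python 3, mogrify() returns bytes; decode before joining into a str.
dataText = ','.join(
    cur.mogrify('(%s,%s,%s)', row).decode('utf-8') for row in myData)
cur.execute('insert into data values ' + dataText)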

I think what’s going on is that psycopg2 doesn’t optimize the implementation of executemany(); internally it’s just doing something like a loop calling execute() for each row. That would mean one network round trip per row, which over a link this slow adds up to roughly the minute I observed. I don’t have proof of that, but my research turned up other people saying that was the problem. (See here)
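If you want to check the difference on your own data, a crude timing harness (hypothetical, reusing conn, cur, and myData from above) is enough to see it:

import time

def timed(label, fn):
    start = time.time()
    fn()
    conn.rollback()  # throw the rows away so each variant starts from the same state
    print(label, time.time() - start)

timed('executemany',
      lambda: cur.executemany('insert into data values (%s,%s,%s)', myData))
# The mogrify hack, as written above (Python 2 string semantics).
timed('mogrify hack',
      lambda: cur.execute('insert into data values ' +
                          ','.join(cur.mogrify('(%s,%s,%s)', r) for r in myData)))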

(As an aside, what is it with database adapters that makes them so universally bad across languages? psycopg2 isn’t terrible, but it’s hardly joyful Python software. At least it’s better than the mess we had with MySQL for so many years; back in the Java days we had a terrible time talking to MySQL too.)