inserting lots of data into a remote Postgres efficiently

I’m using a Postgres server in AWS’ us-east region. My database client is running in California, so it’s about 150ms away over the network. I do not expect this to be fast. However I don’t expect it to be super-slow either. Inserting 700 rows in a single transaction was taking 60 seconds. That’s absurd! (Rows are ~100 bytes each, so this is not a lot of data. 60 seconds for 700 rows works out to ~85ms per row, which is suspiciously close to one network round trip per row.)

Thanks to StackExchange I found a solution. Here’s some untested code that illustrates the idea:

# Clearly slow to insert one row at a time
for row in myData:
    cur.execute('insert into data values (%s,%s,%s)', row)

# Should be much faster, but isn't
cur.executemany('insert into data values (%s,%s,%s)', myData)

# This hack is much faster
# (Python 2: on Python 3, mogrify() returns bytes, so you'd need to
# decode each piece before joining)
dataText = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in myData)
cur.execute('insert into data values ' + dataText)

That funky last idiom basically expands all of myData into one giant SQL string with multiple values. For my use case, that 60 second operation went down to 1.5 seconds. This hack works, but you’re going to bump into a query size limit at some point.
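To stay under that size limit you can chunk the data and emit one multi-VALUES statement per chunk. Here's a sketch of the idea that runs without a database connection; `format_row()` is a made-up stand-in for `cur.mogrify()` with naive quoting for illustration only (use mogrify() for real escaping), and `insert_statements()` is likewise hypothetical:

```python
def format_row(row):
    # Stand-in for cur.mogrify('(%s,%s,%s)', row). Naive quoting,
    # illustration only -- mogrify() does the real escaping.
    def quote(v):
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"
        return str(v)
    return '(' + ','.join(quote(v) for v in row) + ')'

def insert_statements(table, rows, chunk_size=1000):
    # Yield one multi-row INSERT statement per chunk of rows, so no
    # single statement grows without bound.
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        values = ','.join(format_row(r) for r in chunk)
        yield 'insert into {} values {}'.format(table, values)

rows = [(1, 'a', 2.5), (2, "o'brien", 3.5), (3, 'c', 4.5)]
for stmt in insert_statements('data', rows, chunk_size=2):
    print(stmt)
```

Each chunk still costs one round trip, but that's len(rows)/chunk_size round trips instead of one per row.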

Note that mogrify() isn’t doing anything clever; it’s just doing safe expansion of SQL query strings to avoid Little Bobby Tables. (There’s a Unicode error lurking here; I switched to Python 2 just to make this work.) Also note that transactions don’t seem to be to blame here. psycopg2 by default opens a single transaction per connection; nothing commits until you call commit().

I think what’s going on is that psycopg2 doesn’t optimize the implementation of executemany(). Internally it’s just doing something like a loop calling execute() for each row. I don’t have proof of that, but my research turned up other people saying that was the problem. (See here)
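A minimal sketch of that suspected behavior, with a made-up `FakeCursor` standing in for a real one so it runs without a server (`naive_executemany` is my name for the suspected internal loop, not psycopg2's actual code):

```python
class FakeCursor:
    """Stand-in cursor that just records each execute() call."""
    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))

def naive_executemany(cur, sql, param_seq):
    # The suspected implementation: one execute() -- and over a remote
    # link, one network round trip -- per row.
    for params in param_seq:
        cur.execute(sql, params)

cur = FakeCursor()
naive_executemany(cur, 'insert into data values (%s,%s,%s)',
                  [(1, 'a', 2.5), (2, 'b', 3.5), (3, 'c', 4.5)])
print(len(cur.calls))  # one call per row
```

If that's what's happening, the 150ms round trip gets paid once per row, which matches the timings above.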

(As an aside, what is it with database adapters for various languages that make them so universally bad? psycopg2 isn’t terrible, but it’s hardly joyful Python software. At least it’s better than the mess we had with mysql for so many years. Back in the Java days we used to have a terrible time trying to talk to MySQL as well.)

3 thoughts on “inserting lots of data into a remote Postgres efficiently”

  1. The driver does exactly what you suspect it does but I’m not sure it’s at fault. execute and executemany are features of the python db API, not underlying intrinsic features of relational database protocols. execute says ‘take an exact, arbitrary SQL statement, prepare it, typeconvert/bind params, run, return results’. executemany is ‘take an arbitrary SQL statement, prepare it, and then run it repeatedly for all the typeconverted/bound params you have’. The idea, presumably, is that this saves you crossing the runtime/native boundary multiple times and repeating pricey prepares when you have some statement you want to run many times with different data.

    What you are asking the API to do is to somehow know your arbitrary statement is an insert and then transform the statement you asked it to execute many times into some other equivalent single statement and execute it exactly once. When described this way, the reply on the driver mailing list of ‘that’s not really our job, go install sqlalchemy/whatnot’ seems a lot more reasonable. The only way this could possibly be implemented in the driver is for the driver to parse your sql. Your solution is not a hack, it’s the right thing, short of using a SQL-construction library. Postgres does not have a hardcoded query length limit, so you’re home free.

    I believe a big reason db driver apis kind of blow is that, at the baseline level, their functionality must be ‘take arbitrary SQL string and fill in the blanks’. Raw SQL, for its various merits, does not make for a good API in most languages. What’s worse is that the ‘fill in the blanks’ part is a complicated, finicky morass of often not-quite-accurate, not-entirely-bijectively-mapped marshalings, encodings, type conversions. I don’t mean object-relational mapping stuff but basic things like – no typical language has a type that usefully maps to something SQL-trivial like, say, VARCHAR(20).

    I sometimes wonder whether Mongo’s (and some other documenty/JSONy NoSQL dbs) early surge of popularity is in part attributable to the removal of all that – there are no types beside vectors, maps, strings, integers that fit in a 64 bit float and floats whose precision you don’t care about too much. Want something else? Deal with it in your lovingly hand-crafted application logic. You lose a lot of other things but it’s way simpler and works well enough in domains people care about.

  2. That’s a great perspective, Peter, thank you. I just learned so long ago in MySQL that executemany() was more efficient that I assumed all DB drivers worked that way. I wonder if the C client library API to MySQL has a call like I want and Postgres doesn’t? Hmm.

    I agree that the database driver mapping problem is complicated. OTOH it’s about 80% of the library’s job, and it’s a pretty well defined serialization problem, so it does seem solvable. Also, identifying that a query is an INSERT and optimizing its implementation can’t be too hard, can it? Maybe it is in the general case. Anyway, the confounding thing is most of these database drivers are using crufty C libraries for access, not a clearly defined wire protocol. Definitely agree that the new NoSQL stuff is approaching it from a clean new perspective, often with HTTP APIs. Which have their own problems but are at least tractable.

    I’ve really enjoyed using sqlite3 in Python, btw. sqlite only has a very few data types. Also it’s very apparent how simple the system is, how directly you’re manipulating a C library and a file. No confusing servers and concurrent connections and stuff! The Python API wrapper is also pretty well done. Overall it promises less and delivers it well.
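    For instance, the whole sqlite3 experience fits in a few lines (a sketch in the spirit of the snippets above; the table and data are made up):

```python
import sqlite3

# sqlite3 ships with Python: no server, just a C library and a file
# (or, here, an in-memory database).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table data (a integer, b text, c real)')

# executemany() is fine here: there's no network round trip to pay for.
rows = [(1, 'a', 2.5), (2, 'b', 3.5), (3, 'c', 4.5)]
cur.executemany('insert into data values (?,?,?)', rows)
conn.commit()

cur.execute('select count(*) from data')
print(cur.fetchone()[0])  # 3
```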

  3. Out-of-my-butt guess, but I suspect the python API might have been strongly informed/influenced by the vagaries and details of MySQL. Notice that it doesn’t actually expose ‘prepare’ in any way, a staple of relational DB APIs. I’m not sure this is actually true without digging about but it neatly matches both what you and I are saying.

    In the bad(?) old days it was not unusual to write native proxies for higher-level language data access that could leverage what the crufty db-specific C API could do, up to and including SQL introspection. I think these might have actually behaved much more similarly to your expectations. File under regression!

    I had a blurb about sqlite3 that said nearly the same thing but took it out figuring it made it harder to politely describe ‘rant’ as ‘perspective’.
