Rendering 10M+ points from S3 to a map

Michal Migurski is working on a new project, rendering previews of OpenAddresses datasets in slippy maps. We’re using S3 to store stuff and trying for stateless servers. He just described the architecture plan to me, and I’m writing it up here. He’s doing all the work, currently in a test git repo. The fun part is a FUSE file system for MBTiles on S3. Read on.

OpenAddresses is a collection of address data points: big CSV files full of latitude, longitude, street name. We collect the data from government sources. Every time someone finds a new government source they create a pull request like this one for Luzern with metadata for the source: stuff like the URL and how to parse it into our format. Mike has some GitHub integration hooks that look at the pull request and render a preview image of the data file. It looks cool, but it’s also a useful debugging tool. We’d like to transform that static preview image into a slippy map. Here’s how it’s going to work.

  1. GitHub integration hook: use Tippecanoe to boil down the address points into a tiled dataset. Write the resulting MBTiles file to S3 somewhere. Note that this hook and its processing run on a transient server that disappears once the processing is done.
  2. Web page: slippy map using Leaflet or the like to request OpenAddresses tiles and render the points in the browser.
  3. Tile server: persistent Flask server that uses TileStache to serve the MBTiles file to the web browser.
  4. Tile server: a FUSE filesystem that mounts the S3 MBTiles file and provides it to TileStache.

It’s that last step that’s particularly clever. Serving an MBTiles file locally is easy. But what do you do if your MBTiles file is on S3 somewhere? It might be quite large, 10 million points or 100 megabytes. But each map view only needs like 16 tiles or a megabyte of data. You’d rather not have to copy the whole thing from S3 first.
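To make the "a map view only needs a handful of tiles" point concrete, here is a minimal sketch of the standard slippy map tile arithmetic. The function names `lonlat_to_tile` and `tiles_for_view` are made up for this post; they are not part of Mike's code, Leaflet, or TileStache.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing a WGS84 point, XYZ scheme (y=0 at the top)."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still land on a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tiles_for_view(west, south, east, north, zoom):
    """List the (zoom, x, y) tiles that cover a bounding box."""
    x0, y0 = lonlat_to_tile(west, north, zoom)  # top-left corner
    x1, y1 = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [(zoom, x, y)
            for x in range(x0, x1 + 1)
            for y in range(y0, y1 + 1)]
```

A browser window a few tiles wide and a few tiles tall gives you a list of about that size no matter how big the whole dataset is, which is why copying the full file from S3 for every view is wasteful.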

So instead of caching the whole file locally, Mike has written a simple read-only FUSE driver to remotely mount the S3 file. To normal Linux processes it just looks like a file, but behind the scenes each read request is turned into an HTTP Range request that fetches just the needed bytes and returns them.
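Here is a rough sketch of the core idea, turning a read at (offset, length) into an HTTP Range request. This is not Mike's driver and it leaves out all of the FUSE plumbing; `range_header`, `http_fetch`, and `RangeReader` are names I invented for illustration, and the URL is assumed to be something S3 will serve to you directly.

```python
import urllib.request

def range_header(offset, length):
    """Build a Range header value for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    return "bytes={}-{}".format(offset, offset + length - 1)

def http_fetch(url, offset, length):
    """Fetch one slice of a remote object with a single ranged GET."""
    req = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

class RangeReader:
    """Read-only view of a remote file; each read becomes one ranged fetch."""

    def __init__(self, url, fetch=http_fetch):
        self.url = url
        self.fetch = fetch  # injectable so it can be faked without a network

    def read(self, offset, length):
        if length <= 0:
            return b""
        return self.fetch(self.url, offset, length)
```

A FUSE read handler would presumably call something like `RangeReader.read` with the offset and size the kernel hands it. A real driver would also want some caching, since SQLite tends to reread the same pages.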

Why jump through the hoops of FUSE? The challenge here is that MBTiles files are actually SQLite databases, and SQLite really wants to open an actual file down in the depths of its highly optimized C code. So we give it a file.
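For a sense of what "MBTiles is SQLite" means in practice, this is roughly what pulling one tile out looks like with Python's built-in `sqlite3` module. The `tiles` table and its columns come from the MBTiles spec; the `read_tile` helper is my own sketch, not what TileStache does internally.

```python
import sqlite3

def read_tile(mbtiles_path, zoom, x, y):
    """Return the tile blob for XYZ coordinates, or None if it is missing."""
    # MBTiles stores rows in TMS order, where row 0 is at the bottom,
    # so flip the y coordinate that slippy maps use.
    tms_row = (2 ** zoom - 1) - y
    conn = sqlite3.connect(mbtiles_path)
    try:
        row = conn.execute(
            "SELECT tile_data FROM tiles "
            "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
            (zoom, x, tms_row),
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None
```

The `mbtiles_path` argument has to be a real path on disk, which is exactly the constraint the FUSE mount is there to satisfy.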

The big question here is performance. It seems to be OK on first testing! SQLite should be pretty efficient about reading data. I’m a bit more concerned about llfuse, the Python FUSE driver framework Mike is using. It seems to have a single global lock, so only one S3 read request can be active at a time, maybe per mounted MBTiles device. So this might not work so well; in practice multiple tile requests happen in parallel, even for a single user looking at a single slippy map. But we don’t imagine many users, so it may not be too bad.
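A back-of-the-envelope way to think about the lock: if reads are serialized, wall-clock time grows linearly with the number of reads. The sketch below is a toy model, not a measurement. The 50 ms latency and the one-read-per-tile simplification are both assumptions of mine (SQLite will usually need several page reads per tile), and `fetch_time` is a made-up helper.

```python
import math

def fetch_time(n_reads, latency_s, concurrency):
    """Rough wall-clock seconds when at most `concurrency` reads run at once."""
    return math.ceil(n_reads / concurrency) * latency_s

# Hypothetical numbers: 16 tiles per map view, one S3 read each, 50 ms per read.
serialized = fetch_time(16, 0.05, 1)  # global lock: one read at a time
parallel = fetch_time(16, 0.05, 4)    # if four reads could overlap
print(serialized, parallel)
```

Under those made-up numbers the serialized case takes four times as long as the four-way parallel one, which is the kind of gap that would be noticeable on a slippy map but tolerable with few users.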