Strava’s new heatmap, some observations

Strava just published a new map of all their data. 3 trillion GPS points turned into a heatmap. It’s lovely.

[Image: Strava’s global heatmap]

They wrote a fantastic writeup of how they made the map. All custom software. It’s not really a “heat map” in the sense I use the term, in that it lacks the heat diffusion that gives a true heat map its blobbiness. It’s more of a spatial histogram. That’s just a quibble, but it helps me keep this visualization straight from the new Mapbox GL JS heatmaps I’m playing with.

The smartest thing they do is adaptive normalization of the colormap to spatially local data. They describe it in detail, including the CDFs they calculate. Basically, bright yellow indicates the highest value in the local region you’re looking at, not the highest value in the whole world. This allows seldom-visited areas to still have compelling and readable visualizations. You can see this in effect if you zoom in to Alcatraz, which has relatively few visits; the brightness changes radically from z=15 to z=16. A trick worth stealing.
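
Here’s a minimal sketch of the idea in Python. This is my reconstruction, not Strava’s actual algorithm (they work with full CDFs): normalize each tile against a high percentile of its own nonzero counts, so the top of the colormap always means “locally busiest”.

import numpy as np

def normalize_tile(counts, q=99):
    """Map a tile's raw visit counts to [0, 1] against the tile's own
    distribution, so full brightness means 'locally maximal'."""
    positive = counts[counts > 0]
    if positive.size == 0:
        return np.zeros_like(counts, dtype=float)
    # A high local percentile as the ceiling keeps one outlier pixel
    # from washing out the rest of the tile.
    ceiling = max(np.percentile(positive, q), 1.0)
    return np.clip(counts / ceiling, 0.0, 1.0)

# A busy tile and a quiet tile each use the full dynamic range:
busy = np.random.poisson(500, (256, 256))
quiet = np.random.poisson(2, (256, 256))
print(normalize_tile(busy).max(), normalize_tile(quiet).max())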

I was also struck by how all the tracks look nicely snapped to roads. Mobile phone locations are never that accurate, and Strava deliberately fuzzes all points by 2 meters, so why are the roads so sharply defined? I think they simply have enough points that the average that comes through visually really is the actual road. Neat! You can see this in a hyper-zoom on the Golden Gate Bridge, where faint traces of tracks stray off the bridge but the main bridge path is sharply highlighted. Note also the little bump; that’s where you have to walk about 2m to the outside to avoid the tower. (I believe the two-tone line is because this is walking data; pedestrians tend to walk the east side.) A toy demonstration of the averaging effect is below.
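
A toy version of that intuition, assuming 2 m Gaussian noise around a road at x = 0: with enough fixes, the histogram peak lands right on the road.

import numpy as np

np.random.seed(0)
bins = np.arange(-10, 10.5, 0.5)   # half-meter histogram cells across the road
for n in (100, 1_000_000):
    # Simulate n GPS fixes around a road at x = 0 with ~2 m of noise
    fixes = np.random.normal(loc=0.0, scale=2.0, size=n)
    hist, edges = np.histogram(fixes, bins=bins)
    print(f"{n:>9} fixes: peak cell at x = {edges[hist.argmax()]:.1f} m")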

[Image: close-up of the Golden Gate Bridge on the heatmap]

 

Google Cloud Postgres connections

I’m using Google Cloud’s Postgres implementation. I thought I had it all working, except my App Engine server couldn’t connect to it in production. For dev work I’d just authorized my home IP address and connected directly to the database from my dev box. But that didn’t work in production.

Turns out there are complex docs for connecting to Google Cloud SQL. Here’s what I had to do when following those instructions. (This was made significantly more difficult by intermittent outages either at Google or in my network connection; lots of failures and timeouts.)

  1. Download cloud_sql_proxy to my dev box
  2. Get some application default credentials on my dev box
    gcloud auth application-default login
    This does some OAuth thing with my Google account. In retrospect I have no idea how I was able to gcloud deploy, etc before doing this.
  3. Authorize my home IP address to connect to the SQL instance via the cloud console.
  4. Run cloud_sql_proxy and have it listen on 127.0.0.1:6432 (port 5432 is busy with my own Postgres instance).
    ./cloud_sql_proxy -instances=projectname:us-central1:dbhost=tcp:6432
    Note the instance connection name has three parts separated by colons. “gcloud sql instances list” does not show you names in this format, but the console does in the “Instance details” page for that specific SQL server.
  5. Test I can connect to the database via local proxy
    psql postgresql://postgres:password@127.0.0.1:6432/postgres
    The proxy will log a line about the connection.
  6. Add the beta_settings / cloud_sql_instances stuff to my app.yaml. I have no idea what this does or if it’s necessary, but the docs said I should.
  7. In my dev environment, configure my code to connect to its database via 127.0.0.1:6432, so it uses the proxy. It’s not clear the proxy is necessary. I can connect directly to the database machine via IP address. I guess it’s nice in that the proxy is configured just with the instance connection string, not an IP address.
  8. In prod configure my code to connect to
    postgresql://postgres:{password}@/{dbName}?host=/cloudsql/projectname:us-central1:dbhost
    I have no idea what this ?host= argument does or why it works. Is this a standard thing or did Google fiddle with its Postgres drivers somehow? (A psycopg2 version of this connection is sketched just after this list.)
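
For what it’s worth, here’s the same production connection expressed with psycopg2 keyword arguments instead of a URL. A sketch with the same placeholder credentials; the key point is that a host value that’s a filesystem path makes libpq use a Unix domain socket, which turns out to be the punchline of the update below.

import psycopg2

# Placeholder credentials; the /cloudsql path is the instance
# connection name from step 8. A host that starts with "/" tells
# libpq to connect over a Unix domain socket in that directory.
conn = psycopg2.connect(
    dbname="postgres",
    user="postgres",
    password="password",
    host="/cloudsql/projectname:us-central1:dbhost",
)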

That’s more or less the process described in Google’s docs, but it took me an hour of trial and error to make it work.

Update: I asked Stack Overflow how Cloud SQL connections worked and got a very quick answer from a Google Cloud engineer named Vadim: “It’s a UNIX socket backed by Cloud SQL Proxy”.

In detail, /cloudsql/projectname:us-central1:dbhost is a Unix domain socket path, not a TCP endpoint. It’s actually a directory in the machine’s filesystem; libpq looks for a socket file named .s.PGSQL.5432 inside whatever directory you pass as host=. That ability to connect to Unix domain sockets with a special host= parameter is implemented in libpq, the Postgres connection library. The Cloud SQL Proxy code talks about how it can proxy from a Unix domain socket.

It’s a nice optimization; it’s silly to use a full TCP/IP stack to pass data between two processes on the same computer. I assume Linux has a highly optimized loopback driver, but even so, Unix domain sockets should be better. This discussion suggests 1.5-2x the throughput.
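
If you want to check on your own machine, here’s a crude benchmark sketch. Numbers will vary a lot by kernel and hardware, so treat it as illustrative only.

import os, socket, threading, time

CHUNK = b"x" * 65536
TOTAL = 200 * 1024 * 1024   # 200 MB per test

def drain(server):
    # Accept one connection and read until the sender closes it.
    conn, _ = server.accept()
    while conn.recv(65536):
        pass
    conn.close()

def throughput(family, address):
    server = socket.socket(family, socket.SOCK_STREAM)
    server.bind(address)
    server.listen(1)
    reader = threading.Thread(target=drain, args=(server,))
    reader.start()
    client = socket.socket(family, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    start = time.perf_counter()
    sent = 0
    while sent < TOTAL:
        sent += client.send(CHUNK)
    client.close()
    reader.join()
    server.close()
    return sent / (time.perf_counter() - start) / 1e6   # MB/s

print("TCP loopback: %.0f MB/s" % throughput(socket.AF_INET, ("127.0.0.1", 0)))
path = "/tmp/unix_socket_bench"   # AF_UNIX works on Linux and macOS
if os.path.exists(path):
    os.unlink(path)
print("Unix socket:  %.0f MB/s" % throughput(socket.AF_UNIX, path))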

simpler database access in Python

I continue to look for things that make working with databases simpler in Python. I keep doing projects where I end up dropping down to raw SQL, with awkward syntax for querying rows and bespoke one-off code for updating data in the database. The Python is messy but I like being able to use SQL in its full glory.

I’ve tried using SQLAlchemy a couple of times and have a hard time loving it. ORMs are a bad idea in general, and SQLAlchemy is very complicated and magic. For a schema with 20+ tables and a lot of complex interactions it makes sense, but for a small project where you just want something saner than the low-level DBAPI…

For my current project I’m trying plain psycopg2 and its NamedTupleCursor. It’s not awesome, but at least it gives you named access to rows from the DB. So it makes reading from the database a little nicer, but does nothing for writing. And it’s Postgres-only.
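
For reference, roughly what that looks like (connection string and table are placeholders):

import psycopg2
from psycopg2.extras import NamedTupleCursor

conn = psycopg2.connect("dbname=mydb", cursor_factory=NamedTupleCursor)
with conn.cursor() as cur:
    cur.execute("select id, name from users where id = %s", (1,))
    row = cur.fetchone()
    print(row.id, row.name)   # attribute access instead of row[0], row[1]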

Along these lines is the records library. It also gives sane access to query results, either as dicts or named tuples. Result sets also have nice export functions like “write to CSV” or “make into a Pandas DataFrame”. It doesn’t really do much for writing data, just querying, so it’s simple and small.
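
A sketch of what that looks like, going by the records README (URL and table are placeholders):

import records

db = records.Database("postgres://user:pass@localhost/mydb")
rows = db.query("select id, name from users")
for r in rows:
    print(r.id, r.name)       # attribute or dict-style access
print(rows.export("csv"))     # or "df" for a pandas DataFrame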

dataset is an effort to do even more. I love their tag line: “Because managing databases in Python should be as simple as reading and writing JSON files.” Amen! There’s a lot more going on in dataset than records. It’s more like an ORM-lite: it has support for creating schema, automatic upserts, complex queries without writing SQL, … It doesn’t have the export functions records has, but it does have a lot of thought for simple queries and updates. I’d have to actually use it for something to see if it’s sensible or if ORM-lite is an even worse idea than a full ORM.
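
A sketch going by the dataset docs (again, URL and schema are placeholders):

import dataset

db = dataset.connect("postgresql://user:pass@localhost/mydb")
table = db["locations"]                     # table created on first insert
table.insert(dict(user="alice", lat=37.8, lon=-122.4))
# upsert: update the row matching `user` if it exists, else insert
table.upsert(dict(user="alice", lat=37.9, lon=-122.5), ["user"])
for row in table.find(user="alice"):
    print(row["lat"], row["lon"])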

Both records and dataset depend on SQLAlchemy, which turned me off at first. But they’re both mostly using it as a way to support multiple database types. In theory the Python DBAPI standard should be enough, but in practice it sucks enough that I understand why they’d build on SQLAlchemy instead. With records you can mostly ignore that SQLAlchemy is there, but dataset exposes it more in the query system.

I wonder if some other programming environment has a much nicer database interface. The Microsoft world has LINQ, but that seems to have all the problems of other ORMs. I wonder if the R folks have a really nice way to work with databases, or maybe even Pandas.

I really like Python 3.6 format strings

Now that I have access to Python 3.6 I’m really enjoying the new PEP 498 format strings. Python has tried a bunch of different ways to format strings, from good ol’ %s notation to Template strings to str.format. They all seemed awkward to me because the thing you were substituting in was way far away from the string itself. Now you just embed a little Python code right in the string to be formatted, maybe with a bit of presentation like how many digits to display. Done!

It’s pretty powerful under the hood. Check out this example:

>>> from datetime import datetime
>>> today = datetime(year=2017, month=1, day=27)
>>> f"{today:%B %d, %Y}"  # using date format specifier
'January 27, 2017'

The format string is using datetime-specific codes like %Y. Behind the scenes it’s calling the format() built-in, which in turn can defer to the __format__() method on the object. In practice I suspect you can mostly not worry about this beyond basic floating point precision stuff.
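
To make that concrete, here’s a toy class with its own __format__(); the “p” format code is invented for this example:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __format__(self, spec):
        # f"{point:p}" and format(point, "p") both end up here
        if spec == "p":
            return f"({self.x}, {self.y})"
        return repr(self)

p = Point(3, 4)
print(f"{p:p}")          # -> (3, 4)
print(format(p, "p"))    # same thing, via the built-in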

I used to do this kind of thing a lot:

print ("{date} {asin} {title}".format(**locals()))

That’s gross for several reasons. Now I can just do

print(f"{date} {asin} {title}")

and be done with it.

GeoAlchemy and SQLAlchemy thoughts

I’m doing a new geo-project, a GPS tracker sort of thing. My database is very simple, just a Users table and a Locations table holding points. But I figured I’d front-load the work and use SQLAlchemy and Alembic and PostGIS and everything right now, not pay the pain of migrating later when I realize I need the fancier stuff.

Only SQLAlchemy doesn’t support PostGIS. GeoAlchemy2 does though. After an hour with it I can’t decide if it’s good or bad. It seems very simple.

In GeoAlchemy2 all geodata is represented in Python by just a few types: Geometry, Geography, Raster. You can’t do much with those things other than get the WKB or WKT for them. They’re not Points or Polygons or anything; there’s no Python code for working with the shapes. You can pass them back to PostGIS though, to call functions like ST_Area to have PostGIS do a calculation, at the cost of a database round trip. Alternatively, you can convert them to Shapely shapes and use that excellent GIS library to do the work in Python.

I guess that’s a reasonable compromise; GeoAlchemy2 is really just bridging to SQL, and they’re not going to implement their own GEOS equivalent. It also has spatial functions declared in a SQLAlchemy-friendly way, so you can use SQLAlchemy syntax to do your work. That’s kinda neat.
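
Here’s a sketch pulling those two ideas together: a minimal model (table and column names are just my guesses for this project), converting a value to Shapely for Python-side work, and a PostGIS call expressed in SQLAlchemy syntax.

from sqlalchemy import Column, Integer, func, select
from sqlalchemy.ext.declarative import declarative_base
from geoalchemy2 import Geometry
from geoalchemy2.shape import to_shape

Base = declarative_base()

class Location(Base):
    __tablename__ = "locations"   # hypothetical table for this project
    id = Column(Integer, primary_key=True)
    point = Column(Geometry("POINT", srid=4326))

# Python-side work: convert a fetched row's raw WKB to a Shapely Point.
#   pt = to_shape(row.point); print(pt.x, pt.y)

# Database-side work: build a query with SQLAlchemy syntax and let
# PostGIS do the math. Printing the statement shows the generated SQL.
stmt = select([func.ST_Distance(
    Location.point,
    func.ST_GeomFromText("POINT(-122.47 37.81)", 4326))])
print(stmt)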

Still I’m looking at a lot of third party Python code: SQLAlchemy, Shapely, GeoAlchemy, Alembic. All to manage a database with two tables and nothing more complex than a list of points. I may blow it all off and go to raw SQL. OTOH I may end up regretting that choice in a few months if I want to do more complex things.

 

Firefox Quantum

I’m trying Mozilla’s new web browser, Firefox Quantum (aka Firefox 57). It’s in beta now, will be the primary Firefox release in a few weeks.

There’s a lot going on with this release, but the headline feature is speed. It is super snappy and responsive, from simple tricks like responding to UI input faster, to deep things like a brand new CSS engine (written in Rust) and a new process model. Firefox had been single-process up to now, unlike Chrome. Now it’s multi-process, but still uses threads as a performance / resource-usage tradeoff. It uses less memory too, although since I never keep many tabs open that’s less of a problem for me.

One drawback of the new release is that old Firefox extensions no longer work; developers have to make significant changes for the new process model. Most of the Chrome extensions I care about also have Firefox versions that have already been updated: uBlock Origin, LastPass (a beta), Imagus, Pinboard+.

It’s been a while since I used Firefox. It’s still good. Some customizations that helped me:

  • Disable Ctrl-Mousewheel zooming by going to about:config and setting mousewheel.with_control.action to 0
  • Customize the layout by enabling “Drag Space” so you have a few pixels of window frame above the tabs.

python 3.6 on Ubuntu 16.04 LTS

I want to use Python 3.6 on my Ubuntu 16.04 LTS box, which only ships Python 3.5. It’s pretty simple following these instructions, but here are the details.

Step 1: install python 3.6 from Jonathon Fernyhough’s PPA.

add-apt-repository ppa:jonathonf/python-3.6
apt-get update
apt-get install python3.6 python3.6-venv python3.6-dev

Step 2: set up a 3.6 venv

python3.6 -m venv venv
source venv/bin/activate
pip install -U setuptools pip
python -V

Now when you activate the venv, you’ll automatically get Python 3.6 in that environment.

In addition to the jonathonf PPAs, there’s also the deadsnakes PPA, which has old versions of Python as well as new ones.