Goodreads lost all my data, writeup here. This hammered home the value of data export services for making my own copy of stuff. I had an export from last July so the Goodreads loss wasn’t entirely catastrophic. Some notes on other data export services.
Data export is obviously a competitive problem for services: if I can’t get my data out, I can’t easily migrate to a different service. So evil companies want to keep you locked in, and consequently the data export products are often not very good. Countering that are laws like the GDPR and CCPA, which give consumers the right to request their data. I’m finding most of the big services I care about do have a reasonable export service. A good hacker spirit also still pervades some tech companies; they build export because they think it’s the right thing for users.
Google Takeout is the gold standard. The Data Liberation Front did pioneering work many years ago in convincing a company to provide export tools. The tool now works very well. I’m particularly impressed there’s a way to schedule a new data export every two months; all the other services are one-offs.
Most of these data export services can be slow, with various nonsense explanations (“for your security”). I cynically suspect companies keep them slow to be awkward. But one insider I asked confirmed that they can also be slow for legitimate technical reasons. If you’ve got, say, a real-time messaging system, then scraping back through 15 years of message history can be pretty hard on the datastore. Backup requests are seldom time critical, so built-in throttles make some sense.
It’s a big problem to actually do anything useful with the data dump once you get it. Some of the services (Twitter) include little webapps to at least sorta browse the data. Some just give plain CSV or JSON dumps and you’re on your own. I was hoping to find a diversity of wonderful open source software that could consume these exports. But in retrospect of course not; building those is a lot of damn work.
There’s less organized community around data exports than I’d hoped to find. The IndieWeb world is a good starting point. Datasette is one general purpose data viewer; Dogsheep is a nice collection of tools to import various formats into SQLite for Datasette. HPI is another interesting general purpose viewer toolset. There’s also FreeYourStuff, which seems to include some scrapers for sites without export tools.
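Dogsheep’s importers are the polished version of this, but the core trick is small enough to sketch: read an export’s CSV and load it into a SQLite file that Datasette can serve. This is a minimal, hypothetical example — the file name, table name, and columns are stand-ins, not the real schema of any service’s export.

```python
import csv
import sqlite3

# Stand-in for a real export file; an actual dump would have many more
# columns and rows. Written here just so the example is self-contained.
with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Author", "My Rating"])
    writer.writerow(["Ficciones", "Jorge Luis Borges", "5"])

# Load the CSV into a SQLite table, one column per CSV header.
conn = sqlite3.connect("books.db")
with open("export.csv", newline="") as f:
    rows = list(csv.DictReader(f))
cols = list(rows[0].keys())

conn.execute("DROP TABLE IF EXISTS books")
conn.execute("CREATE TABLE books (%s)" % ", ".join('"%s"' % c for c in cols))
conn.executemany(
    "INSERT INTO books VALUES (%s)" % ", ".join("?" for _ in cols),
    [tuple(r[c] for c in cols) for r in rows],
)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM books").fetchone()[0])  # prints 1
```

From there, `datasette books.db` serves a browsable web UI over the table, which is roughly what the Dogsheep tools set up for each service’s format.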
Some of the exports reveal a surprising amount of extra data I didn’t know the sites had. Feedly has a list of everything I’ve read with dates, for instance, which is kinda neat! Facebook has a huge amount of stuff, some of it alarming; it would take a while to comb through everything they’ve collected on me. “apps_and_websites_off_of_facebook” seems to be surveillance capitalism in action.
A thread I need to follow up on is what the IndieWeb kids call POSSE: Publish (on your) Own Site, Syndicate Elsewhere. Make it so you own a copy of all the data as you create it. It makes a lot of sense; Cory Doctorow has written about how he does that. But it’s a lot of work.
I spent some time today pulling backups from all the sites I could think to check. I’ll update this list as I discover more. It’d be awesome to automate refreshing these backups every month or three, but given all the security and performance issues that seems difficult.
  2856  ./pinboard
 56696  ./23andme
601960  ./twitter
  4524  ./metafilter
  5772  ./feedly
744212  ./facebook
   316  ./goodreads
    24  ./letterboxd
  1356  ./wordpress
     8  ./reddit
     4  ./google
     8  ./yelp
     8  ./amazon