Kiwix: Wikipedia offline

Just a shout-out to Kiwix, the software plus database for reading Wikipedia offline. It’s incredibly helpful to have on your cell phone when you’re traveling and don’t have a data plan. It’s also nice just to have lightning-fast browsing of Wikipedia. They now have clients for iOS, Android, Windows, macOS, and Linux.

You have to download the data separately. Their ZIM format seems well thought out: tightly compressed and general purpose enough that you can also get other document collections like Stack Overflow posts, TED talks, etc. English Wikipedia is an enormous 35 GB, or 80 GB if you get images too. There are various Wikipedia subsets but no “top articles” option that I can see. (Be careful not to download a Simple English version by accident.)

The most interesting thing to me is that Kiwix is now partnered with the Wikimedia Foundation. That means a nice cash infusion, but it also gives Kiwix some legitimacy. I’m hoping it gives a second wind to the editorial project of producing a smaller, well-selected Wikipedia subset that’s still sufficient for most uses.

There are some other offline Wikipedia readers on the Apple App Store. I used Wiki Offline for years but the company that produced it seems to have disappeared (again). There’s also Minipedia, which does have a nice small subset you can download. I think all of these alternatives found it hard to produce regular updates with the most recent Wikipedia content. Kiwix seems to be doing quarterly releases.


x265 transcoding

I continue to re-encode video I get into x265. It’s much smaller, which helps both with storage and bandwidth. And it looks fine to my tin eye.

My script has evolved over time. This is what I do now:

ffmpeg -i "$1" -map 0 -c copy -c:v libx265 "${1%.*} x265.mkv"

In theory this leaves all the audio and subtitle streams alone (-map 0 keeps every stream and -c copy just copies them) and re-encodes only the video in x265 with default settings, since -c:v libx265 overrides the copy for video.

I was in a hurry today, so I got fancy with resizing the video to 360p and using a faster / less accurate encoding preset.

ffmpeg -i "$1" -map 0 -c copy -vf scale=-1:360 -c:v libx265 -preset faster "${1%.*} x265.mkv"

Quality definitely suffers for this, but I can encode at 3-4x playback speed on my CPU this way.
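
If I wanted to run this over a whole directory, a wrapper along these lines would do it. This is an untested sketch, not my actual script; it assumes ffmpeg is on the PATH, and the directory and extension list are just examples.

#!/usr/bin/env python3
# Untested sketch: re-encode every video in a directory to x265,
# skipping files that already have an "x265" copy.
import pathlib
import subprocess

SRC = pathlib.Path(".")                 # directory to scan; adjust to taste
EXTENSIONS = {".mp4", ".mkv", ".avi"}   # whatever you collect

for video in sorted(SRC.iterdir()):
    if video.suffix.lower() not in EXTENSIONS or "x265" in video.stem:
        continue
    out = video.with_name(f"{video.stem} x265.mkv")
    if out.exists():
        continue
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-map", "0", "-c", "copy",
         "-c:v", "libx265", str(out)],
        check=True,
    )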

Linux: adding a drive, UUIDs

I replaced a 7-year-old drive in my Linux server. The drive had been complaining about bad blocks in SMART forever (which still mystifies me; drives should be able to just remap those). But I was seeing other signs the drive might be failing, so better safe than sorry.

The thing that made this complicated is that I wanted to be sure I understood how device names worked, so I didn’t screw up which drive was mounted where. Old school names like /dev/sdb1 seem fine, but those names can change if you remove drives, add new ones, etc. So I read up and learned the new hotness is to mount drives by naming their UUID, a stable unique identifier generated when the filesystem is created and stored in its superblock. /etc/fstab will take a UUID in the first column instead of a device name and happily mount it for you. That’s it, pretty simple. In particular udev is not involved and there are no symlinks required anywhere. I have no idea how mount finds the device named by UUID but it works, so I’m happy to remain ignorant.
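
For reference, a UUID-based line in /etc/fstab looks something like this. The UUID is made up here and the mount point is just an example; use whatever blkid reports for the real thing.

UUID=5f87b36e-9d15-4c21-a6a7-0d1b6a2f3c44  /mnt/backup  ext4  defaults  0  2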

I replaced the old 7200 RPM WD Blue with a 5400 RPM WD Blue. That’s kind of a cheap drive for a Linux server but I’m only using it as a backup volume. I keep being tempted to get an SSD for the main system volume.

Here are the steps I used for the new drive, mostly following this guide.

lshw -C disk: find the new hard drive. The easiest way is to match it by serial number, or by other characteristics like size and model name. My new disk got named /dev/sdb, which awkwardly is the same name the disk I just took out had.

smartctl --smart=on /dev/sdb: turn on SMART for the disk. Honestly I don’t exactly know what this does but it seems like a good idea.

fdisk /dev/sdb: partition the disk. fdisk is the old school MBR partitioning, which is limited to 2TB maximum. My disk is 2TB so that’s OK. Newer systems use GPT (an EFI thing) and parted. I just made one large partition for the whole disk.

mkfs -t ext4 /dev/sdb1: make the filesystem. There are some options here you could consider setting to get a bit more disk space or add checksumming to metadata, but I stuck with the defaults. Fun fact: I was taught in 1990 to print out the list of superblock backups, because if the disk failed that printout was the only way you were going to find those backup superblock locations. I assume recovery tools have improved in the last 28 years. (Or more realistically, that the disk will be a lost cause.)
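
(For the record, the kind of options I mean look something like mkfs -t ext4 -m 1 -O metadata_csum /dev/sdb1, which I haven’t tested here: -m 1 drops the reserved-block percentage from the default 5% to 1%, and -O metadata_csum turns on metadata checksumming.)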

blkid | grep sdb1: find the UUID for the new partition.

fstab: edit /etc/fstab to mount the new disk by its UUID, as in the example above.

All very easy really.


Flickr exports, fixup tool plan

Ahead of the Great Deletion, Flickr has a decent export tool built into the user settings page. You click the export button, wait a day or two for an email, and then get some ZIP files to download.

I posted a little summary of what’s in the exports on Metafilter. Long story short, I think it’s pretty much all the data and metadata Flickr has. Here’s an expanded version of it:

Photos

  • 4 zip files are my photos
  • Photos are in JPG format with EXIF tags. I’m not positive but I believe these are the original bits I uploaded from my camera, or something similar. There is a lot of EXIF data intact.
  • Filenames are the title of the photo (downcased) plus the photo’s Flickr ID
  • File timestamps are bogus dates in 2013/2014

Metadata

  • 1 zip file with a bunch of JSON files
  • Most of the JSON files are one file per photo that includes Flickr metadata. Photo title, description, time, tags, geotags, comments on the photo, etc.
  • Several other JSON files with things like all the comments I’ve made, JSON data for my photo albums (a collection of photos), etc.

Conversion tool plan

I’m not aware of any tools that do much with this Flickr metadata, but I haven’t looked hard. I’ve considered writing my own minimal tool, with an eye towards extracting the most important metadata from the JSON and stuffing it into EXIF tags; there’s a rough sketch after the list below. Ideally the resulting JPG files would then look reasonable when imported into Google Photos and/or Lightroom. Some specifics:

  • Set the JPG filename from the JSON photo title more nicely than Flickr did
  • Set the JPG file timestamp to the creation date in the EXIF data. If there is no EXIF timestamp, then take something from the Flickr JSON.
  • Insert Flickr’s JSON geotags from the photo into an EXIF geotag (if one doesn’t already exist). I geocoded a bunch of photos by hand in Flickr; I’d really like to preserve that data
  • Insert Flickr’s JSON name and description tags into the EXIF in appropriate textual fields.
  • Insert Flickr’s tags into the EXIF; is there an appropriate field?
  • Capture Flickr comments from the JSON into the EXIF?
  • Flickr JSON has a “people” tag but I don’t think I’ve ever used it.
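
Here’s the rough shape of what I have in mind, as an untested sketch. It uses the piexif library, it only covers the description / date / geotag pieces, and the JSON field names (“name”, “description”, “date_taken”, “geo”) are guesses I’d need to verify against a real export file.

#!/usr/bin/env python3
# Untested sketch: copy selected Flickr JSON metadata into a photo's EXIF tags.
# The JSON field names below are guesses; check them against an actual export.
import json
import piexif   # pip install piexif

def deg_to_rationals(value):
    """Convert decimal degrees to the EXIF ((deg,1),(min,1),(sec*100,100)) form."""
    value = abs(value)
    d = int(value)
    m = int((value - d) * 60)
    s = round(((value - d) * 60 - m) * 60 * 100)
    return ((d, 1), (m, 1), (s, 100))

def apply_metadata(jpg_path, json_path):
    with open(json_path) as f:
        meta = json.load(f)
    exif = piexif.load(jpg_path)

    # Title / description into the standard EXIF description tag.
    desc = meta.get("description") or meta.get("name") or ""
    if desc:
        exif["0th"][piexif.ImageIFD.ImageDescription] = desc.encode("utf-8")

    # Fall back to Flickr's date only if the camera didn't record one.
    if piexif.ExifIFD.DateTimeOriginal not in exif["Exif"] and meta.get("date_taken"):
        # Flickr uses "YYYY-MM-DD HH:MM:SS"; EXIF wants "YYYY:MM:DD HH:MM:SS".
        stamp = meta["date_taken"].replace("-", ":", 2)
        exif["Exif"][piexif.ExifIFD.DateTimeOriginal] = stamp.encode("ascii")

    # Geotag only if the photo doesn't already have one.
    geo = meta.get("geo")
    if geo and piexif.GPSIFD.GPSLatitude not in exif["GPS"]:
        lat, lon = float(geo["latitude"]), float(geo["longitude"])
        exif["GPS"][piexif.GPSIFD.GPSLatitudeRef] = b"N" if lat >= 0 else b"S"
        exif["GPS"][piexif.GPSIFD.GPSLatitude] = deg_to_rationals(lat)
        exif["GPS"][piexif.GPSIFD.GPSLongitudeRef] = b"E" if lon >= 0 else b"W"
        exif["GPS"][piexif.GPSIFD.GPSLongitude] = deg_to_rationals(lon)

    piexif.insert(piexif.dump(exif), jpg_path)

Renaming files from the title and setting file timestamps would be ordinary os.rename / os.utime calls on top of this.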

Logs of Lag outage

Oh boy did I screw up. Back on September 1 I pushed a tiny change to Logs of Lag, an old service I still run for League of Legends players. It was such a simple change, just adding an HTML link, so I didn’t test it carefully. It turns out I made that change on top of the master branch, which had some other changes I’d committed back in 2015 but never tested or deployed, and that new-but-old code didn’t work. The site’s been broken for 7 weeks now and I only found out when a user wrote me.

I love this commit comment past-me wrote:

Potentially breaking change: webapp use new stats. This change has not been tested manually, hence the warning in deploy.sh

I did read the commit log before pushing again, but apparently I didn’t read back far enough. Also I didn’t use my deploy script to deploy the server. Talk about shooting yourself in the foot. I even have a tiny bit of monitoring on the service but it didn’t show this kind of error, not that I pay attention to the monitor anyway.

The real problem is that I’ve abandoned this project; I haven’t done real development on it in 4 years. It’s kinda broken now anyway, since the file format the code parses has changed over time and I haven’t kept up. I’m now 100% fed up with League of Legends and Riot Games, given how sexist and awful that company is, so I have no motivation to do more. But the tool is still useful so I’ve tried to keep it online. (Past me did one thing right: the site is mostly just static files, so it’s not hard to keep running.)

The site doesn’t get a lot of usage; my estimate last year was 50-100 unique users a day with roughly 450 uses of the tool a day. Here’s a graph of the rate of usage; you can see an organic falloff over the whole year, then a sharp drop on September 1 when I broke the site. I wonder if it will recover.

[Graph: logsoflag2-year.png, Logs of Lag usage rate over the past year]

PS4 6.02: more external storage woes

As I wrote in an earlier post, on a PS4 there’s no way to make a copy of a game’s download files to an external drive. You can move the files to a drive but they are then deleted from the source. Which is a huge PITA when the game requires a 90 GB download.

But it gets better. If you plug an external drive with a copy of a game into a PS4 that already has a copy of that game, the software freaks out. It insists you delete one of the two copies before you can use the external drive. There’s no way to tell it to ignore the duplicate to, say, let you get at some other game on the external drive. You must delete first.

So not only can you not create copies of games, but if you screw up you’ll be forced to delete a copy you downloaded. Argh!

(I reiterate: none of this is about copy protection; the PS4 online DRM will prevent you from playing the game if your login doesn’t have a license for it, whether you have a copy on the drive or not.)

PS: copying to the external drive is awfully slow. A 50 GB game image is taking 18 minutes, which works out to about 46 MB/s. That’s just about USB 2.0 speeds. The drive, the cable, and the PS4 itself are all supposed to support USB 3.0. Maybe it’s the spinning drive’s speed limiting things.

Yet more Python ORMs

I continue to fail to use ORMs with Python. They’re just so complicated and mysterious. But maybe that’s really just SQLAlchemy and I should look elsewhere.

PeeWee popped up on my radar recently, an ORM explicitly designed to be small and simple. It looks pretty good. I’ve also heard a couple of people mention PonyORM recently, but it seems far too magic.
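
For a sense of what small and simple looks like, here’s a minimal peewee example, sketched from their docs rather than code I’ve actually run; the database file and the Photo model are made up.

# Minimal peewee sketch (untested): define a model, create the table, query it.
from peewee import SqliteDatabase, Model, CharField, DateTimeField

db = SqliteDatabase("photos.db")   # file name is just an example

class Photo(Model):
    title = CharField()
    taken = DateTimeField(null=True)

    class Meta:
        database = db              # bind the model to the database

db.connect()
db.create_tables([Photo])

Photo.create(title="Golden Gate at dusk")
for photo in Photo.select().where(Photo.title.contains("Gate")):
    print(photo.id, photo.title)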

Going even simpler, I should circle back and spend more time with records and/or dataset. They’re not even really ORMs, just convenience wrappers for database rows, and that seems fine by me. It still bugs me that they both depend on SQLAlchemy, even if the actual interaction is minimal.