Duplicati remote backup notes

Some notes on Duplicati for remote Linux backups. Primary opinion: seems reasonable, needs better docs.

I need a new remote backup solution for my home Linux box now that CrashPlan is truly closing up. In the past I hacked up a remote rsnapshot option, but I wanted something more user friendly. From the Hacker News discussion it seems Duplicati is the consensus choice. The other option I should explore is rsync.net.

I was backing up about 50GB of stuff on CrashPlan. At Amazon pricing that’d be about $1.30 / month. rsync.net would be $4.00/month. I can probably do this for free now on a server I have lying around in a datacenter. The fact the blocks are encrypted makes this all much more reassuring.

Installing

Duplicati runs as a systemd service. It has a web GUI listening on port 8200 and some sort of schedule thing where it runs every few hours. It stores backups in some Duplicati-specific database format, as encrypted 50MB chunks. The nice thing about Duplicati is it can store those data chunks on a variety of offsite backends, including any FTP, SSH, WebDAV, or S3-like service. There’s also support for specific services like AWS, Dropbox, etc.

Installing was kind of awkward: I followed the instructions for this headless install since the Debian/Ubuntu package they provide apparently requires an X environment. Even so I still had to install Mono, which is an awful lot of packages. Looks like the thing is written in C#.

Configuring seemed simple. I’m starting with just backing up my source code to local disk. I have a convention of putting files in “nobackup” directories if they are huge downloads I don’t want to back up, so I added a filter for that: “-*/nobackup/”. There’s also a default set of Linux filters, which seems to be about not backing up any system files. That includes stuff like /etc, which honestly you probably want backed up, but it seems reasonable for backing up home directories. I’m half-tempted to not back up my Python virtual environments; I’ve got a bunch of 500MB monstrosities. But it’s a PITA to rebuild them, and it’s safer to just back up everything.

I made one config mistake, which was enabling bandwidth throttling; it turns out that applies even to local disk backups. I do want to throttle network backups eventually.

Running

Anyway, set it all up and started it running. Seems to be doing something, judging by the 100% CPU usage of the mono-sgen process running Duplicati. The docs mention everything being compressed so I guess that’s where the CPU is going.

I tested this with about 19 gigabytes of files, 9 gig of which were excluded by the nobackup filter. The first run took 30 minutes. Duplicati said it was 9 gig to back up and 5 gig stored, which seems about right.

Second run with basically no changes took 1 minute. Backup directory expanded by about 5 MB.

A restore of a 770MB directory took less than a minute. It restored everything right, including timestamps and file permissions.

Remote backup

The local disk test went so well I went ahead and set up an ssh remote backup to a Linux server I own. I created a new user on that system, then configured Duplicati to back up to that host with a saved username / password. (There’s an option for ssh keys too.) That’s about all I had to do; it’s backing up as I write this. I did set up a network throttle at 400 KBytes/second, which is 3.2 Mbits/second. The transfer seems to be consuming 3.46 Mbits/second, so there’s about 260 kbps of overhead. Probably TCP. CPU usage on the backup process is mostly about 3% when running throttled like this, with brief bursts of 100% activity. A second backup and a restore both worked fine.

Opinions

I like the product! It works well and simply. It could probably replace what I use rsnapshot for as well as my remote backups.

The documentation for the project is pretty poor, with stuff spread out over a few articles, wiki pages, and forum postings (!). Par for the course for free software. Development is also kind of slow; it’s been 2+ years of work on 2.0 and it’s only sort of in beta now. OTOH it all seems to work, and is free, so I shouldn’t be complaining.

I’m a little nervous about my backups being in some unknown database format. OTOH the code is open source, absolute worst case presumably some nerd could figure out how to solve any problem.

 

Blu-Ray to x265

I am super excited about the Vinegar Syndrome re-release of Liquid Sky. It’s one of my favorite movies from the 80s; most of us watched it on crappy 3rd generation VHS tapes or, if we were lucky, a low quality DVD. Vinegar Syndrome remastered it and put out a beautiful Blu-Ray with gorgeous color and lots of extra features. Only problem: it’s a disc. Who can play discs these days? So I ripped it. (For myself only; I respect Vinegar Syndrome far too much to give out copies.)

Blu-Ray discs are encrypted so most computer tools can’t read them. But MakeMKV can. It’s a straight up rip; it copies all the tracks off of the Blu-Ray and puts them in MKV containers on your hard drive. No re-encoding so the 2 hour movie is like 29 GB.

Re-encoding it to a smaller file is easy. Handbrake will do the job fine but for some reason was running slowly for me, so I used ffmpeg instead. Here’s the command:

ffmpeg -hwaccel auto -i title00.mkv \
    -map 0:v -c:v libx265 -crf 26 -preset slow -tune grain \
    -map 0:a:0 -c:a:0 aac \
    -map 0:a:1 -c:a:1 copy \
    -map 0:s:0 -c:s copy \
    ls.mkv

I’m doing this the complicated way with -map options so I can pick and choose audio tracks. Track 0 is the main audio, track 1 is director’s commentary. There’s also a track 2 which is the sound without dialog, but I decided not to include it. Just the one video track of course, and one subtitle track.

I transcoded the main audio. The source is 768kbps DTS, overkill particularly since the audio is mono (as the original film was), so I re-encoded it to AAC. I didn’t specify a bitrate, so I had no idea what the encoder would give me; the result is about 150kbps, which seems reasonable, if generous.
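If I’d wanted to pin down the bitrate instead of taking the encoder’s default, ffmpeg lets you set it per stream; replacing the audio options above with something like this should do it (160k is just an illustrative number, not what I used):

-map 0:a:0 -c:a:0 aac -b:a:0 160k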

I also transcoded the video to H.265, the result is about 3300kbps for 1920×1080. crf 26 is higher quality than the default crf 28. Preset slow trades off CPU time for higher quality / smaller video. And “tune grain” tweaks the encoder to try to preserve film grain, something abundantly visible in the source disc.

I wanted to transcode the subtitles to SRT, but that didn’t work for me; ffmpeg threw an error. In retrospect that makes sense: Blu-Ray subtitles are PGS bitmaps, and ffmpeg will only convert text subtitles to text or bitmaps to bitmaps, so going to SRT would require OCR.
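If all I wanted was the bitmap subtitle track out of the container, so some other tool could OCR it to SRT later, I believe mkvextract from the mkvtoolnix package can dump a PGS track to a .sup file. Something like this, where the 2 is whatever track ID mkvinfo reports for the subtitles (a guess, not a command I actually ran):

mkvextract tracks title00.mkv 2:subtitles.sup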

The metadata mostly came through, but some of it is wrong; bitrates and the like were copied from the Blu-Ray. Chapter marks did come through from the original source.

All told it took about 18 hours or about 0.1x real time speed. That slow preset really makes it slower, something like 4x slower.

Hardware accelerated ffmpeg transcoding

Since I have a Linux VM with a big GPU lying around I thought I’d take a quick stab at hardware accelerated video transcoding using ffmpeg. Quick results:

  • software decode, h264 software encode: 0.355x real time
  • software decode, h264_nvenc hardware encode: 2.7x real time
  • cuvid hardware decode, h264_nvenc hardware encode: 2.8x real time

So the hardware h264 encoder is about 7x faster than software. Hardware decoding is nice but not a huge improvement.
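For reference, the shape of the commands I was timing was roughly this (reconstructed after the fact, so details may differ); filenames are placeholders and I’ve left out the audio options:

ffmpeg -i input.mkv -c:v libx264 output.mkv
ffmpeg -i input.mkv -c:v h264_nvenc output.mkv
ffmpeg -c:v h264_cuvid -i input.mkv -c:v h264_nvenc output.mkv

The first is all software, the second is software decode with hardware encode, and the third asks for the cuvid hardware decoder on the input as well.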

This is all on a 1 vCPU, K80 Google Compute server. Crappy little CPU and big beefy GPU. Normal desktop machines probably have less of a spread in results.

I really wanted to test this with H265 encoding, but the hevc_nvenc encoder does not work with the Kepler-class K80 hardware. I’m regularly doing H264 -> H265 transcoding at home now and on my i7-2600 it transcodes 720p video at just about real time.

Between the video transcoding and the machine learning I’m itching to own my own Linux box with a GPU in it. I could add a GPU to the existing server but it seems to be $200 minimum. The old server is 6 years old, maybe it’s time for a full hardware upgrade. Go for SSDs. Ubuntu 17.10 while I’m at it; these LTS releases are reliable but limiting.

I wonder if there’s a way to use the on-chip Intel GPU more? Probably not worth the hassle.
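I haven’t tried it, but ffmpeg does ship VAAPI encoders (h264_vaapi, hevc_vaapi) that can use the Intel GPU. The usual incantation is something like this, assuming the render node is /dev/dri/renderD128:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mkv -vf 'format=nv12,hwupload' -c:v h264_vaapi output.mkv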

 

Leela Zero, GPU machine learning on Google Cloud

I’ve been excited about the Leela Zero project that’s doing machine learning with volunteered computers for the game Go. I’ve been running it on my Windows box for a few days but wanted to try running it on Linux.  So I leased a Google Compute server with a GPU.

End result: it costs about $0.10 / training game on Google Compute. It might be possible to improve that by a factor of … 4? 10? … by picking a better hardware type and tuning parameters. My Windows desktop machine, with beefier hardware, uses about $0.01 / game in electricity. AlphaGo Zero trained itself in under 5M games. If Leela Zero does as well, it’d cost well under $1M to train it up to superhuman strength. Details vary significantly though; Leela Zero is so far not learning as fast. And the time to play a game will go up as the network learns the game.

Here are my detailed notes. This is my first time setting up Google Cloud from scratch, and my first time doing GPU computation in Linux, so all is new to me. Nonetheless I got Leela Zero up and running in under an hour. One caveat: Google Cloud gives new users a $300 free trial. But you cannot apply that balance to a GPU machine. The cheapest GPU machines are about $0.50/hour.

Update: someone on Reddit notes you can get similar machines from Amazon EC2 at a $0.23/hour spot price. They have more CPU too, so maybe it gets down to $0.03/game?

 

Setting up a Google Cloud machine

The only subtle part of this is configuring a machine with a GPU.

  1. Create Google Cloud Account. Attach a credit card.
  2. Create a new Project
  3. Add SSH keys to the project
  4. Create a server
    Compute Engine > VM instances
    Create an instance. Choose OS (Ubuntu 17.10) and add a GPU. GPUs are only available in a few zones.
  5. I picked a K80, the cheap one; $0.484 / hour. The P100 is 3x the price.
  6. Got an error “Quota ‘NVIDIA_K80_GPUS’ exceeded. Limit: 0.0 in region us-west1.”
  7. Upgrade my account past the free trial.
  8. Try again, get same error
  9. Go to quota page. Find the “Edit Quotas” button. Request a single K80 GPU for us-west1. Have to provide a phone number. This seems to be a manual request that requires human approval, but it was approved in just a minute or two.
  10. Try a third time to set up a machine. Wish that my template machine had been saved. Works!
  11. Log in to the machine via IP address. It’s provisioned super fast, like 1 minute. Ubuntu had already been updated.

Setting up GPU drivers

Mostly just installing a bunch of Linux packages.

  1. Try seeing if I can do GPU stuff already
    # apt install clinfo; clinfo
    Number of platforms 0
    Nope!
  2. Figure out what hardware I have
    # lspci | grep -i nvidia
    00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
    This is a late-2014 $5000 GPU. Retails for $2500 now. It’s got 24GB of VRAM in it compared to gamer cards’ 6GB or so. It’s really two GPUs in one package, but I think I’m only allowed to use one? Probing shows 12GB of GPU memory available to me.
  3. Follow the instructions for installing drivers from Google.
    This boils down to installing the cuda-8-0 package from an NVIDIA repo. It installs a lot of crap, including Java and a full X11 environment. It took 6 minutes and my install image at the end is about 6.2GB.
    Note there are many other ways to install CUDA and OpenCL, I’m trusting the Google one. Also there’s a cuda-9-0 out now but I am following the guide instead.
  4. Enable persistence mode on the GPU. I have no idea what this means; isn’t it delightfully arcane?
    # nvidia-smi -pm 1
    Enabled persistence mode for GPU 00000000:00:04.0.
  5. Verify we now have GPUs
    # clinfo
    clinfo: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by clinfo)
    Number of platforms 1
    Platform Name NVIDIA CUDA
    Platform Vendor NVIDIA Corporation
    Platform Version OpenCL 1.2 CUDA 9.0.194
  6. Further optimize the GPU settings based on Google docs.
    # sudo nvidia-smi -ac 2505,875
    Applications clocks set to “(MEM 2505, SM 875)” for GPU 00000000:00:04.0
    All done.
    # nvidia-smi --auto-boost-default=DISABLED
    All done.

Compiling and running Leela Zero

Mostly just following the build instructions for leelaz and for autogtp.

  1. Install a bunch of libraries. Took about a minute, 500 MB.
  2. Compile leelaz (a rough sketch of the build commands follows this list).
  3. Run leelaz. It shows you an empty Go board.
  4. At this point you’re supposed to hook up a fancy graphical Go client. But screw that, we’re hacking.
    play black D3
    genmove white
    genmove black
    Here’s a sample output from one move
  5. Compile autogtp
  6. Copy the leelaz binary into the autogtp directory
  7. Run autogtp. Here’s a sample output
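For the record, the whole build boiled down to something roughly like this. This is from memory and the package names are what I recall from the Leela Zero README at the time, so treat it as a sketch rather than a recipe:

sudo apt install libboost-dev libboost-program-options-dev libopenblas-dev opencl-headers ocl-icd-opencl-dev zlib1g-dev qt5-default qt5-qmake
git clone https://github.com/gcp/leela-zero
cd leela-zero/src && make           # builds the leelaz binary
cd ../autogtp && qmake && make      # qt5-default makes qmake point at Qt 5
cp ../src/leelaz . && ./autogtp     # autogtp wants leelaz in its own directory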

Performance

I didn’t benchmark carefully but I put this here because it’s most likely to be of general interest. Run times are highly variable; my first game on Google Cloud took 720 seconds for 270 moves, my second game lasted 512 moves (!) and took 1028 seconds. So comparing time / game for just a few games is not useful. Perhaps the ms/move numbers are comparable or at least useful for finding optimal work settings, but even they seem highly variable in ways I can’t understand. Benchmarking this for real would take a more serious effort.

  • Google Compute. 1 vCPU, K80 GPU.
    1 thread: 63% GPU utilization, 800 seconds / game, 2650 ms / move
    2 threads: 69% GPU utilization, 1200 s/game, 2500 ms/move.
    4 threads: 80% GPU utilization, 1113 s/game, 2110 ms/move
    10 threads: 57% GPU utilization, ?
  • Windows desktop. i7-7700K (4 cores), GTX 1080 GPU
    1 thread: 612 s/game, 1700 ms/move, 30% GPU
    2 threads: 389 s/game, 1120 ms/move
    3 threads: 260 s/game, 740 ms / move. 70% GPU
    4 threads: 286 s/game, 701 ms / move. 84% GPU
    5 threads: 291 s/game, 750 ms / move
    8 threads: 400 s/game, 740 ms / move, 80% GPU

Bottom line, I’d say the Google Compute systems are roughly 800 seconds / game, or 5 games an hour. That pencils out to about $0.10 a game. My Windows box with better hardware is about 3-4 times faster. I’m guessing it uses about 200W (didn’t measure), which is about $0.08 / hour or < $0.01 / game in electricity costs.

I’m confused about the performance numbers. I think those are normalized by number of simultaneous games (the -g parameter), so lower numbers are always better. The seconds/game number is highly volatile though since game length varies so much. I guess the ms/move parameter goes down on the Linux box with more threads because we use more of the GPU? But why not the same pattern on Windows? FWIW the program author has noted Windows performance is not great.

Leelaz seems to only use 63MiB of GPU memory, so a very low RAM graphics card is probably fine.

One last thing: I’ve been running the Windows .exe binary under Windows Subsystem for Linux. 4 threads in this environment is 710 ms / move. 4 threads in a DOS window is 777 ms / move. Not sure it’s significant, but seemed worth noting.

 

soup-to-nuts Jupyter notebook project

I just finished a short project I was working on, an analysis of data from League of Legends about how people are using the new rune system the game just introduced. The report is here and speaks for itself; writing here to talk about how I wrote this up. For context the project consisted of scraping about 250 web pages off a site, parsing them, stuffing data into a database, then analyzing it for patterns.

I decided to do this project as a purely Jupyter notebook project. In the past I’ve used notebooks frequently for data exploration but seldom as my full workflow. In particular it’s not clear notebooks are well suited for production scripts like “scrape this website” but I decided to do it anyway, with some success.

Some notes:

  • IPython has a thing where if an output cell is too tall it will make it smaller with an internal scrollbar. You can adjust that size with some simple Javascript (see the snippet after this list).
  • I wish there were a way to hide cells. I want to keep some code around but I don’t want it to show in whatever report I’m publishing.
  • I hid the code entirely so civilians don’t get overwhelmed. I did this with a “toggle code” button borrowed from here.
  • I wish notebooks had explicit named checkins. I find the commit cycle of git is good discipline for getting me to do only one thing at a time. There’s a Checkpoint feature in notebooks which is similar but it doesn’t prompt you for a comment so I never use it.
  • It seems worth noting that my virtualenv for the project was 550MB. Pandas is big but I wanted it. Somehow I ended up with all of scipy even though I just wanted a few colormaps from Seaborn.
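The Javascript I have in mind for the output cell scrollbar is just tweaking a threshold in the classic notebook, run from a %%javascript cell (the 9999 is an arbitrary “never auto-scroll” value):

%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;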

Scraper

I wrote the scraper with BeautifulSoup. As noted before there’s a nice addon for Jupyter that gives notebook previews of scraped data. Interactively working towards robust scrapers was really nice in a notebook. I ended up creating a main loop to scrape all the pages I needed and write them to a simple shelf database. The full scrape only took a few minutes so I just ran it interactively; this wouldn’t work nearly as well for a run that takes hours or is perpetual. One clever thing I did was create a new shelf database for each run, so I could keep old data easily.
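The main loop was nothing fancy. A minimal sketch of its shape, with invented names and none of the real error handling:

import shelve
import requests
from bs4 import BeautifulSoup

page_urls = ['https://example.com/page1']   # stand-in for the ~250 real pages

# one shelf file per run, so old scrapes stick around (filename invented here)
with shelve.open('scrape-2017-12-12') as db:
    for url in page_urls:
        html = requests.get(url).text
        soup = BeautifulSoup(html, 'html.parser')
        db[url] = str(soup.title)           # stand-in for the real parsing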

Data processor and database creator

My second notebook loaded the scraped data from a shelf and processed it down to a more query-friendly SQLite database. Like the shelf files, I kept one timestamped database per run so I could easily go back to old data. Interactive development was nice for this part too, particularly being able to do some very simple sanity check reports along the way.

I used the dataset module to create the database for me. It’s quite nice; create a list of dicts and it just makes a schema from them and writes the data, no fuss.
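The dataset usage really is about this simple; the file, table, and column names here are invented, not my actual schema:

import dataset

db = dataset.connect('sqlite:///runes-2017-12-12.sqlite')
table = db['pages']                  # the table is created on first insert
table.insert_many([
    {'champion': 'Ahri', 'rune': 'Electrocute', 'games': 1234},
    {'champion': 'Ahri', 'rune': 'Aery', 'games': 567},
])
print(table.count())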

Ad hoc queries

For my primary data exploration I used ipython-sql to run SQL code directly in notebook cells, no Python required. This was my favorite use of notebooks and no surprise, it’s exactly the kind of work Jupyter is for. Write query, see data as HTML table, done. ipython-sql also lets you mix in Python code; you can capture the results of queries as result sets and do things with them. I started trying to get fancy with that and realized it wasn’t working very well; better to stick to mostly SQL. Also the presentation options are very limited; once I started thinking about sharing this data with other people I wanted to give it more polish.

Final report

For my final report I made a new notebook using Pandas DataFrames with the SQL queries developed in the ad hoc notebook. I mostly used Pandas as a better table presenter; it makes it easy to round numbers off to 2 digits or color data cells based on their contents. I also ended up using Pandas code to manipulate my data after it came from the database, contortions that would be awkward to do in SQL. This all worked fairly well, but in retrospect part of me thinks I should have gone with styling JSON blobs using D3 instead. That’s more work though.
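The Pandas part amounts to things like this; the query, column names, and colormap are placeholders, not my actual report code:

import sqlite3
import pandas as pd

conn = sqlite3.connect('runes-2017-12-12.sqlite')
df = pd.read_sql_query('select champion, rune, games from pages', conn)
df = df.round(2)                              # round numbers off for display
df.style.background_gradient(cmap='YlGnBu')   # color cells by value (renders in the notebook)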

The result of this sheet is the report itself and while it’s a fine Notebook it’s not a very nice deliverable for my target audience. LoL players are used to infographics like this, not boring tables. I’m uninterested in doing that kind of presentation though, so I stopped there.

Update: I posted this to Reddit and got on the front page, as much attention as I expect for a project like this. OTOH most of the discussion doesn’t seem informed by the data I presented. Good ol’ Internet people. Lesson for me in the value of presenting simple conclusions, not data.

simple database exploration

Following up on an earlier post wanting to be able to do database things simpler, some nice results.

I’m liking dataset as a quick way to write data from Python into a database. Basically you just insert dicts into a table; it takes care of generating a schema, selecting types for things, etc. It helps with reading data too, but there it doesn’t do much beyond wrapping rows in dicts, which is nice, but I wanted more. (It also has support for fancy Python-side queries, but I just want raw SQL.)

For reading data I’m liking ipython-sql. It adds a nice Jupyter interface on top of a raw SQL repl like that provided by sqlite or psql. It’s quite simple to install. And since it’s all implemented as an IPython magic you can mix it in with Python code in your notebook. I’m mostly just using it as a way to display SQL results as HTML tables; simple, but very useful. It has some support for fancier stuff like plotting and exporting the result sets into Python objects, Pandas DataFrames, and CSV files.
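In a notebook the whole thing is a couple of magics; the database name and query here are placeholders:

%load_ext sql
%sql sqlite:///runes-2017-12-12.sqlite

%%sql
select champion, sum(games) as total
from pages
group by champion
order by total desc
limit 10

Capturing into Python works too; result = %sql select … gives a result set, which I believe has a .DataFrame() method for handing off to Pandas.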

 

Python shelve: db type could not be determined

I hit a funny bug with the Python shelve module. I’d write a shelf database, then try to open it again and get this error:

db = shelve.open('foo.db')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/shelve.py", line 243, in open
    return DbfilenameShelf(filename, flag, protocol, writeback)
  File "/usr/lib/python3.6/shelve.py", line 227, in __init__
    Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
  File "/usr/lib/python3.6/dbm/__init__.py", line 88, in open
    raise error[0]("db type could not be determined")
dbm.error: db type could not be determined

Turns out this is my mistake. I created the shelf with code like this:

db = shelve.open('foo')

This creates a file named foo.db, a Berkeley DB file. And then I tried to open it:

db = shelve.open('foo.db')

See the error? The shelve module appends the .db suffix for me; I shouldn’t have added it to the filename myself. It’s a simple enough mistake, but the resulting error is confusing. It’d be nicer if the library either gave a clearer error or just opened a new database file named foo.db.db.
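So the fix is just to be consistent and let shelve manage the suffix:

import shelve

db = shelve.open('foo')      # creates foo.db on this system
db['key'] = 'value'
db.close()

db = shelve.open('foo')      # reopen with the same name; no .db
print(db['key'])
db.close()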

Python 3.6.3 on Ubuntu 16.04, which I think is libdb 5.3.

Note this error is different from an often-referenced bug in older Pythons not recognizing GDBM databases consistently (issue 13007).