archive.org and HTTP seeded BitTorrent

archive.org announced BitTorrent support for downloads of their 1m+ files. The way they’re doing it is interesting: HTTP seeded BitTorrent, a widely supported BitTorrent extension. In addition to the normal download of torrent chunks from peers, the BT client can also fetch byte ranges from an HTTP or FTP URL that’s published in the torrent. So archive.org doesn’t need to do anything too complicated with seedboxes; they just publish .torrent files that reference their existing HTTP resources, and BitTorrent clients can download from those. But the BT clients can also do the P2P thing and fetch chunks from each other, taking some bandwidth load off archive.org and solving flash crowd problems.
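To make "a URL that’s published in the torrent" concrete, here is a minimal sketch of what such a .torrent could contain. It assumes the `url-list` key from the BEP 19 flavor of web seeding; I haven’t confirmed that this is the exact variant archive.org uses. The tracker URL, the web seed URL, the file name and the tiny piece length are all made up for illustration.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, strings, bytes, lists and dicts in bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


# Toy content split into 4-byte pieces (real torrents use much larger pieces).
content = b"hello world"
piece_length = 4
pieces = b"".join(
    hashlib.sha1(content[i:i + piece_length]).digest()
    for i in range(0, len(content), piece_length)
)

metainfo = {
    "announce": "http://tracker.example.org/announce",  # hypothetical tracker
    "url-list": ["https://example.org/download/"],      # hypothetical web seed
    "info": {
        "name": "hello.txt",          # hypothetical file name
        "length": len(content),
        "piece length": piece_length,
        "pieces": pieces,             # concatenated 20-byte SHA-1 digests
    },
}

torrent_bytes = bencode(metainfo)
```

A client that understands the extension treats the `url-list` entries as extra sources alongside whatever peers the tracker returns.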

There’s an impedance mismatch between BitTorrent and HTTP. HTTP is for downloading single files, ideally all in one stream, although with Range requests you can get chunks from the middle. A single torrent can have multiple files in it, its chunks are cut from the files laid end to end, and it’s natural to download those chunks out of order. The BEP explains how the chunks map onto HTTP requests.
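The mapping from chunks to HTTP requests is mostly arithmetic. Here is a rough sketch, assuming the usual layout where a multi-file torrent treats its files as one concatenated byte stream cut into fixed-size chunks. The function names and the example file list are mine, not anything from the BEP.

```python
from typing import List, Tuple


def piece_to_ranges(
    piece_index: int,
    piece_length: int,
    files: List[Tuple[str, int]],
) -> List[Tuple[str, int, int]]:
    """Map one chunk to (path, first_byte, last_byte) ranges, inclusive.

    `files` is a list of (path, length) in torrent order. A chunk that
    straddles a file boundary yields one range per file it touches.
    """
    total = sum(length for _, length in files)
    start = piece_index * piece_length
    end = min(start + piece_length, total)  # exclusive; last chunk may be short

    ranges = []
    offset = 0  # position of the current file in the concatenated stream
    for path, length in files:
        lo = max(start, offset)
        hi = min(end, offset + length)
        if lo < hi:
            ranges.append((path, lo - offset, hi - offset - 1))
        offset += length
    return ranges


def range_header(first_byte: int, last_byte: int) -> str:
    """Format an HTTP Range header value for an inclusive byte range."""
    return f"bytes={first_byte}-{last_byte}"
```

With `files = [("a.txt", 100), ("b.txt", 50)]` and 64-byte chunks, chunk 1 straddles the boundary between the two files, so the client has to issue two Range requests to two different URLs and stitch the results together before it can check the chunk’s hash.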

The other mismatch is that the HTTP contents located at a particular URL can and do change. The file contents described by a particular .torrent can’t really change; the torrent records checksums of the content, so if you change the content then the torrent is invalid. archive.org punts on this problem and just says that sometimes they change the HTTP contents and so you need to use a fresh/up-to-date .torrent. Expedient, if not particularly clean.
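Here is a small sketch of why a stale .torrent stops working once the HTTP content changes. It assumes the standard torrent layout, where the `pieces` field is a concatenation of 20-byte SHA-1 digests, one per chunk. The helper names and sample data are made up for illustration.

```python
import hashlib

SHA1_LEN = 20  # bytes per digest in the torrent's `pieces` field


def build_pieces(content: bytes, piece_length: int) -> bytes:
    """Concatenate the SHA-1 digest of each fixed-size chunk of `content`."""
    return b"".join(
        hashlib.sha1(content[i:i + piece_length]).digest()
        for i in range(0, len(content), piece_length)
    )


def piece_is_valid(pieces: bytes, piece_index: int, data: bytes) -> bool:
    """Check downloaded chunk `data` against the digest stored in the torrent."""
    expected = pieces[piece_index * SHA1_LEN:(piece_index + 1) * SHA1_LEN]
    return len(expected) == SHA1_LEN and hashlib.sha1(data).digest() == expected


# The torrent was made from the old content; the web server now has new content.
old_content = b"hello world, old edition"
new_content = b"hello world, new edition"
stale_pieces = build_pieces(old_content, 8)
```

Chunks fetched over HTTP from the changed file only pass the check where the bytes happen to be unchanged. Here the second 8-byte chunk of `new_content` fails against `stale_pieces`, so a client holding the old .torrent would reject it and never complete the download.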