MacOS unzip is ancient and busted

Tracked down an OpenAddresses problem with our primary data product, a 3GB zip file. MacOS unzip can’t unzip it, producing errors like this:

skipping: us/il/macon.vrt         need PK compat. v4.5 (can do v2.1)

The underlying problem is that MacOS El Capitan (10.11.4) ships the ancient UnZip 5.52 of 28 February 2005. This version does not work correctly on files larger than 2GB. The usual recommended solution is to install p7zip via Homebrew and use that to unzip the big archive. A lower impact option is to install the Homebrew dupe for unzip, which is UnZip 6.00 of 20 April 2009. That’s a bit awkward since, being a dupe, Homebrew won’t symlink the binaries by default.
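Another option that needs no install at all: Python's standard zipfile module understands the ZIP64 extensions that large archives use. Here is a minimal sketch; extract_archive and the paths are names made up for illustration, and it has only been tried on small archives.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract every member of a zip archive; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # zipfile reads ZIP64 archives, the format the "v4.5" error refers to
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

The same module has a command line form, python -m zipfile -e archive.zip dest/, which does roughly the same thing without writing any code.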

This is not the first time I’ve wasted time working around Apple’s awkward user space tools. The less file reader is also an outdated version, and last I checked was compiled with weird options like not respecting the LESSKEY environment variable.

I really should just stick to Ubuntu.

date +%s

Sharing a quick hack I use all the time

while :; do
  wget -q -O `date +%s.json` example.com/data.json
  sleep 60
done

Very simple shell hackery to download a URL once a minute and save it in a file named with a timestamp. The only mildly subtle thing there is date +%s: that is a date format which is seconds since epoch, which is a conveniently sortable thing for filenames and timestamps and stuff.

(The while : instead of while true is idiosyncratic and anciently historical. In some very bizarre contexts true is the program /bin/true, whereas : is always a builtin.)
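For comparison, here is roughly the same loop in Python. This is a sketch, not a drop-in replacement: timestamp_filename and poll are made-up names, urllib needs a full URL with a scheme (unlike wget), and unlike wget -q in the shell loop it stops with an exception on the first failed download.

```python
import time
import urllib.request


def timestamp_filename(now=None, suffix=".json"):
    """Name a file by whole seconds since the epoch, like `date +%s.json`."""
    seconds = int(time.time() if now is None else now)
    return f"{seconds}{suffix}"


def poll(url, interval=60):
    """Download url once per interval, forever, into timestamped files."""
    while True:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        with open(timestamp_filename(), "wb") as out:
            out.write(data)
        time.sleep(interval)
```

One caveat on "sortable": the names sort chronologically as plain strings only while they all have the same number of digits, which holds for 10-digit timestamps from 2001 until 2286.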

I have a bad habit of leaving scripts like the above running and forgetting about them, only to come back to a directory with 100,000 files in it.

Things I learned about systemd

systemd is the Linux startup system for Ubuntu 16.04, so I figured it’s time I learned about it. I’m very late to the party; systemd started some 6 years ago and became a real thing many Linux distributions used years ago. Ubuntu is the last big distro to include it by default except Gentoo, and even Ubuntu switched last year with 15.04.

There’s been a years-long debate in the Linux community about systemd. A younger me would have cared about that, maybe participated. These days I don’t care, I just want to use the consensus tech. The bummer is the flamewars dominate the conversation, so it’s hard to learn just what systemd is. Here’s my effort to just understand it.

Disclaimer: I’m not an expert and have undoubtedly made a mistake here somewhere. I reiterate my reminder that I really only write this blog as notes for myself. You’re welcome to comment, but please don’t waste my time with heated opinions.

References

I don’t have more links on actually using systemd in practice because I don’t yet have a system running it myself. That’s how out of date I am. The key pragmatic thing seems to be the unit files which configure how systemd launches services. systemd can also still launch stuff with old SysVInit-style scripts, including parsing the LSB headers.

Stuff I learned

systemd is init, the first user process on a Linux system, PID #1. Once the kernel is done it runs PID 1, whose job is to run all the other user stuff that makes a Unix system: mounting filesystems, configuring networking, starting daemons, etc.

The init most of us know is SysVInit, the shell scripts in /etc/rc.d. It hasn’t aged well in a modern era of multiple cores, dynamic devices, etc. Upstart is another Linux init system that Ubuntu has used for a long time, but no one seems to love it. Launchd is the MacOS init and deserves some respect for its speed and power; the systemd authors explicitly acknowledge launchd for inspiration.

systemd is a full rewrite of the idea of init. The primary goal seems to be efficiency. In the bad old days a Unix system could take 90 seconds to boot. With systemd (and MacOS launchd) < 1 second is possible and < 10 seconds is common. systemd is faster because it can do things in parallel. Also it avoids all the overhead of interpreting shell scripts. And supposedly it’s just modern and new and “better”, which I can’t really evaluate but can believe given the mess of stuff that preceded systemd.

systemd is Linux-only. It uses a bunch of non-POSIX APIs to do things like work with devices, configure security capabilities, etc.

I still haven’t used systemd hands-on so I haven’t learned much else. But my understanding is a whole bunch of Unix config gets consolidated into unit files in one place, which then tells systemd how to configure and launch user space programs.

I’m a little terrified of the Ubuntu 14 to 16 upgrade, since it has to replace all the old stuff with the new. Will it work? I’m particularly reminded of an Ubuntu 12 to 14 upgrade I did where I manually installed something that changed the name of my ethernet device, making the machine unable to reboot with networking. Oops.

Opinions

I really don’t care about the history of the controversy; systemd’s a fait accompli. But it’s helpful to look at the debate to understand better what systemd is.

The big complaint about systemd everyone has is complexity. It does a lot of stuff in one big nest of 60+ interlocking binaries. It’s not just forking daemons. It’s got udev inside it to manage devices. It’s got its own logging system (in a binary format), a cron equivalent, an inetd equivalent, an atd equivalent, virtual terminals, a login daemon, watchdogs, etc. When you look at all the old Unix software that’s replaced it’s sort of astonishing and I can understand why believers in the “lots of little tools” philosophy don’t like it.

OTOH, the way the old components fit together in sysvinit was quite a hairball too. The shell hackery required to mount devices, then filesystems, then get networking up so you could access other filesystems and start other daemons in the right order and blah blah blah. Sure udev and mount might have been separate software, but they could only run correctly in one specific relationship to each other. That complex set of interdependencies had a lot of cost and somewhat made a lie of “lots of little tools”. Also there’s something to be said for unifying init and cron and inetd and login and the like; they’re all about launching programs in response to events, why not share the code?

There’s still a cultural concern that systemd, being a monolithic project, comes from one mindset only. Some of the choices seem odd to me; a binary log format for instance. That’s the part of the debate I can cheerfully withdraw from. For whatever reason most Linux distros have agreed on using systemd, and I trust that they all have the influence to shape it the way they need. That seems like a reasonable path to working software.

freedns annoyance

I’ve been using freedns.afraid.org for years for dynamic DNS, to access my home machines on dynamic IP addresses. Nice little service, valuable but not quite valuable enough to be paying money for.

Or so I thought. They have a thing where my domain name gets marked “dormant” and they change the DNS records to point to a static advertising site. I have to log in again to say “no I’m really using this still”, then wait for the one hour TTL expiration for things to work again. Ugh. Shades of DynDNS, the slow decline from being a useful free service to being crap.

I get it, they want to make money on me, OK. But their business goals are so modest it seems odd to inconvenience 2,500,000 free users in hopes of signing up 1000. Verification is useful; maybe they need to somehow clear out old unused names. Just wish there were some better way than “we broke your name good luck!”. Like a reminder email or something. The news page from 4 years ago says he’s working on a graceful notification. Guess that didn’t work out. (They seem to have been sending emails at one time but I’ve never seen one.)

The other DNS host I use is Hover, as part of their domain registry. They still don’t support any dynamic IP option.

Machine Learning study

My old colleague Karl Rosaen is taking time off from working to study Machine Learning. He’s meticulously documenting his curriculum and day to day work, good resources there.

One big part of Karl’s studying is Raschka’s book Python Machine Learning. It looks like a nicely practical book; I was looking for something like that last year and it wasn’t quite out yet. His example code is online. And his blog post about writing it is nice.

My biggest regret from my machine learning work is not applying it to real problems. Karl mentions that the Kaggle competitions are a good set of problems and datasets to work on.

wordpress.com code bug

I like posting source code on my blog. WordPress.com has a bug where it re-escapes my code every time I edit it. I am creating this blog post to file a support ticket, because I’m now a paying customer.

The code I want to post is the Python code

int("1") & 2

I.e., take the string "1", cast it to an int, then bitwise AND it with the integer 2. I chose this because both the " and & may get HTML escaped.

I’m following these instructions for posting code. Note they don’t tell you which editor mode to do this in: Visual or HTML mode. I’m going to do both.

Both of these displayed correctly the first time. But the moment I edited the post again and saved it, it escaped the " to be &quot; and the & to be &amp;. It will keep adding more escaping every time I edit it. It’s very annoying.

Visual mode


int(&amp;quot;1&amp;quot;) &amp;amp; 2

HTML Mode

int(&amp;quot;1&amp;quot;) &amp;amp; 2

Notes

This bug is 4+ years old and yet still with us and seems to have been carried over to WordPress’ new visual editor, Calypso. I suspect they’re doing something naive like re-escaping every single time the editor switches modes. I tried asking wordpress.com front line support and just got a “need to look into this more” reply with a promise of an email followup.
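My guess about repeated re-escaping is easy to act out with Python's html module. To be clear, this only illustrates the suspected mechanism; it is not WordPress' code, and escape_times is a name I made up.

```python
import html

code = 'int("1") & 2'


def escape_times(text, n):
    """Apply HTML escaping n times, as a naive editor might on each save."""
    for _ in range(n):
        text = html.escape(text)
    return text


print(escape_times(code, 1))  # int(&quot;1&quot;) &amp; 2
print(escape_times(code, 2))  # int(&amp;quot;1&amp;quot;) &amp;amp; 2
```

The second line of output is exactly the mess shown in the Visual mode and HTML mode samples above, which is at least consistent with one extra round of escaping per edit.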

Update: got a reply from support saying it was a known issue they were looking into and that “For now the solution is to stick to the HTML editor when writing code in blocks.” Well, that’s clear. I wonder if any of WordPress’ own engineers use WordPress to blog about code? Doesn’t this drive them crazy?

Sending SMTP email: a decade-by-decade approach

I decided it’d be handy to be able to send SMTP email from my server in a datacenter. You know, like we’ve done since 1982? Only in modern times it really sucks to do this because of protections against spam, email spoofing, etc. Also I don’t want some giant network mail thing running and creating security headaches, I just want to be able to send mail off-host.

It turned out to be easy on Ubuntu 14.04. I followed this guide which boils down to “configure Postfix to be an Internet site, then make the daemon only listen on the loopback interface”. The other setup that’s important is a PTR record, so reverse DNS works. I think even without that mail should in theory work, but everyone might assume you’re a spammer.

With that setup mail to me @gmail.com worked and I’m sending email like we did back in the 1980s. But it also got classified as spam, and monkey.org was refusing to talk to me at all. The problem turned out to be that postfix was configured to use the hostname “ubuntu”. I fixed it by using the same FQDN as the PTR record (which also resolves to the same IP). Now both Gmail and monkey.org deliver my mail and it shows up as non-spam.
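The fix amounts to making forward and reverse DNS agree: the IP's PTR name should resolve back to the same IP. Here is a rough way to check that from Python. It is a sketch under assumptions: forward_confirmed is a made-up name, it only handles IPv4, and I don't know exactly what checks Gmail or monkey.org really run.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return (ptr_name, ok): ok is True if the PTR name resolves back to ip.

    The resolver functions are parameters only so the check can be
    exercised without touching real DNS.
    """
    try:
        hostname = reverse(ip)[0]         # PTR lookup: IP -> name
        addresses = forward(hostname)[2]  # A lookup: name -> list of IPs
    except OSError:                       # covers socket.herror and gaierror
        return None, False
    return hostname, ip in addresses
```

Calling forward_confirmed with the server's public IP should return the same FQDN that Postfix is configured to announce, along with True.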

So now we’re up to 1990s email. Still, Gmail complained the mail it got was unencrypted. That was fixed by enabling “smtp_tls_security_level = may”. No idea why that’s not the default; the Postfix docs warn “You also turn on thousands and thousands of lines of OpenSSL library code. Assuming that OpenSSL is written as carefully as Wietse’s own code, every 1000 lines introduce one additional bug into Postfix.” Which is a bit snide but fair enough given the OpenSSL history.

And now we’re up to 2000s email. Our modern era is much more complicated, with SPF and DKIM and other half-assed DNS based solutions to making email a bit more authenticated, but not really fixing it entirely. Those measures don’t yet seem mandatory, at least for low volume email, so I don’t yet have 2010s email configured.

One thing I left unsolved: the From: address. Mail is showing up as being from “nelson@wk.somebits.com”, which is an address you can’t deliver to. I’m OK with that, but I could fix it by spoofing the email, changing the Postfix from header, and/or adding MX records to enable mail for somebits.com.
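The "spoofing" option can be done entirely in the sending program: build the message with whatever From: you like and hand it to the Postfix listening on loopback. A minimal sketch with Python's standard library; the function names and the example.com addresses are placeholders, and whether receivers accept the result still depends on things like SPF.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text message with an explicit From: header."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_local_postfix(msg, host="localhost", port=25):
    """Hand the message to the mail daemon on the loopback interface."""
    with smtplib.SMTP(host, port) as smtp:
        # send_message takes the envelope sender and recipients from the headers
        smtp.send_message(msg)
```

Typical use would be send_via_local_postfix(build_message("me@example.com", "you@example.org", "hello", "test body")).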

(The other option to all this falderal is to give up on SMTP and mailer daemons entirely and just use a proprietary mail API. Gmail has something for sending mail based on OAuth, and Amazon has SES.)