GNU cut breaks on UTF-8

cut is a very simple shell util; it will take a few characters or fields out of its input. cut -c 10-20 for instance should output characters 10-20 on every line.

Except hahahaha joke’s on you: despite one flag being called --characters and there being a separate --bytes flag, they both actually operate on bytes, no matter what your locale is. This completely breaks UTF-8 input. A multi-byte character is not treated as one character, so cut can split it in the middle and emit invalid UTF-8.
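To make the byte vs. character distinction concrete, here is a minimal Python sketch. It is not how cut is implemented; cut_bytes and cut_chars are hypothetical helper names made up for illustration, and the sample string is my own. The first helper slices the UTF-8 encoding of the line, which is the behavior described above; the second slices code points, which is what you would expect from a flag called --characters.

```python
def cut_bytes(line: str, start: int, end: int) -> str:
    """Keep bytes start..end (1-based, inclusive) of the UTF-8 encoding."""
    raw = line.encode("utf-8")[start - 1:end]
    # A slice that ends inside a multi-byte sequence is invalid UTF-8;
    # errors="replace" makes the damage visible as U+FFFD.
    return raw.decode("utf-8", errors="replace")


def cut_chars(line: str, start: int, end: int) -> str:
    """Keep code points start..end (1-based, inclusive)."""
    return line[start - 1:end]


# "naïve café", written with escapes so the encoding is unambiguous.
line = "na\u00efve caf\u00e9"

print(cut_bytes(line, 1, 3))  # "na" plus U+FFFD: the two-byte ï was split
print(cut_chars(line, 1, 3))  # "naï"
```

On pure ASCII input the two helpers agree, which is presumably why the problem goes unnoticed so often. Note that cut_chars works on code points, not grapheme clusters, so combining sequences could still be split.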

This behavior is not disclosed in the cut --help docs, nor in man cut on my Ubuntu system. It is documented in info cut, the obsolete doc tool rms failed to foist on the world 25+ years ago. There it says:

Select for printing only the characters in positions listed in CHARACTER-LIST. The same as ‘-b’ for now, but internationalization will change that.

There’s a good Stack answer from 2014 explaining all this. I even understand why the GNU tool does this; it’s as much POSIX’s fault as GNU’s. But you’d think in the intervening 9 years they could at least improve the docs.

Here’s a 2006 bug report documenting the same problem. I would not be surprised if the bug goes back to the 90s, predating the adoption of UTF-8 everywhere.

2 thoughts on “GNU cut breaks on UTF-8”

  1. info is by no means obsolete; it is a very nice manual interface, and learning how to use it properly opens up all kinds of knowledge that is already present on your operating system. That goes beyond the manuals for individual tools: the info files for libc in particular are more detailed than anything you find online (and by “find”, I mean the results that Google is spitting out).

  2. I’ve known how to use info properly for, oh, 25 years now? The problem is that the GNU tools alone have a unique documentation system. The docs in info are indeed well written, but they aren’t in the standard Unix place, the man pages. Over the years distributors like Debian/Ubuntu have added their own man pages (generated from --help, I think) which are out of sync with the info pages. That’s not GNU’s fault exactly, but the fact that cut --help is wrong is.

    info was a failed effort by the FSF 25-30 years ago to redesign the Unix command line experience. I wish they’d just acknowledge no one else uses it and move on. Other parts of the GNU UI like readline and --help itself have been huge successes.
