Finding files with weird encodings

I’m working on (finally) converting my Blosxom blog to something modern, probably Hugo. One problem I have is that I was bad and wrote blog posts in different encodings. (These probably render incorrectly on my blog nowadays.) Here’s a Linux command line to find all files that are not ASCII or UTF-8:

find . -type f -print0 |
  xargs -0 file -i |
  egrep -v '(ascii|utf-8)'

The heavy lifting is done by file -i, which prints each file’s MIME type along with a guessed charset. The -type f keeps find from handing directories to file, which would otherwise show up as inode/directory noise in the output.
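To see what file -i reports, here’s a quick demonstration on two throwaway test files (the filenames are made up for illustration):

```shell
# é as two UTF-8 bytes (0xC3 0xA9) vs. one Latin-1/CP1252 byte (0xE9)
printf 'caf\xc3\xa9\n' > sample-utf8.txt
printf 'caf\xe9\n'     > sample-latin1.txt
file -i sample-utf8.txt sample-latin1.txt
```

On my machine this prints something like text/plain; charset=utf-8 for the first file and text/plain; charset=iso-8859-1 for the second, which is exactly the pattern the egrep above filters on.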

I thought all my files were ASCII, UTF-8, or ISO-Latin-1. But in fact a few are CP1252, thanks to Windows “smart quotes”. file doesn’t seem to identify CP1252; those files show up as charset=iso-8859-1. That’s because CP1252 is essentially a superset of ISO-Latin-1: it assigns printable characters (like the smart quotes) to the 0x80–0x9F range that Latin-1 reserves for control codes. So in my Python code I’m just going to treat every file that isn’t ASCII or UTF-8 as CP1252.
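A minimal sketch of that decoding strategy in Python (read_post is a hypothetical helper name, not anything from Blosxom or Hugo):

```python
def read_post(path):
    """Read a blog post, trying UTF-8 first and falling back to CP1252."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        # ASCII is a subset of UTF-8, so one attempt covers both.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Everything else gets treated as CP1252, which matches Latin-1
        # everywhere Latin-1 has printable characters and additionally
        # covers the Windows smart quotes in 0x80-0x9F.
        return data.decode("cp1252")
```

This works because valid UTF-8 is very unlikely to be produced by accident, so a successful UTF-8 decode is a strong signal; anything that fails gets the CP1252 treatment.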