Following on my OGR/Python vs Unicode work, I took a look at what Fiona is doing. Fiona is a nice Pythonic interface to OGR with a fair amount of Cython code to bridge the gap from the OGR C library to Python idioms. Fiona explicitly returns all strings as u'Unicode strings' in both Python 2 and 3 and has code to handle encodings.
Fiona tries to guess encodings. In theory you can override the guess with an encoding='foo' parameter when you open a source, but in practice that seemed to do nothing in my tests. I tried passing in obviously wrong encodings like 'ascii' or 'shift-jis' and didn't see any of the decoding errors I expected.
Below is the repr of the string Cassiopée as it comes through from the Shapefile in Fiona, using Fiona's own encoding guess, alongside what OGR returns for the same data. The Fiona results all appear correct to me. Note that Python 2 renders Unicode string reprs as ASCII using Python \x escaping, hence the \xe9 in the output. That's a single character, the expected Unicode codepoint U+00E9 or é.
Good file (ca-qc-gatineau)
Python2. Fiona: u'Cassiop\xe9e' OGR: b'76 Rue de Cassiop\xc3\xa9e'
Python3. Fiona: 'Cassiopée' OGR: u'Cassiopée'
Bad file (be-flanders)
Python2. Fiona: u'Now\xe9lei' OGR: b'Now\xe9lei'
Python3. Fiona: u'Nowélei' OGR: exception.
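The difference between the two OGR byte strings above can be reproduced without any GIS libraries. This is only a sketch: the byte values are copied from the output above, and the guess that the bad file is Latin-1 (ISO-8859-1) is my assumption based on the lone \xe9 byte, not something the file declares.

```python
# Byte strings as OGR returned them under Python 2 (copied from the output above).
good = b'76 Rue de Cassiop\xc3\xa9e'   # e-acute stored as two bytes: valid UTF-8
bad = b'Now\xe9lei'                     # e-acute stored as one byte: not valid UTF-8

# The good file decodes cleanly as UTF-8 to a single U+00E9 codepoint.
good_text = good.decode('utf-8')

# The bad file fails UTF-8 decoding, which would explain an exception
# in any code path that assumes UTF-8.
try:
    bad.decode('utf-8')
    bad_is_utf8 = True
except UnicodeDecodeError:
    bad_is_utf8 = False

# Assumption: the bad file is Latin-1, where byte 0xE9 maps directly to U+00E9.
bad_text = bad.decode('latin-1')

print(repr(good_text), bad_is_utf8, repr(bad_text))
```

If the Latin-1 assumption holds, both files carry the same character and differ only in how it is stored on disk.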
Fiona seems to be doing the right thing with both my inputs, returning proper Unicode strings in both Python 2 and Python 3. In Python 2 OGR seems to basically not be decoding at all, just returning byte strings that I’m supposed to decode myself. In Python 3 OGR is trying to return Unicode strings but throws an exception on my bad file.
From here, I think our OGR code should explicitly decode byte strings to Unicode in Python 2, and I should file a bug about the exception in Python 3.
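A minimal sketch of what that explicit decoding step might look like. The helper name to_unicode and the fallback order (UTF-8 first, then Latin-1) are hypothetical choices of mine, not anything OGR or Fiona provides.

```python
def to_unicode(value, encodings=('utf-8', 'latin-1')):
    """Hypothetical helper: decode a byte string to Unicode, trying each
    encoding in order. Non-bytes values (numbers, None, text) pass through."""
    if not isinstance(value, bytes):
        return value
    for enc in encodings:
        try:
            return value.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly; fall back to UTF-8 with replacement characters
    # rather than raising, so one bad field does not abort a whole file.
    return value.decode('utf-8', 'replace')


print(repr(to_unicode(b'76 Rue de Cassiop\xc3\xa9e')))
print(repr(to_unicode(b'Now\xe9lei')))
```

Trying UTF-8 first matters because Latin-1 accepts any byte sequence, so it can only be a last resort. It will also silently produce wrong text for data in other single-byte encodings, which is the usual cost of guessing.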
'Extract fields from shapefiles using Fiona, a Unicode test'

import sys, fiona, logging
from pprint import pprint

logging.getLogger().setLevel(logging.DEBUG)
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

# Usage: script.py <shapefile> [encoding]
# The optional second argument overrides Fiona's encoding guess.
enc = sys.argv[2] if len(sys.argv) > 2 else None

with fiona.open(sys.argv[1], 'r', encoding=enc) as source:
    for f in source:
        # Print the type and repr of every property value in the feature
        for field in f['properties'].values():
            sys.stdout.write('%s %r ' % (type(field), field))
        sys.stdout.write('\n')