OGR/Python vs. Unicode

I’ve been trying to figure out how OGR deals with Unicode, particularly in shapefiles, to solve this issue with OpenAddress conversion. I haven’t found clear docs. There’s something about a SHAPE_ENCODING environment variable and something else about OGR trying to guess. More details in that linked GitHub issue. Anyway, here’s what I discovered.

Good file

I have a working non-ASCII shapefile that seems to be in UTF-8. It’s ca-qc-gatineau for us, stored at http://gatineau.ca/donneesouvertes/telechargement/ADRESSE.zip, and has street names like Rue de Cassiopée that our code handles correctly.

When OGR/Python opens the file, the capability OLCStringsAsUTF8 is set to True.

In Python 2, GetField() returns Python strings of type str, which means “bytes”. Those strings appear to be UTF-8 encoded bytes; the repr() of that non-ASCII street name is ‘76 Rue de Cassiop\xc3\xa9e’. Basically OGR isn’t trying to handle Unicode at all for us, or rather if it is, it’s returning UTF-8 encoded byte strings and all is well.

In Python 3, GetField() is also returning Python strings of type ‘str’, which now means ‘unicode’. Their repr in Py3 is a Unicode string, ‘Rue de Cassiopée’, which makes sense.
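The relationship between those two representations can be checked in plain Python, without OGR. This is only a sketch: the byte string is copied from the repr above, and the variable names are mine.

```python
# Byte string as seen in Python 2, written as a bytes literal so it
# means the same thing under Python 2 and Python 3.
raw = b'76 Rue de Cassiop\xc3\xa9e'

# Decoding as UTF-8 gives the Unicode text that Python 3 shows.
text = raw.decode('utf-8')

# The accented letter takes two bytes but is a single character.
print('%d bytes, %d characters' % (len(raw), len(text)))

# Encoding the text again round-trips to the original bytes.
assert text.encode('utf-8') == raw
```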

In both cases OGR/Python is doing the right thing, or at least something consistent and sensible.

Bad file

I have a non-ASCII shapefile that seems to be in Latin 1. It’s be-flanders for us, stored at https://downloadagiv.blob.core.windows.net/crab-adressenlijst/Shapefile/CRAB_Adressenlijst.zip. It has street names like ‘Nowélei (Jean Baptiste)’ although that é isn’t coming through right.

When OGR/Python opens the file, the capability OLCStringsAsUTF8 is set to False.

In Python 2, GetField still returns ‘str’. And the repr for that street is ‘Now\xe9lei (Jean Baptiste)’. That’s actually not awful; if I know to expect that behavior, I can decode that myself with ISO-8859-1 and get the Unicode string.
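A minimal sketch of that manual decode, assuming the field really is ISO-8859-1 (my guess from the bytes, not something OGR reports). The byte string is copied from the repr above.

```python
# Byte string as returned by GetField() in Python 2 for the be-flanders file.
raw = b'Now\xe9lei (Jean Baptiste)'

# Assumption: the shapefile's attribute data is ISO-8859-1 (Latin 1).
street = raw.decode('iso-8859-1')

# 0xe9 is the Latin 1 code for e-acute, so the name comes out intact.
assert street == u'Now\xe9lei (Jean Baptiste)'
```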

In Python 3, GetField() throws an exception.

Traceback (most recent call last):
  File "/home/nelson/src/oa/shpenc.py", line 17, in <module>
    field = in_feature.GetField(i)
  File "/usr/lib/python3/dist-packages/osgeo/ogr.py", line 3033, in GetField
    return self.GetFieldAsString(fld_index)
  File "/usr/lib/python3/dist-packages/osgeo/ogr.py", line 2362, in GetFieldAsString
    return _ogr.Feature_GetFieldAsString(self, *args)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte

I don’t quite understand what is happening here, but I guess something inside the OGR code is trying to decode that byte sequence as UTF-8 and failing. Note that my code is barely in the stack trace; I’m just asking for the field contents. I think this is an OGR bug, or else the shapefile is corrupt. Setting SHAPE_ENCODING before invoking my code doesn’t help.
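The same error can be reproduced in plain Python with no OGR involved, which is consistent with the guess that the Latin 1 bytes are being decoded as UTF-8 somewhere in the bindings. This sketch only shows the decoding failure itself; it does not prove where in OGR it happens.

```python
# The Latin 1 bytes for the be-flanders street name.
raw = b'Now\xe9lei (Jean Baptiste)'

caught = None
try:
    # 0xe9 starts a three-byte UTF-8 sequence, but the next byte is a
    # plain ASCII 'l', so a strict UTF-8 decode has to fail here.
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    caught = err

# Report which byte offset the decoder gave up on, and why.
print('failed at byte %d: %s' % (caught.start, caught.reason))
```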

Worth noting QGIS displays the string as Nowlei, skipping the non-ascii character.

Summary

It appears that OGR tries to guess and cope with the source encoding, presenting UTF-8 byte strings to Python 2 code and Unicode strings to Python 3 code. Which is great! Things only go wrong with the be-flanders file: OGR presents an ISO-8859-1 byte string to Python 2, which you can cope with, but in Python 3 GetField() raises a UnicodeDecodeError.
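For the Python 2 case, one way to cope with both files is to try UTF-8 first and fall back to ISO-8859-1. The helper below is hypothetical (decode_field is my name, not an OGR function), and the fallback encoding is an assumption that happens to fit these two files. It is no help in Python 3, where the exception is raised inside GetField() before any bytes reach my code.

```python
def decode_field(raw, fallback='iso-8859-1'):
    """Hypothetical helper, not an OGR API: byte string in, Unicode text out."""
    try:
        # Valid UTF-8 (the Gatineau case) decodes cleanly.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin 1.
        return raw.decode(fallback)


good = decode_field(b'76 Rue de Cassiop\xc3\xa9e')
bad = decode_field(b'Now\xe9lei (Jean Baptiste)')
```

Trying UTF-8 first is the safer order, since ISO-8859-1 accepts any byte sequence and would silently mangle real UTF-8 data.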

Test code

Here’s the little OGR/Python program I’m using to examine shapefiles. It runs in both Python 2 and Python 3.

import sys
from osgeo import ogr
ogr.UseExceptions()

# Open the shapefile named on the command line, read-only (0).
in_datasource = ogr.Open(sys.argv[1], 0)
in_layer = in_datasource.GetLayer()
in_layer_defn = in_layer.GetLayerDefn()

# Does the layer claim to hand back strings as UTF-8?
print('OLCStringsAsUTF8? %r' % in_layer.TestCapability(ogr.OLCStringsAsUTF8))

in_feature = in_layer.GetNextFeature()
while in_feature:
    # Print the Python type and repr of every field in this feature.
    for i in range(0, in_layer_defn.GetFieldCount()):
        field = in_feature.GetField(i)
        sys.stdout.write('%s %r ' % (type(field), field))
    sys.stdout.write('\n')
    # Advance once per feature, after all of its fields are printed.
    in_feature = in_layer.GetNextFeature()