Barrington Atlas Feature IDs and Unicode Normalization

Last week I was at NYU's Institute for the Study of the Ancient World to plan our Concordia project, an effort to interlink projects like Pleiades, IRCyr, the ANS database, and APIS, and to build a traversable, searchable network of data based on Web architecture. I'll be blogging more about this throughout the year; expect my blog to intersect more with the Mapufacture and FortiusOne blogs if they continue on course. Some of it may even interest mainstream GIS designers and developers. ESRI's implicit embrace of Web architecture (along with its explicit embrace of Google) was one of the big stories out of Where 2.0, after all.

One thing that cropped up in our sessions was the need of other Ancient World projects to be able to refer to Barrington Atlas features in Pleiades by URIs derived from their atlas labels. Tom came up with a template for these URIs:

http://pleiades.stoa.org/batlas/{label-normalized}-{map}-{grid}

and simple rules for normalization that just so happen to be already implemented in plone.i18n. We've forked plone.i18n (in a friendly way: forking is now cool thanks to git, right?) and removed the Zope utilities and all dependency on the Zope component architecture. The result is pleiades.normalizer. It reduces Barrington Atlas labels, which may contain annotations and non-ASCII characters, to ASCII strings suitable for use in the URI template:

>>> from pleiades.normalizer import normalizer

>>> list(normalizer.normalizeN(u'Tetrapyrgia'))
['tetrapyrgia']

>>> list(normalizer.normalizeN(u'Timeles fl. '))
['timeles-fl']

>>> list(normalizer.normalizeN(u'*Tyinda'))
['tyinda']

>>> list(normalizer.normalizeN(u'[Agrai]'))
['agrai']

>>> list(normalizer.normalizeN(u'Kalaba(n)tia'))
['kalabantia']

>>> list(normalizer.normalizeN(u'Tripolis ad Maeandrum/Apollonia ad Maeandrum/Antoniopolis'))
['tripolis-ad-maeandrum', 'apollonia-ad-maeandrum', 'antoniopolis']

>>> list(normalizer.normalizeN(unicode('Ağva', 'utf-8')))
['agva']

>>> list(normalizer.normalizeN(unicode('Çaykenarı', 'utf-8')))
['caykenari']
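To make the connection back to the URI template concrete, here's a rough sketch of how normalized labels could be slotted into it. This is not part of pleiades.normalizer: the batlas_uris function is hypothetical, the template placeholder is renamed to {label} to keep Python's string formatting happy, and the map and grid values are made-up placeholders rather than real atlas references. I'm also assuming that a label with several names yields one URI per name.

```python
# Sketch only: fills the URI template from this post with already-normalized
# labels. The function name and the map/grid values are illustrative.
BATLAS_TEMPLATE = "http://pleiades.stoa.org/batlas/{label}-{map}-{grid}"


def batlas_uris(normalized_labels, map_id, grid):
    """Return one URI per normalized label for a given map and grid square."""
    return [
        BATLAS_TEMPLATE.format(label=label, map=map_id, grid=grid)
        for label in normalized_labels
    ]


if __name__ == "__main__":
    # "00" and "x0" are placeholders, not real Barrington Atlas coordinates.
    for uri in batlas_uris(["tripolis-ad-maeandrum", "antoniopolis"], "00", "x0"):
        print(uri)
```

How the map and grid parts should be written in a real URI is not settled in this post, so the sketch passes them through untouched.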

The algorithm decomposes non-ASCII characters (Unicode normalization form KD, or NFKD) and then discards whatever is not an ASCII letter or digit, such as the combining breve in this example:

U+011F -> (g, U+0306) -> g

U+0131, the last character in Çaykenarı, is a bit of a troublemaker. Our ASCII "i" is the one carrying the diacritical mark relative to its dotless Latin cousin, the inverse of the usual situation. U+0131 has no decomposition, so NFKD leaves it untouched, and we have to make a special exception for it in the code.
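If you'd like to see the idea without installing anything, here's a minimal sketch using only the standard library's unicodedata module. It is written for Python 3 and is not the pleiades.normalizer code: the names SPECIAL_CASES, ascii_fold, and normalize_label are my own for illustration, and the rules (split on "/", hyphenate on whitespace, drop everything else) are simply inferred from the example output earlier in this post.

```python
import unicodedata

# Assumption: characters with no Unicode decomposition that we still want to
# map onto ASCII. U+0131 is LATIN SMALL LETTER DOTLESS I.
SPECIAL_CASES = {"\u0131": "i"}


def ascii_fold(text):
    """Lowercase text, keeping only ASCII letters, digits, and spaces."""
    kept = []
    for char in text:
        char = SPECIAL_CASES.get(char, char)
        # NFKD splits e.g. U+011F into "g" plus the combining breve U+0306.
        for part in unicodedata.normalize("NFKD", char):
            if ord(part) < 128 and (part.isalnum() or part == " "):
                kept.append(part.lower())
    return "".join(kept)


def normalize_label(label):
    """Return one hyphenated ASCII string per "/"-separated name."""
    return ["-".join(ascii_fold(name).split()) for name in label.split("/")]


if __name__ == "__main__":
    print(normalize_label("Çaykenarı"))     # → ['caykenari']
    print(normalize_label("Timeles fl. "))  # → ['timeles-fl']
```

Handling the dotless i through a lookup table before normalizing keeps the exception in one place; any other character without a decomposition would simply be dropped by this sketch.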

If you're getting started in web development with Python, you might find pleiades.normalizer handy, or at least a starting point for your own normalization code. We won't be publishing it to PyPI, but you can get an egg via:

$ easy_install http://atlantides.org/eggcarton/pleiades.normalizer-0.1.tar.gz