Rtree, Shapely, and WorldMill: Jamming Econo

Update (2008-01-26)

I've added Python 2.5 to the buildout. It adds a minute or two to the build time, but gives you a much more isolated environment in which to jam. I also found a work-around for the issue reported in zc.buildout bugs 110133 and 138260: building WordMill requires Cython in the working set of the custom python, something that cannot be accomplished using zc.recipe.egg. What I've done is used Kai's hexagonit.recipe.download to fetch the Cython source, and then used iw.recipe.cmd to install it into the buildout's custom python. See the cython-src and cython-install sections in buildout.cfg. If you've already fetched Gdawg once, I recommend you discard it and clone a fresh copy.

I just added Shapely 1.0 and Rtree 0.4 to the Gdawg buildout, where they join WorldMill 0.1. Together they create a friendly environment on the C Python platform where you can read GIS feature data, spatially index it, and manipulate its geometries. (Sorry, Windows users are out of luck until the next Rtree and WorldMill releases. Patches are welcome.)

Again, getting Mercurial (Hg) is as easy as

$ apt-get install mercurial

on a Debian/Ubuntu system. Check the downloads page for other installers (Gdawg does build on a Mac). After you've installed Hg, clone my repo, and build it out:

$ hg clone http://sgillies.net/sgillies/hg/gdawg my-gdawg
$ cd my-gdawg
$ python bootstrap.py
$ ./bin/buildout

It could take up to 15 minutes to build GEOS and GDAL. In the meanwhile, grab some data to play with. I downloaded the Zillow Colorado neighborhoods and extracted them into /tmp to see if they lived up to the hype. When the buildout script finishes, start up the custom Python interpreter

$ ./bin/gdawgpy

I need to make a funny preamble someday, but there's the Python prompt. To begin, let's create a WorldMill workspace:

>>> from mill import workspace
>>> ws = workspace('/tmp/ZillowNeighborhoods-CO.zip_FILES')
>>> ws
<mill.workspace.Workspace object at ...>

Which allows me to use the only Italian I haven't yet forgotten: va bene. A workspace is a mapping of collections, as you can see here:

>>> ws.items()
[('ZillowNeighborhoods-CO', <mill.collection.Collection object at ...>)]

Access the neighborhoods collection and inspect it briefly:

>>> co = ws['ZillowNeighborhoods-CO']
>>> co
<mill.collection.Collection object at ...>
>>> len(co)
>>> co.schema
[('STATE', 4), ('COUNTY', 4), ('CITY', 4), ('NAME', 4), ('REGIONID', 2)]

95 neighborhoods (of note) in Colorado, 5 attributes per feature, all of them strings except for REGIONID, which is an int. Let's look now at the first neighborhood feature:

>>> x = co['0']
>>> x
<mill.feature.Feature object at ...>
>>> x.id
>>> x.properties['NAME']
>>> x.properties['CITY']

Hmm, they misspelled "Shelbyville". If you were to access x.geometry at the prompt, you'd get a small binary flood. By default, unless an object hook has been specified, feature geometry is expressed as WKB (Long/Lat). Let's now set an object hook for this geometry so that we get features with Shapely geometries:

>>> from mill.feature import Feature
>>> from shapely.wkb import loads
>>> def shapely_feature(id, properties, wkb):
...     return Feature(id, properties.copy(), loads(wkb))
>>> co.object_hook = shapely_feature

The only requirement on the object hook is that it be a callable with 3 positional parameters. Now, get the first feature again:

>>> x = co['0']
>>> x.id
>>> x.geometry
<shapely.geometry.polygon.Polygon object at ...>
>>> x.geometry.bounds
(-105.26320656676199, 40.010885702215496, -105.243224298828, 40.038423410870699)

And that's how you integrate Shapely and WorldMill. Now, how about spatially indexing the neighborhoods using Rtree? First, create a named index that will be persisted on disk next to the shapefile data:

>>> from rtree import Rtree
>>> index = Rtree('/tmp/ZillowNeighborhoods-CO.zip_FILES/ZillowNeighborhoods-CO')

Then iterate over features in the collection, adding each to the index in turn:

>>> for feature in co.all:
...     index.add(int(feature.id), feature.geometry.bounds)

Pretty fast, eh? Now let's find some of the "Crossroads" feature's neighbors by putting its bounding box back into an intersection query:

>>> b = co['0'].geometry.bounds
>>> index.intersection(b)
[42L, 32L, 0L, 49L, 15L, 64L, 66L]

The index returns Python longs, but we can get the corresponding features from the collection like so:

>>> neighborhoods = [co[str(uid)] for uid in index.intersection(b)]
>>> from pprint import pprint
>>> names = [n.properties['NAME'] for n in neighborhoods]
>>> pprint(names)
['North Boulder',
 'Colorado University',
 'Southeast Boulder',
 'East Boulder',
 'Palo Park',
 'Central Boulder']

Incidentally, there don't appear to be any Fort Collins neighborhoods:

>>> index.intersection((-105.09, 40.58, -105.08, 40.59))
>>> [f.id for f in co.all if f.properties['COUNTY'] == 'Larimer']

which is fine because we don't really need anybody else moving here unless they are going to open a nice old world bakery or cheese shop downtown.

If I may say so, Shapely, Rtree, and WorldMill are just about the best trio since D. Boon, George Hurley, and Mike Watt.