That's About Right

Howard Butler sums up my motivations pretty well. The one factor he doesn't mention: community. There's a gap between the GDAL/OGR/MapServer/OSGeo and Python communities, which I jumped a while ago. Check out the list of GIS projects in the Python community's package index: Topic :: Scientific/Engineering :: GIS. There's not one OSGeo package there now; Howard's Python bindings for GDAL 1.5 will be the first. The GIS community is an inward-looking one, and the open source side of it looks outward only fractionally more.

Update: It's also true that the GDAL community's bindings aren't dead yet. As I pointed out previously, the new style bindings (to be standard in 1.5) are so much faster than the original bindings that everybody's a winner whether they switch to WorldMill or not.

WorldMill

Howard Butler pointed out that "Ogre" is already taken by a gaming graphics library, and "Refinery" seemed to imply that OGR is raw and crude. I've wanted a name that resonated with the sounds and smells of a machine shop. "Mill" is just about right.

Now, the new benchmark times. The script is now in the Subversion repository: benchmark.py. I've added to the ogr.py block code equivalent to the extra (and extra handy) baggage carried by WorldMill: the marshaling of Python collection (OGR layer) schema and feature data. This stuff is compiled in WorldMill and interpreted with ogr.py, and the performance difference is obvious:

sean@lenny:~/Projects/WorldMill$ PYTHONPATH=src python benchmark.py
WorldMill (Cython)
8633.75 usec/pass

ogr.py (new bindings)
22740.06 usec/pass

Enough blogging, here it is: WorldMill 0.1.

Update: the new benchmarks involve a 50 record shapefile that's distributed with WorldMill, hence the timing differences from the previous benchmarks.

OGR, Ctypes, and Cython, Again

My previous benchmarks were made using the original recipe ogr.py module from GDAL 1.3.2. I saw some numbers that compelled me to try the new Python bindings from GDAL 1.4.3. Here are the new results (same benchmark code) on that same machine:

sean@lenny:~/Projects$ python ogre_benchmark.py
Refinery (with ctypes)
4994133.40 usec/pass

Ogre (Cython/Pyrex)
593522.91 usec/pass

ogr.py (next generation bindings)
594103.50 usec/pass

Howard Butler's new bindings for 1.4.3 are about 15 times faster at this task than the old bindings for 1.3.2. I had no idea performance had improved so much. That's great news for people who want to keep using programs based on ogr.py. You'll also notice that Ogre itself is 15% faster today, thanks to conversion of some Python statements to C. I'm starting to know my way around Pyrex and Cython and expect I can squeeze out yet more performance.

Ogre is as quick as the latest and greatest ogr.py, and does more work at the same price. For each collection iteration, Ogre marshals the OGR layer definition into a collection schema, and marshals every OGR feature into a GeoJSON-like data structure using the collection's object hook. Even with that overhead, Ogre is as fast as ogr.py. Remove the extras and it compares to ogr.py like this:

sean@lenny:~/Projects$ python ogre_benchmark.py
Ogre (no auto schema, no object hook)
443894.31 usec/pass

ogr.py (next generation bindings)
588154.32 usec/pass

That's 25% faster. Not bad at all, considering that better performance was never my motivation.
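To make those "extras" concrete, here's a minimal sketch of the kind of GeoJSON-like mapping that the marshaling step produces. The function and key names are my illustration, not Ogre's actual internals:

```python
# Hypothetical sketch of marshaling an OGR-style feature into a
# GeoJSON-like mapping; key names are assumptions, not Ogre's API.
def marshal_feature(fid, geom_wkb, properties):
    """Bundle a feature id, geometry, and attributes into one dict."""
    return {
        "id": fid,
        "geometry": geom_wkb,
        "properties": dict(properties),
    }

feature = marshal_feature(0, None, {"CNTRY_NAME": "Aruba"})
```

Doing this in interpreted Python for every feature is exactly the overhead that Cython compiles away.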

OGR, Ctypes, Cython

Leave now if you're not into Python extension programming and performance benchmarking: we're headed deep into the weeds.

Iterating over the features of a collection (or layer, if you will) and operating on them is one of a geo-processor's most fundamental patterns. We exercise this pattern over and over, and one of my design goals for Refinery is to do it more easily and more efficiently than it can be done with ogr.py. Recent correspondence with a Cython user convinced me that I should give it a try as a potentially faster way to wrap libgdal. Cython compiles Pyrex code (which is more or less Python) into C, which can then be compiled into a Python extension.

Last night I rewrote Refinery using Cython instead of ctypes and just finished getting the original tests to pass this morning. Is it faster? To find out, I benchmarked ogr.py, Refinery, and the new code I'm temporarily calling Ogre ("OGR, easier") using the timeit module. The Timer executes a block of statements in a manner that avoids the most common timing errors:

t = timeit.Timer(stmt=STATEMENTS)
print "%.2f usec/pass" % (1000000 * t.timeit(number=10)/10)
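For completeness, here's the same harness made self-contained with a trivial stand-in statement; the statement timed here is a placeholder, not the actual benchmark body:

```python
# Self-contained version of the timing harness. The statement being
# timed is a stand-in for the real iteration code.
import timeit

STATEMENTS = "sum(range(100))"

t = timeit.Timer(stmt=STATEMENTS)
usec = 1000000 * t.timeit(number=10) / 10
print("%.2f usec/pass" % usec)
```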

The Refinery and Ogre statements are basically the same:

w = ogre.workspace(PATH)
c = w[NAME]
for f in c:
    id = f.id

PATH is the filesystem path to a shapefile with 3784 features, NAME is the name of the data collection (or layer), and all I'm doing inside the iteration block is dereferencing the feature's id. The equivalent block of statements using ogr.py is:

source = ogr.Open(PATH)
layer = source.GetLayerByName(NAME)
layer.ResetReading()
while 1:
    feature = layer.GetNextFeature()
    if not feature:
        break
    id = feature.GetFID()
    feature.Destroy()
source.Destroy()

And now, the results (on a single Celeron processor machine that runs the Pystone benchmark in 2.71 seconds):

Refinery (with ctypes)
4972288.61 usec/pass

Ogre (Cython/Pyrex)
682301.81 usec/pass

ogr.py (old-school bindings)
9183181.72 usec/pass

Ogre takes 180 microseconds per feature, about 7 times faster than Refinery, and about 13 times faster than ogr.py. I didn't benchmark the gdal module from GeoDjango because I couldn't see anything that would make its performance much different from Refinery's. I've heard that the QGIS community also has Python bindings for OGR in the works, but I'm not set up to build or benchmark those.
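That per-feature figure is just the Ogre pass time divided by the 3784 features:

```python
# Per-feature cost: total usec per pass divided by the feature count.
ogre_usec = 682301.81
features = 3784
per_feature = ogre_usec / features
print("%.0f usec/feature" % per_feature)  # about 180
```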

Cython makes for a faster module, but there is a price. It has to be compiled against the GDAL lib, which is a pain in the neck on Windows. It's non-standard, unlike ctypes, which means one more package to deploy before you can even start to build it. And while developing in the Pyrex language beats writing an extension in C (I've been there), it's still not Python. I've become dependent on the "import pdb; pdb.set_trace()" trick, and it's not an option with Cython. Still, the higher performance is rather enticing.

Yes, the cost of processing a feature could easily overshadow the per-feature cost of iteration, but here again there's potential for Cython to help in combination with Refinery or Ogre. The collection iterators have an object hook to which one could bind functions from compiled extension modules.
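Here's a sketch of the object-hook idea in pure Python, assuming the hook is simply a callable applied to each feature as the collection iterates; the names and signature are assumptions, not the actual API:

```python
# Hypothetical object-hook pattern: the iterator applies a callable to
# every marshaled feature. A compiled (Cython) function could be bound
# in place of this pure-Python one for speed.
def flag_hook(feature):
    """Example hook: tag each feature as it passes through."""
    feature["properties"]["flagged"] = True
    return feature

def iterate(features, hook=None):
    """Yield each feature, passed through the hook if one is bound."""
    for f in features:
        yield hook(f) if hook else f

raw = [{"id": 0, "properties": {}}]
processed = list(iterate(raw, hook=flag_hook))
```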

Curious? The code is here: Ogre-0.0.tar.gz.

Transactional Refinery

Talk about serendipity: repoze.tm. I was just thinking about data transactions for Refinery, and now there's a data manager interface I can implement. It's hard to keep up with the rush of fresh ideas coming out of Repoze.

WSGI aside, I think this would be a smooth transactional API:

>>> w = refinery.workspace('/tmp/data')
>>> collection = w['places']
>>> transaction = collection.begin()
>>> collection.features.append(feature)
>>> transaction.commit() # write to disk

Use an OGR in-memory feature store or new Python structure for uncommitted features? That I don't yet know.
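One way to sketch that API in pure Python is to buffer appended features until commit. Everything below is hypothetical, not Refinery code:

```python
# Toy sketch of the proposed transactional collection: appends go to a
# pending buffer and only become "committed" on transaction.commit().
# All class and attribute names are assumptions.
class Transaction(object):
    def __init__(self, collection):
        self.collection = collection
        self.pending = []

    def commit(self):
        # Stand-in for writing through OGR to disk.
        self.collection.committed.extend(self.pending)
        self.pending = []

    def abort(self):
        # Discard uncommitted features.
        self.pending = []

class Collection(object):
    def __init__(self):
        self.committed = []   # features already "on disk"
        self.features = []    # rebound to the active transaction's buffer

    def begin(self):
        t = Transaction(self)
        self.features = t.pending
        return t

collection = Collection()
transaction = collection.begin()
collection.features.append({"id": "places/1"})
transaction.commit()
```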

PleiadesGeocoder 1.0b1

Here it is, the first 1.0 beta release of the ultimate content geo-annotation plugin for Plone 2.5: PleiadesGeocoder-1.0b1.

It's deployed on the Pleiades site (see links on the project wiki), and also at telascience.org: append 'georss' or 'kml' to any of the channel URLs to get an alternative representation of those aggregated feeds. Earlier versions of this software were inspired by the simple profile of GeoRSS, but 1.0 is aligned with GeoJSON. You can see it in the primary interface:

class IGeoItemSimple(Interface):

    """A simple georeferenced object, analogous to a GeoRSS entry, a KML
    placemark, or a GeoJSON feature.

    The spatial reference system is implicitly lat/long WGS84.
    """

    geom_type = Attribute(
        """Name of geometry type: 'Point', 'LineString', or 'Polygon'"""
        )

    coords = Attribute(
        """GeoJSON-style coordinates tuple(s) with long, lat ordering."""
        )

    __geo_interface__ = Attribute(
        """A mapping that provides Python geo interface. See
        http://trac.gispython.org/projects/PCL/wiki/PythonGeoInterface.
        """
        )

    def setGeoInterface(geom_type, coords):
        """Set type and coordinates, following the geo interface spec."""

    def isGeoreferenced():
        """Returns True if the item is "on the map", meaning that the item
        has a geometry type and coordinates."""

PleiadesGeocoder allows annotation of content types derived from Archetypes right out of the box. Plone developers who know their way around Zope and Five should find it easy to adapt non-AT content.
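For illustration, a plain-Python object satisfying the IGeoItemSimple contract might look like this (no Zope machinery; the class is hypothetical):

```python
# Toy implementation of the IGeoItemSimple contract, without
# zope.interface. Coordinates follow the GeoJSON long, lat ordering.
class GeoItem(object):
    def __init__(self):
        self.geom_type = None
        self.coords = None

    @property
    def __geo_interface__(self):
        return {"type": self.geom_type, "coordinates": self.coords}

    def setGeoInterface(self, geom_type, coords):
        self.geom_type = geom_type
        self.coords = coords

    def isGeoreferenced(self):
        # "On the map" means having both a geometry type and coordinates.
        return self.geom_type is not None and self.coords is not None

item = GeoItem()
item.setGeoInterface("Point", (-105.08, 40.59))
```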

Taming the OGR

This evening I made a prototype of a smoother, simpler interface to the industrial-strength vector data functions in libgdal. The new entities in Refinery are workspaces, collections, and features. Workspace.collections is a dict of Collection objects, which are approximately equal to OGR layers. Collection.features is an iterator over GeoJSON-style objects:

>>> workspace = refinery.workspace('/var/gis/data/world')
>>> workspace.collections.keys()
['world_borders']
>>> borders = workspace.collections['world_borders']
>>> for feature in borders.features:
...     print feature.id, feature.properties['CNTRY_NAME']
world_borders/0 Aruba
(etc)

Use of the refinery package reduces the size of my canonical matplotlib script from 17 to 12 lines:

import pylab
from numpy import asarray
import refinery
from shapely.wkb import loads

workspace = refinery.workspace('/var/gis/data/world')
borders = workspace.collections['world_borders']

fig = pylab.figure(1, figsize=(4,2), dpi=300)

for feature in borders.features:
    shape = loads(feature.geometry)
    a = asarray(shape.exterior)
    pylab.plot(a[:,0], a[:,1])

pylab.show()

That's starting to feel civilized.

This started off as recreational programming, but I think it has some promise. Get the code [Refinery-0.0.tar.gz] if you're curious and make a trial workspace from your own data. Next: a look at the GeoDjango GDAL wrappers to see what Justin and I are doing differently.

Comments

Re: Taming the OGR

Author: Matt Perry

Very cool. Some nice syntactic sugar on top of the unpythonic (but useful) OGR API. Out of curiosity, why "collections" rather than the more standard terminology of "layers"?

Re: Taming the OGR

Author: Sean

Thanks, Matt. My other favorite feature is no dependence on SWIG. This uses the Python ctypes module, which is standard in 2.5. The term 'collection' is inspired by WFS, AtomPub, and GeoJSON. Is 'layer' really standard outside the GDAL community?

Re: Taming the OGR

Author: Matt Perry

Ah! I assumed it was a thin layer on top of the OGR SWIG bindings. Ctypes looks like the future of FFI and it's nice to see it being put to such good use! Layer is pretty much the universal word across all GIS camps for a group of vector features. This is the first I've ever heard of "collections" in reference to spatial data (apparently I don't delve into the web-based world too often these days ;-).

Re: Taming the OGR

Author: Paul Ramsey

I think "Collection" is a better term for a set of features. "FeatureCollection" is somewhat more explicit. A "Layer" is a styled "FeatureCollection", it's the thing you see in your map, it's not the data itself, and one "FeatureCollection" can be the basis for multiple "Layers".

Re: Taming the OGR

Author: Matt Perry

I like that. A workspace contains collections of features. Each collection can be paired with one or more styles to create layers. The only reason I'm harping on this is that anyone with a traditional GIS background will be flummoxed by the "collection" terminology at first. The last thing we need in the geospatial world is more confusion over vocabulary ... I'm still dealing with the whole "coverage" debacle!