OGR, Ctypes, Cython

Leave now if you're not into Python extension programming and performance benchmarking: we're headed deep into the weeds.

Iterating over the features of a collection (or layer, if you will) and operating on them is one of a geo-processor's most fundamental patterns. We exercise this pattern over and over, and one of my design goals for Refinery is to do it more easily and more efficiently than it can be done with ogr.py. Recent correspondence with a Cython user convinced me that I should give it a try as a potentially faster way to wrap libgdal. Cython compiles Pyrex code (which is Python more or less) into C, which can then be compiled into a Python extension.

Last night I rewrote Refinery using Cython instead of ctypes and just finished getting the original tests to pass this morning. Is it faster? To find out, I benchmarked ogr.py, Refinery, and the new code I'm temporarily calling Ogre ("OGR, easier") using the timeit module. The Timer executes a block of statements in a manner that avoids the most common timing errors:

t = timeit.Timer(stmt=STATEMENTS)
print "%.2f usec/pass" % (1000000 * t.timeit(number=10)/10)

The Refinery and Ogre statements are basically the same:

w = ogre.workspace(PATH)
c = w[NAME]
for f in c:
    id = f.id

PATH is the filesystem path to a shapefile with 3784 features, NAME is the name of the data collection (or layer), and all I'm doing inside the iteration block is dereferencing the feature's id. The equivalent block of statements using ogr.py is:

source = ogr.Open(PATH)
layer = source.GetLayerByName(NAME)
layer.ResetReading()
while 1:
    feature = layer.GetNextFeature()
    if not feature:
        break
    id = feature.GetFID()
    feature.Destroy()
source.Destroy()

And now, the results (on a single Celeron processor machine that runs the Pystone benchmarks in 2.71):

Refinery (with ctypes)
4972288.61 usec/pass

Ogre (Cython/Pyrex)
682301.81 usec/pass

ogr.py (old-school bindings)
9183181.72 usec/pass

Ogre takes 180 microseconds per feature, about 7 times faster than Refinery, and about 13 times faster than ogr.py. I didn't benchmark the gdal module from the GeoDjango because I couldn't see anything that would make its performance much different from that of Refinery. I've heard that the QGIS community also has Python bindings for OGR in the works, but I'm not set up to build or benchmark those.

Cython makes for a faster module, but there is a price. It has to be compiled against the GDAL lib, which is a pain in the neck on Windows. It's non-standard, unlike ctypes, which means one more package to deploy before you can even start to build it. And while developing in the Pyrex language beats writing an extension in C (I've been there), it's still not Python. I've become dependent on the "import pdb; pdb.set_trace()" trick, and it's not an option with Cython. Still, the higher performance is rather enticing.

Yes, the cost of processing a feature could easily overshadow the per-feature cost of iteration, but here again there's potential for Cython to help in combination with Refinery or Ogre. The collection iterators have an object hook to which one could bind functions from compiled extension modules.

Curious? The code is here: Ogre-0.0.tar.gz.