Leave now if you're not into Python extension programming and performance benchmarking: we're headed deep into the weeds.
Iterating over the features of a collection (or layer, if you will) and operating on them is one of a geo-processor's most fundamental patterns. We exercise this pattern over and over, and one of my design goals for Refinery is to do it more easily and more efficiently than it can be done with ogr.py. Recent correspondence with a Cython user convinced me that I should give it a try as a potentially faster way to wrap libgdal. Cython compiles Pyrex code (which is Python more or less) into C, which can then be compiled into a Python extension.
Last night I rewrote Refinery using Cython instead of ctypes and just finished getting the original tests to pass this morning. Is it faster? To find out, I benchmarked ogr.py, Refinery, and the new code I'm temporarily calling Ogre ("OGR, easier") using the timeit module. The Timer executes a block of statements in a manner that avoids the most common timing errors:
t = timeit.Timer(stmt=STATEMENTS)
print "%.2f usec/pass" % (1000000 * t.timeit(number=10)/10)
The Refinery and Ogre statements are basically the same:
w = ogre.workspace(PATH)
c = w[NAME]
for f in c:
id = f.id
PATH is the filesystem path to a shapefile with 3784 features, NAME is the name of the data collection (or layer), and all I'm doing inside the iteration block is dereferencing the feature's id. The equivalent block of statements using ogr.py is:
source = ogr.Open(PATH)
layer = source.GetLayerByName(NAME)
layer.ResetReading()
while 1:
feature = layer.GetNextFeature()
if not feature:
break
id = feature.GetFID()
feature.Destroy()
source.Destroy()
And now, the results (on a single Celeron processor machine that runs the Pystone benchmarks in 2.71):
Refinery (with ctypes)
4972288.61 usec/pass
Ogre (Cython/Pyrex)
682301.81 usec/pass
ogr.py (old-school bindings)
9183181.72 usec/pass
Ogre takes 180 microseconds per feature, about 7 times faster than Refinery, and about 13 times faster than ogr.py. I didn't benchmark the gdal module from the GeoDjango because I couldn't see anything that would make its performance much different from that of Refinery. I've heard that the QGIS community also has Python bindings for OGR in the works, but I'm not set up to build or benchmark those.
Cython makes for a faster module, but there is a price. It has to be compiled against the GDAL lib, which is a pain in the neck on Windows. It's non-standard, unlike ctypes, which means one more package to deploy before you can even start to build it. And while developing in the Pyrex language beats writing an extension in C (I've been there), it's still not Python. I've become dependent on the "import pdb; pdb.set_trace()" trick, and it's not an option with Cython. Still, the higher performance is rather enticing.
Yes, the cost of processing a feature could easily overshadow the per-feature cost of iteration, but here again there's potential for Cython to help in combination with Refinery or Ogre. The collection iterators have an object hook to which one could bind functions from compiled extension modules.
Curious? The code is here: Ogre-0.0.tar.gz.
Comments
Re: Taming the OGR
Author: Matt Perry
Very cool. Some nice syntactic sugar on top of the unpythonic (but useful) OGR API. Out of curiosity, why "collections" rather than the more standard terminology of "layers"?Re: Taming the OGR
Author: Sean
Thanks, Matt. My other favorite feature is no dependence on SWIG. This uses the Python ctypes module, which is standard in 2.5. The term 'collection' is inspired by WFS, AtomPub, and GeoJSON. Is 'layer' really standard outside the GDAL community?Re: Taming the OGR
Author: Matt Perry
Ah! I assumed it was a thin layer on top of OGR SWIG bindings.. ctypes looks like the future of FFI and it's nice to see it being put to such good use! Layer is pretty much the universal word across all GIS camps for a group of vector features. This is the first I've ever heard of "collections" in reference to spatial data (apparently I don't delve too much into the web-based world too often these days ;-).Re: Taming the OGR
Author: Paul Ramsey
I think "Collection" is a better term for a set of features. "FeatureCollection" is somewhat more explicit. A "Layer" is a styled "FeatureCollection", it's the thing you see in your map, it's not the data itself, and one "FeatureCollection" can be the basis for multiple "Layers".Re: Taming the OGR
Author: Matt Perry
I like that. A workspace contains collections of features. Each collection can be paired with one or more styles to create layers. The only reason I'm harping on this is that anyone with a traditional GIS background will be flummoxed by the "collection" terminology at first. The last thing we need in the geospatial world is more confusion over vocabulary ... I'm still dealing with the whole "coverage" debacle!