Teaching Python GIS users to be more rational

We, developers and documenters of GIS software for Python, have been teaching users to code irrationally. I'm certain that there's a cost to this and that we need to try to undo the damage as much as we can. Irrational coding happens in the real world. I'm largely self-taught as a programmer and have a long history of using and misusing software that I don't fully understand, cargo-culting code, etc. I've been there. This is part of the reason we write software, to put power in the hands of less expert computer users, and it seems to necessarily invite the problem of irrational programming. When software is baffling rather than enlightening, it can make the problem worse. Software ought to instead teach its users to think more rationally and reward them for doing so. In theory, there ought to be a virtuous circle here for open source software: rational users providing better bug reports and more effective community support, and becoming code contributors at a higher rate. Teaching users to be irrational, on the other hand, guarantees a steady and soul-sucking stream of frequently asked low grade support questions.

I'm thinking about two things going on in Python GIS software in particular. Exhibit A is the practice of teaching users to use del to close or finalize externalize resources like database connections and cursors. See http://gis.stackexchange.com/search?q=%5Bpython%5D+%22+del+%22. Think hard about this when you're developing, documenting, and supporting Python APIs. Do you really want your users to have to know so much about the details of del or what happens to Python objects after they employ del? And whether it's dependent on their interpreter's garbage collection? Do you really want to teach them to rely on side-effects of del?

Python's del statement is all about namespace and container management and has nothing directly to do with finalizing external resources:

Deletion of a name removes the binding of that name from the local or global namespace, depending on whether the name occurs in a global statement in the same code block. If the name is unbound, a NameError exception will be raised.

It is illegal to delete a name from the local namespace if it occurs as a free variable in a nested block.

Deletion of attribute references, subscriptions and slicings is passed to the primary object involved; deletion of a slicing is in general equivalent to assignment of an empty slice of the right type (but even this is determined by the sliced object).

Think about this too: will del work to close and finalize external stuff if the name you're deleting is not in a local or global namespace, but is a list or dict item? Or if you've bound two names to the same object do you have to del both names or just one?

Any closing of external resources is a side effect, and side effects in software are almost always less than deterministic in nature. Teaching users to rely on side effects is, in effect, teaching them to program irrationally. Here's a user (not a GIS user, either) who has been particularly harmed, IMO: http://stackoverflow.com/a/12626643/159235.

The reason that del works at all when it does is that the object to which the name was bound has a __del__() method that calls another method that finalizes. Just make that other method part of your API, document it well, and you have users programming rationally and getting rational results like external files and resources closed and freed precisely when they should be.

To their credit, the ArcPy developers have corrected this in recent versions, but any visit to GIS StackExchange shows that the damage caused to programmers continues.

Exhibit B is the practice of teaching users to rebind names to None to finalize, as in GDAL/OGR Python bindings. Before you write an API like this or write a tutorial or book chapter that advocates this, ask yourself what is so special about None in this case? Will rebinding that name to 0, or True, or "abracadabra!" work just as well? Why or why not? A library developer or teacher or writer should know the answer to this and be able to explain it.

None truly is special in Python (just compare the results of None = 1 and False = 1 at your interpreter prompt to see), but is its special nature the thing that causes finalization, or is it the binding, or what? What is actually going on in the software and how do we teach users to use it rationally?

In fact, the reason rebinding to None works when it does is the same as above: some finalizing method is called by the interpreter when the object is no longer referenced by anything else. Just expose that method to programmers and you have a less baffling API.

In other words, explicit is better than implicit. Teach users to close and finalize external resources explicitly instead of teaching them to do something else that has finalization as a side effect, and let's see if we all profit from more rational Python GIS programmers.

Is rasterio fast enough?

I've been working on a Python package named rasterio and have blogged about my goals: more Python idioms, no gotchas, and speed. I'm doing well on the first two, but now let's look at the last of these. Is rasterio fast enough?

Here is a raster band read to Numpy ndarray benchmark with a comparison to osgeo.gdal. The RGB.byte.tif file is a small (about 800 x 800 pixel) GeoTIFF file. I'm going to open the file, read the entirety of the first band to a ndarray, and then close the file. The gdal Dataset class has no explicit closing method, so I'm assigning the variable to None and indirectly freeing up the GeoTIFF file. In rasterio, you should call an opened file's close() method when done, or use a with statement.

import timeit

import rasterio
from osgeo import gdal

n = 100

# GDAL read as array
s = """
src = gdal.Open('rasterio/tests/data/RGB.byte.tif')
arr = src.GetRasterBand(1).ReadAsArray()
src = None
"""

t = timeit.timeit(s, setup='from osgeo import gdal', number=n)
print("GDAL:")
print("%f usec\n" % (t/n))

# Rasterio read band
s = """
with rasterio.open('rasterio/tests/data/RGB.byte.tif') as src:
    arr = src.read_band(1)
"""

t = timeit.timeit(s, setup='import rasterio', number=n)
print("Rasterio:")
print("%f usec\n" % (t/n))

Here are the results on my 1.7 GHz Intel Core i7 MacBook Air:

$ python benchmarks/ndarray.py
GDAL:
2.356141 ms

Rasterio:
2.655728 ms

I'm pleased that Rasterio is only about 15% slower for this benchmark. And the difference between the two will diminish as files grow larger. Rasterio does considerably more validation of input and logs quite a bit. Also, rasterio could be a little more lazy about getting and formatting a file's coordinate referencing system. Less code and fewer bugs is the main goal of rasterio, so getting close to gdal.py performance-wise is icing on the cake.

Some consequences of GeoJSON features as arrays

Don't mind me, I'm just thinking out loud about the GeoJSON format.

A GeoJSON Feature Collection is an object with a "features" key and the value for the key is specified to be an array of Feature objects. Is the order of Features significant? In the context of my previous post, should a zero or empty RFC 6902 diff between this

{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "id": "foo",
      ... },
    { "type": "Feature",
      "id": "bar",
      ... } ] }

and this, where the feature array has been reversed,

{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "id": "bar",
      ... },
    { "type": "Feature",
      "id": "foo",
      ... } ] }

be expected? Strictly speaking, no, because in JSON (see json.org or ECMA 404)

An array is an ordered collection of values.

whereas

An object is an unordered set of name/value pairs.

My question is related to one asked on the IETF's apps-discuss list in January. Should we expect to be able to patch GeoJSON documents using the ids of individual features? Being able to do so is more or less the reason why GeoJSON Features have unique ids, but it's clear to me that a generic RFC 6902 won't do. As James Snell points out in that thread, more expressive paths will be required.

In my previous post on this topic (see link at the top of this one) I looked at the addition of a new feature to the tail of the features array. The JSON patch in this case was very simple. But what if that feature had been inserted at the head of the features array? Is the resulting patch going to be something that reads like "push back existing items and add new item to the head?" The result from python-json-patch isn't pretty.

[
  {
    "path": "/features/0/geometry/coordinates/0",
    "value": -122.65496134757996,
    "op": "replace"
  },
  {
    "path": "/features/0/geometry/coordinates/1",
    "value": 45.5212590460849,
    "op": "replace"
  },
  {
    "path": "/features/0/properties/Notes",
    "op": "remove"
  },
  {
    "path": "/features/0/properties/Name",
    "value": "Three Friends Coffeehouse",
    "op": "replace"
  },
  {
    "path": "/features/0/properties/Address",
    "value": "201 SE 12th Ave, Portland, OR 97214",
    "op": "replace"
  },
  {
    "path": "/features/1/geometry/coordinates/0",
    "value": -122.67516911029816,
    "op": "replace"
  },
  {
    "path": "/features/1/geometry/coordinates/1",
    "value": 45.55673233031101,
    "op": "replace"
  },
  {
    "path": "/features/1/properties/Notes",
    "value": "usually busy, outlets on side wall only",
    "op": "replace"
  },
  {
    "path": "/features/1/properties/Name",
    "value": "Albina Press",
    "op": "replace"
  },
  {
    "path": "/features/1/properties/Address",
    "value": "4637 N Albina Ave  Portland, OR 97217",
    "op": "replace"
  },
  {
    "path": "/features/2",
    "value": {
      "geometry": {
        "type": "Point",
        "coordinates": [
          -122.68242716789246,
          45.56997505986905
        ]
      },
      "type": "Feature",
      "properties": {
        "Notes": "",
        "Name": "Arbor Lodge",
        "Address": "1507 N Rosa Parks Way  Portland, OR 97217"
      }
    },
    "op": "add"
  }
]

Unless I misunderstand RFC 6901 and RFC 6902, the following ought to be possible.

[
  { "op": "move", "from": "/foo/1", "path": "/foo/2" },
  { "op": "move", "from": "/foo/0", "path": "/foo/1" },
  { "path": "/features/0`",
    "value": {
      "geometry": {
        "type": "Point",
        "coordinates": [
          -122.65496134757996,
          45.5212590460849
        ]
      },
      "type": "Feature",
      "properties": {
        "Name": "Three Friends Coffeehouse",
        "Address": "201 SE 12th Ave, Portland, OR 97214"
      }
    },
    "op": "add" }
]

Making concise patches won't be trivial, that seems clear, and the ideal patch style for GeoJSON could differ from the ideal style for other applications. I'm certain that we'd need to get to patches like the latter to make GeoJSON patching something that developers would consider adopting. JSON PATCH provides a useful grammar but by itself isn't enough. There's more work to be done.

Did we ever consider making FeatureCollection's features key have an object value? I'll look in the GeoJSON list archive when I have time.

YLT covers NRBQ's Ridin' in My Car

I know everybody else is using Pandora and Spotify and what not these days, but here at home we're doubling down on actual 87.7 to 108.0 MHz radio. We give to KCFR (Colorado Public Radio) and WFMU (Jersey City, New York, the Internet). I'm wearing my WFMU donor swag T-shirt from 2012 in this photo. This past March 14, while I was packing to go to Santa Clara for PyCon, I made a song request during Yo La Tengo's all-request fundraising show. The band played it, I nearly wet myself with glee... and now just a few minutes ago, after searching for the first time in months (I'd almost given up), I've found it on SoundCloud and am experiencing uncomfortable levels of glee all over again: https://soundcloud.com/fonsessions/ridin-in-my-car-nrbq-cover-yo?in=fonsessions/sets/yo-la-tengo-wfmu-all-request.

http://farm8.staticflickr.com/7459/11247313594_2f9dc520d8_c_d.jpg

Definitely a highlight of 2013.

First blog post at MapBox

I've made my first post for the MapBox blog: Mapping the open web using GeoJSON. It's a roundup of products and services at work that employ GeoJSON and a bit of an explanation of why the format fits MapBox so well. I work at a company that puts developers first, and GeoJSON is a format that also puts developers first.

What I didn't mention in the post is how much I see the format used for getting things done everyday at work. Cutline for gdalwarp? GeoJSON. Need to share some data in a Gist with another developer? GeoJSON. Quick map to help someone get from A to B? GeoJSON. Of course this is what the GeoJSON authors were hoping to make possible, but it's pretty exciting to see it happening, and in unanticipated ways.

Blogging at MapBox is super slick thanks to the engineering work Mike Morris wrote about recently and I found the amount and quality of feedback fascinating – seven different people collaborated with me on the post in one way or another. I'm looking forward to the next one.

JSON diff and patch for GeoJSON

We diff and patch source code all the time, but doing the same thing for JSON data isn't as straightforward. More often than seeing the differences between JSON texts, we're interested in seeing (and sharing and reapplying) the differences between the objects encoded in the text. We want to gloss over differences in whitespace and order of object members and see the real difference in the data. Unix patch and diff operate only on text, and best on text of consistent and short line length, and aren't very useful for parsing out differences between objects encoded in JSON. If diff and patch aren't going to work, what will?

RFC 6902, JavaScript Object Notation (JSON) Patch is one attempt at standardizing JSON patches. From the introduction:

JavaScript Object Notation (JSON) [RFC4627] is a common format for the exchange and storage of structured data. HTTP PATCH [RFC5789] extends the Hypertext Transfer Protocol (HTTP) [RFC2616] with a method to perform partial modifications to resources.

JSON Patch is a format (identified by the media type "application/ json-patch+json") for expressing a sequence of operations to apply to a target JSON document; it is suitable for use with the HTTP PATCH method.

This format is also potentially useful in other cases in which it is necessary to make partial updates to a JSON document or to a data structure that has similar constraints (i.e., they can be serialized as an object or an array using the JSON grammar).

RFC 6902 comes from developers interested in REST-style interaction with JSON representations of resources but might be more generally useful. How does it work? If you're familiar with XPath and XPointer, you might see some similarities. A JSON patch doc contains an array of path, value, and operation tuples. The path tells you how to traverse into the JSON document and the operations specify what is done to objects using the values.

For an example, I'm going to take GeoJSON from Lyzi Diamond's Learn GeoJSON project and make diffs using Stefan Kögl's python-json-patch package. RFC 6902 has implementations in other languages, too, such as jsonpatch-js (npm).

Here's a script that prints out the RFC 6902 diff between commits 44de76ef53 and e9514f5c31 of hackspots.geojson:

import json
import jsonpatch
import os.path
import requests
import subprocess

# original
if not os.path.exists('hackspots-44de76ef53.json'):
    r = requests.get('https://github.com/lyzidiamond/learn-geojson/raw/44de76ef53b20bdaf51e0cde4aa634df210cd9d4/geojson/hackspots.geojson')
    data = r.content
    with open('hackspots-44de76ef53.json', 'wb') as f:
        f.write(data)

# next commit
if not os.path.exists('hackspots-e9514f5c31.json'):
    r = requests.get('https://github.com/lyzidiamond/learn-geojson/raw/e9514f5c317ee980b94ed6580950cfd9fbde53db/geojson/hackspots.geojson')
    data = r.content
    with open('hackspots-e9514f5c31.json', 'wb') as f:
        f.write(data)

diff_text = subprocess.check_output([
                'jsondiff',
                'hackspots-44de76ef53.json',
                'hackspots-e9514f5c31.json'])
diff = json.loads(diff_text)
print(json.dumps(diff, indent=2))

The result looks, to me, like a very reasonable representation of an operation that adds one feature to a GeoJSON feature collection.

[
  {
    "path": "/features/2",
    "value": {
      "geometry": {
        "type": "Point",
        "coordinates": [
          -122.65496134757996,
          45.5212590460849
        ]
      },
      "type": "Feature",
      "properties": {
        "Name": "Three Friends Coffeehouse",
        "Address": "201 SE 12th Ave, Portland, OR 97214"
      }
    },
    "op": "add"
  }
]

A hypothetical change of this same feature's coordinates from e9514f5c31 to https://gist.github.com/sgillies/7797688#file-foo-geojson looks like this:

[
  {
    "path": "/features/2/geometry/coordinates/0",
    "value": -122.655,
    "op": "replace"
  },
  {
    "path": "/features/2/geometry/coordinates/1",
    "value": 45.522,
    "op": "replace"
  }
]

There's a potential for RFC 6902 style GeoJSON diffs to become very large if highly detailed lines and polygons are being modified (via simplication or reprojection). But GeoJSON itself is quite verbose already and a verbose patch representation feels not surprising. If you already think GeoJSON sucks, JSON patch for GeoJSON is going to look exta sucky because of the lengths of paths to individual coordinate items. I don't think there's any way around that, although GeoJSON's recursive coordinates member helps a little – consider that the path to the first 'x' coordinate of the exterior ring of the first part of a multi-polygon is "just" '/features/42/geometry/coordinates/0/0/0/0'.

I'm not sure how RFC 6902 patches will play out at scale, but I think it's worth further consideration.

Atom extension for Tinkerer

Tinkerer, the software I'm using for my blog, lacks a good Atom feed, so last night I wrote my first Sphinx extension to provide one. Rutherford adds a couple new config items and a full content Atom feed. It exposes the Tinker categories and tags and tracks the modification times of posted .rst files so that feeds and their entries get updated accordingly. My own feed is http://sgillies.net/blog/feed.atom.

I'm becoming a fan of the Sphinx extension story, at least for simple cases like mine. Python's packaging situation continues to get better and there are all kinds of wins for insisting that extensions (think QGIS plugins) be Python packages: support from pip and setuptools, version pinning, package dependency resolution, etc.

New home on GitHub for GeoJSON

Tim Schaub has acquired the geojson organization on GitHub from Jason Roebuck (thanks, Jason!) and moved the old GeoJSONWG org to https://github.com/geojson. If you've cloned any of the repos, adjust your remotes accordingly. More good GeoJSON news coming soon.