Vectorizing shapely in 2020

Since its inception GeoPandas has used shapely geometry objects to represent the extent of things in data frames and has relied on shapely for geometry operations and predicates.

The problem is that GeoPandas performance has not been good enough for rapid work on datasets with many millions of records. The root of the problem is that shapely has no fast, memory-efficient C API for GeoPandas to use. Matthew Rocklin explains this in Fast GeoSpatial Analysis in Python, using the example of a GeoPandas distance method.

def distance(self, other):
    # one GEOS call per geometry, each crossing the Python/C boundary
    result = [geom.distance(other) for geom in self.geometry]
    return pd.Series(result)

This is slower and has a larger memory footprint than an implementation in C. Each item in the array of shapely geometry objects named self.geometry contains a pointer to a C struct. The size of the struct is approximately 16 bytes per coordinate value. The 1:10,000,000 scale Natural Earth Urban Areas dataset has 11,878 shapes with a mean of 96.4 coordinates, or roughly 1500 bytes, per shape. These coordinate values may be duplicated in the Python object depending on whether they've been accessed, which can quickly double the size of a large dataset in memory.
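A quick back-of-the-envelope check of those numbers (the 16 bytes per coordinate value, the shape count, and the mean coordinate count are the figures quoted above, not measurements of mine):

```python
# rough estimate of GEOS-side coordinate storage for the Natural Earth
# Urban Areas dataset, using the figures quoted in the text above
BYTES_PER_COORD = 16        # approximate struct cost per coordinate value
MEAN_COORDS_PER_SHAPE = 96.4
N_SHAPES = 11_878

per_shape = MEAN_COORDS_PER_SHAPE * BYTES_PER_COORD
total_mb = per_shape * N_SHAPES / 1e6

print(f"~{per_shape:.0f} bytes per shape, ~{total_mb:.1f} MB for the dataset")
```

That's before any duplication on the Python side, which, as noted, can roughly double the footprint once coordinates have been accessed.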

We can quickly see the raw speed difference between C and Python in the code snippets below. In the first, we compute the sum of ten million floating point numbers stored in a Python array, entirely in interpreted Python. It takes about 400 milliseconds on my laptop.

$ python -m timeit -s "import array; data=array.array('d', range(10000000))" "sum=0" "for val in data:" "    sum += val"
10 loops, best of 3: 406 msec per loop

In the second snippet, we compute the sum using Python's built-in sum(). It's 4 times faster because the loop runs in C, temporary variables are kept in C, and the items are assumed to have a consistent type (see this SO answer for details).

$ python -m timeit -s "import array; data=array.array('d', range(10000000))" "sum(data)"
10 loops, best of 3: 98.7 msec per loop

We can compute the sum even faster by storing the numbers in a numpy typed array and calling its sum() method. I see an 80x speedup over the entirely interpreted approach on my laptop.

$ python -m timeit -s "import numpy; data=numpy.array(range(10000000), dtype=numpy.float64)" "data.sum()"
100 loops, best of 3: 5.13 msec per loop
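The three timeit one-liners above can also be run side by side from a single script, which makes it easy to confirm that all three approaches agree on the answer while differing only in speed (timings vary by machine; the variable and function names here are mine):

```python
import array
import timeit

import numpy as np

N = 10_000_000

data = array.array("d", range(N))
ndata = np.arange(N, dtype=np.float64)

def loop_sum(xs):
    # the fully interpreted approach: loop and accumulator in Python
    total = 0.0
    for val in xs:
        total += val
    return total

# all three approaches agree on the answer, N * (N - 1) / 2
assert loop_sum(data) == sum(data) == ndata.sum() == N * (N - 1) / 2

for label, stmt in [
    ("python loop", lambda: loop_sum(data)),
    ("built-in sum", lambda: sum(data)),
    ("numpy sum", lambda: ndata.sum()),
]:
    best = min(timeit.repeat(stmt, number=1, repeat=3))
    print(f"{label}: {best * 1000:.1f} msec")
```

The ordering of the results is the point: moving the loop, the temporaries, and finally the data itself into C is what buys each successive speedup.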