Including GDAL and GEOS in binary wheels that go to the Python Package Index (PyPI) has been good for users of fiona, rasterio, and shapely, but has also exposed Python GIS programmers on OS X to a perplexing bug. Shapely's issue #553 has been weighing on me for a long time. It's been hard to understand, hard to explain, and bites very hard. The short program below will trigger the bug on any version of OS X with any version of fiona and any version of shapely before 1.7a2 which is installed using a wheel from PyPI.
The unary_union function uses a GEOS STRtree and an assertion that should never be reached is in fact reached and the program aborts.
There has been a work-around – import shapely before importing other modules that are dynamically linked with GEOS, such as fiona and rasterio – but that's smelly and hard to ensure in a complex project.
The problem exists in a small niche: one operating system and one kind of
program, one that that loads GEOS twice in different ways. Once when fiona's
extension modules are loaded and cause the linked GEOS library to be loaded as
with any compiled program, and once again when shapely is imported and it calls
dlopen from python to load GEOS. I've found no diagnosis or solution of
a similar problem on the web. Was the problem in GEOS? Was it in the library
files that I built for the fiona, rasterio, and shapely wheels? Was the bug in
delocate, the program I use to find dependencies and bundle them into the
wheels? I asked GEOS developers and folks at work, to no avail. I was stumped
for a long time.
I asked Even Rouault about the issue and he suggested a solution might be found
in being more careful about using the correct mode with
dlopen and he
offered some example Python code that included this:
handle = CDLL(None).
My first thought was "can we use the ctypes CDLL class like this?" The docs say
that the first argument is a path to a library file and that's how I've been
using it in Shapely.
While looking for guidance on passing
None to the CDLL constructor, I found
some commented code in https://eli.thegreenplace.net/2013/03/09/python-ffi-with-ctypes-and-cffi.
The OS X dlopen man page doesn't exactly say that,
dlopen() examines the mach-o file specified by path. If the file is compatible with the current process and has not already been loaded into the current process, it is loaded and linked. After being linked, if it contains any initializer functions, they are called, before dlopen() returns. dlopen() can load dynamic libraries and bundles. It returns a handle that can be used with dlsym() and dlclose(). A second call to dlopen() with the same path will return the same handle, but the internal reference count for the handle will be incremented. Therefore all dlopen() calls should be balanced with a dlclose() call.
If a null pointer is passed in path, dlopen() returns a handle equivalent to RTLD_DEFAULT.
but the POSIX spec is clear that
dlopen(NULL) returns a handle to a global
symbols object and that's what I observe on my MacBook. Here's an interpreter
session where I import fiona and look in the global symbols object for a GDAL
C API function.
When fiona is imported, Python calls dlopen with the path to fiona's extension (.so) modules, and as these are liked with libgdal, the GDAL library is loaded. Now, libgdal is linked with libgeos_c, so will we find GEOS C API functions in the global symbols object?
Yes. This changes everything. The problem that has bedeviled shapely users is
caused by using dlopen to load a library that's already been loaded. We can avoid
the problem by detecting whether libgeos_c has already been loaded and skipping
the troublesome dlopen call. I've done this for shapely and made a new 1.7
pre-release. Please considering running
pip install -U --pre shapely to try
The OS X man page for dlsym says that looking up functions in the global handle is the slowest possible approach, but ctypes does cache the addresses of functions it finds and my test programs don't run measurably slower, so this just might work out.
I'd like to understand why this problem doesn't also occur on Linux. Is it because of differences in the library loaders or the executable file formats themselves? Is there subtle platform-specific bug in GEOS? I'll write more if I get to the bottom of this.