2011 (old posts, page 1)

Is Pleiades a content farm?

More importantly, is Pleiades adequately distinguishable from a content farm? We don't host advertisements on Pleiades and don't make any money off a single page view or click-through (though we certainly hope that enough page views help make a case for continued funding), but Pleiades does have one characteristic that's becoming identified with content farming: shallow or low-quality content.

To see what I mean, take a look at our source material for Pleiades. We have many long tables with a few columns of very terse information:

Grid  Name                Period  Modern Name / Location  Reference
-----------------------------------------------------------------------------
B2    Acqua Acetosa       A       Laurentina              Bedini 1981
B2    Acquafredda         A       Rossi                   Diana 1984
B2    Ad Aquas Salvias    RL      Tre Fontane             Ferrari 1957, 33-48
...

From these records we've created resources like the one at http://pleiades.stoa.org/places/422802/. That one didn't initially have coordinates or a feature type; these have only come recently from our partner project, DARMC. Before that, we had a location that could only be described as low quality. The details of the resource remain sketchy. One might say "shallow". This has been our strategy for Pleiades: start broad and shallow, then fill in and refine incrementally, with writers producing high quality place pages over time.

Until recently, the strategy seemed pragmatic and sound. I may have even used some farming metaphors when describing Pleiades to people over the past few years. If I had a dollar for every time I've heard or read someone going on about "growing communities", I'd be typing this from my yacht in the Aegean. Times are changing: sites like Stack Overflow that go deep on topics are favored, while sites that go broad and shallow are disfavored. I didn't pay enough attention to our search ranking before the new policy, but I have seen the count of our pages in Google's index dropping a bit over the last week. Until I see it bounce back up, I'd caution academic projects that are looking to be well placed in search results against taking the same approach.

Command line JSON

I finally found out about json.tool while debugging responses from the Pleiades spatial index server. It pretty-prints and validates.

$ curl "http://pleiades.stoa.org/places/442733/json" | python -mjson.tool
{
    "bbox": [
        15.005568999999999,
        40.422631000000003,
        15.005568999999999,
        40.422631000000003
    ],
    "features": [
        {
            "geometry": {
                "bbox": [
                    15.005568999999999,
                    40.422631000000003,
                    15.005568999999999,
                    40.422631000000003
                ],
                "coordinates": [
                    15.005568999999999,
                    40.422631000000003
                ],
                "type": "Point"
            },
            "id": "darmc-location-18805",
            "properties": {
                "description": "",
                "link": "http://pleiades.stoa.org/places/442733/darmc-location-18805",
                "title": "DARMC location 18805"
            },
            "type": "Feature"
        }
    ],
    "id": "442733",
    "type": "FeatureCollection"
}
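
The validation half is just as handy: feed it broken JSON and you get a parse error instead of pretty output (exact message varies with your Python version):

$ echo '{"type": "Point",}' | python -mjson.tool
Expecting property name: line 1 column 17 (char 17)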

Would it be worth adding such a tool to the geojson module?
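
A minimal sketch of what that might look like, mirroring json.tool on top of the geojson package's loads and dumps (the module ships no such tool today; the names and behavior here are just a proposal):

# geojson.tool sketch: validate and pretty-print GeoJSON from stdin.
import sys

import geojson

def main():
    try:
        obj = geojson.loads(sys.stdin.read())
    except ValueError as err:
        raise SystemExit("Not valid JSON: %s" % err)
    print geojson.dumps(obj, sort_keys=True, indent=4)

if __name__ == "__main__":
    main()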

Comments

Re: Command line JSON

Author: Seth G

It would be great to have a geojson validator (command line or web) along the lines of http://www.jsonlint.com/

I'm never sure if CRS (or SRS?) is required - GDAL seems to crash if it is missing. It would be nice to have a tool to check if there are missing coordinate values etc. (although loading with Shapely helps with the latter).

Re: Command line JSON

Author: Sunng

Python's json module can serve as a formatting tool.

Recently I found a command line query tool for JSON called jshon, which is very helpful.

http://kmkeen.com/jshon/index.html

Re: Command line JSON

Author: Howard Butler

Seth G,

File a bug on GDAL if its GeoJSON module is crashing in cases with or without SRS info. It shouldn't be required, and if it is crashing, it needs to be fixed.

Howard

Upcoming

On the 22nd of May, I will be arriving in Minneapolis for the North American edition of the FOSS4G conference, and on the next day I will be presenting a talk entitled GeoJSON is Spectacularly Wrong. I had also submitted a Python GIS talk and was surprised and concerned when both were accepted, because there's no way I could pull off two. I'd enjoy talking about Python programming a little more, but it looks like the community is slightly more interested in GeoJSON. I hope that I didn't choose the wrong topic. Thursday night I'll be at the fancy shindig, and because open source GIS events always need more dudes, I will be bringing my brother. Friday I'll be sticking around to hack and catch up with what's going on in the field. It's been a long time since I've been to one of these conferences. Maybe I can share my Python 3 porting experience or offer a "show me your Python script and I'll show you how to make it measurably better/faster, or your money back" service. Yes, like in 2003, I'm going to a Yo La Tengo show one of the nights (here in Colorado, which is why I'm arriving late). No, unlike in 2003, I'm not driving.

The following week I'm going to be at Drew University in Madison, NJ, for the second part of ISAW and Drew's Linked Ancient World Data Institute. Just between you and me, I'm a bit burned out on talking about linked data; it's long past time to just start producing and using linked data to get stuff done. I believe that's what we're going to try to do at Drew this year.

Comments

Re: Upcoming

Author: J. Jones

I was excited when I saw you appear on the preliminary program, and am of course looking forward (or not) to hearing the latest critique of my beloved just-good-enough wire format. While I won't be in Minny, I'll eagerly await the posting of the full presentation. That, and whatever those great guys at the NWS are presenting: Digital Weather forecasts with open tools (plug, plug).

Links

I figure three of the first steps towards making saucisson are learning how to grind meat, stuff it neatly in a casing, and make tidy links. For exercise number one, I followed the directions in Bruce Aidells' Complete Sausage Book and modified the recipe for "Chicken and Turkey Merguez", using lamb instead of fowl, espelette pepper in place of cayenne (for the kids), and Banyuls vinegar instead of lemon juice for the full Sud de France effect. The results are, I think, not bad for a first try.

http://sgillies.net/images/DSC_0325.jpg

We fried up the leftovers for lunch. Less fatty, a little more herby (parsley and mint), and with less paprika than the ones we were getting from the marché in Montpellier. The kids loved it.

Our KitchenAid made grinding and stuffing a breeze, at least for this quantity. Natural sausage casing is remarkably puncture resistant and easy to handle. A little more practice and we should be able to run it faster and really crank out the links.

Does listing the ingredients make this open source sausage?

  • 2.25 pounds 90% lean lamb (85% might be better)
  • 1/2 medium onion + 2 large cloves garlic, chopped very fine
  • 1/4 cup parsley + 1/8 cup mint, fresh, minced
  • 1.5 tbsp kosher salt
  • 1/2 tsp granulated sugar
  • 1 tsp ground black pepper
  • 1/2 tsp espelette pepper (I'd do a tbsp if I had it)
  • 1.5 tbsp paprika
  • 1.5 tsp ground coriander seeds
  • 1.5 tsp ground fennel seeds
  • 1.5 tsp ground cumin seeds
  • 1/2 tsp powdered turmeric
  • 1/2 tsp ground allspice
  • 1.5 tbsp tomato paste
  • 1 tbsp Banyuls vinegar
  • 4 feet natural (intestinal submucosa) pork sausage casing

Comments

Re: Links

Author: Sean

It's the Vermeil from La Guinelle. I had the pleasure of meeting Mme. Herre at La Cave des Arceaux near the end of our séjour. We're going to run out of it soon and apparently they have (yet?) no point of sale in Colorado.

Get with it

I don't know a nicer way to say this: if your Python software requires you to routinely delete objects, it is letting you down. I'm looking at you, ArcPy cursors:

import arcpy

cur = arcpy.SearchCursor("roads", '"TYPE" <> 4')

for row in cur:
    print "Name: %s,  CFCC code: %s" % \
          (row.NAME, row.CFCC)

del cur, row

The ArcPy cursor locks resources, and this lock is removed in the cursor object's __del__ method (evidently; update 2011-02-02: maybe not so evidently, see the comments). This is a less than ideal design because, as stated in the Python docs, del ob does not immediately cause ob.__del__() to be called. To their credit, the ArcPy docs acknowledge this issue:

When working in a Python editor, such as PythonWin, you may need to clean up object references to remove dataset locks set by cursors. Use the gc (garbage collection) module to control when unused objects are removed and/or explicitly delete references within your script.
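
The hazard is easy to demonstrate with plain Python (a sketch; CPython happens to finalize promptly when the last reference disappears, but the language makes no such promise):

class Lock(object):
    """Stands in for a resource-holding object like a cursor."""
    def __del__(self):
        print "lock released"

cur = Lock()
alias = cur
del cur    # nothing printed yet - alias still references the object
del alias  # now the reference count drops to zero and __del__ runs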

You're quite seriously failing your users if you make them turn to manual intervention in the garbage collection cycle, but the problem of finalizing the cursor state remains. What would be better? There are two obvious ways to go: emulate Python files (or the Python DB API), or use PEP 343's with statement.

In the first approach, one would add close methods to cursors that immediately and explicitly free up all locked resources. The end of the script above becomes:

for row in cur:
    print "Name: %s,  CFCC code: %s" % \
          (row.NAME, row.CFCC)

# This is too optimistic - finalizes only eventually, if ever
# del cur, row

# Better - finalizes immediately, let garbage collection delete it
# normally
cur.close()

The with statement is more powerful, but more complicated. The problem PEP 343's with statement tries to solve is explained by Fredrik Lundh [effbot.org]:

As most other things in Python, the with statement is actually very simple, once you understand the problem it’s trying to solve. Consider this piece of code:

set things up
try:
    do something
finally:
    tear things down

Here, “set things up” could be opening a file, or acquiring some sort of external resource, and “tear things down” would then be closing the file, or releasing or removing the resource. The try-finally construct guarantees that the “tear things down” part is always executed, even if the code that does the work doesn’t finish.

It's a generalization of the problem an ArcPy cursor user faces: locking a data source during a session and returning the data source to its proper state at the end of the session. With context management, ArcPy could let a user write safe, foolproof code like this:

with arcpy.SearchCursor("roads", '"TYPE" <> 4') as cur:
    for row in cur:
        print "Name: %s,  CFCC code: %s" % \
              (row.NAME, row.CFCC)

# That's all - cursor finalizes when the 'with' block ends

Even without changes in ArcPy, a user can start dodging cursor reference trouble right now by writing a few lines of adapter code.

class make_smarter:
    def __init__(self, cursor):
        self.cursor = cursor
    def __enter__(self):
        return self.cursor
    def __exit__(self, type, value, traceback):
        # self.cursor.close() would be better
        self.cursor.__del__()

with make_smarter(arcpy.SearchCursor("roads", '"TYPE" <> 4')) as cur:
    for row in cur:
        print "Name: %s,  CFCC code: %s" % \
              (row.NAME, row.CFCC)

The make_smarter class is a "context guard" for the cur object. Its __enter__ method is called when the with block begins and its __exit__ method is called when the block ends, for whatever reason. Python's file object, since version 2.5, implements this very same protocol.
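
And if cursors did grow a close method as suggested above, nobody would have to write the guard by hand: the standard library's contextlib.closing is exactly this adapter. A sketch, assuming a hypothetical future cursor with close:

from contextlib import closing

with closing(arcpy.SearchCursor("roads", '"TYPE" <> 4')) as cur:
    for row in cur:
        print "Name: %s,  CFCC code: %s" % (row.NAME, row.CFCC)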

A context guard for writing data with ogr could be equally useful and hide the unfortunately named method you call to flush data to disk.

class make_smartogr:
    def __init__(self, layer):
        self.layer = layer
    def __enter__(self):
        return self.layer
    def __exit__(self, type, value, traceback):
        self.layer.Destroy()

Take it from me, because I've felt the pain during development of Shapely: relying on an object's __del__ method as ArcPy does (or maybe it counts references, which I equally recommend against; see the comments) will burn you sooner or later. The with statement adds dependability and safety to even the most straightforward data processing script.

Comments

Re: Get with it

Author: Kay

I noticed recently that Spatial Analyst's raster objects have similar problems. This is even worse in a way, as they are temporary files that refuse to be deleted or overwritten until you set their object to None first.

Doing something like this just doesn't work:

>>> raster += 5

So I have to do this:

>>> raster2 = raster + 5

>>> raster = None

>>> raster = raster2

>>> raster2 = None

I could just use numpy of course, then the first statement would work just fine.

Re: Get with it

Author: FMark

Hi Sean,

I like the idea, but have you actually tried this? It seems the Cursor object doesn't have a __del__ method:

>>> c = arcpy.SearchCursor(a)
>>> c

>>> dir(c)
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__hash__', '__init__',
'__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', '__weakref__', '_arc_object', '_go', 'deleteRow', 'insertRow', 'newRow', 'next', 'reset', 'updateRow']
>>>

Re: Get with it

Author: Sean

No, I haven't tried it. Maybe there's a lock-owning attribute of the cursor with a __del__ of its own, or maybe not. I can imagine alternatives, but I'll stop guessing. Regardless, reference counting shouldn't be so visible in an API. The more explicit management of resources that I'm advocating remains the better way to go.

Re: Get with it

Author: FMark

I couldn't agree more. Especially since ESRI apparently went to so much trouble to make their API more pythonic.

Re: Get with it

Author: Jason Scheirer

Some of this has already been implemented for ArcGIS 10.1 with the new Data Access module cursors. Schema locking and object lifetime at the C++ level have gotten significantly more fine-grained across the board in the Python extensions.

Re: Get with it

Author: Sean

Fine-grained sounds good, but coarse or fine, a user shouldn't have to count references. At least not in Python ;)

Re: Get with it

Author: Jason Scheirer

Right, the new cursors both utilize the with statement to open/close connections and are smart enough, for example, to close the connection once a read cursor is exhausted (which is what I meant by finer-grained; the Python object is no longer just a dumb holder of a reference to a C++ object that relies on __del__ to release). So no more dels, and you can safely use with statements to handle cursors' schema locks.
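
A sketch of the 10.1 style for comparison, reusing the field names from Sean's example (illustrative, not lifted from the docs):

import arcpy

# The data access cursor is a context manager and yields plain tuples.
with arcpy.da.SearchCursor("roads", ("NAME", "CFCC")) as cur:
    for name, cfcc in cur:
        print "Name: %s,  CFCC code: %s" % (name, cfcc)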

XYZ

While answering a fun question on Stack Overflow, I was reminded of the fantastic utility of Python's built-in zip when working with coordinates in space. If you're not using zip every chance you get, you are not a true GIS Pythonista. What the function does is "zip" sequences together.

>>> zip(['x', 'y', 'z'], ['1', '2', '3'])
[('x', '1'), ('y', '2'), ('z', '3')]

M sequences of length N become N tuples of length M. This is handy for geospatial programming because we often go back and forth between sequences of x, y pairs (vertex semantics; ignoring z for now) and separate arrays of x and y values (computational efficiency). I'm going to show here how to use zip along with pyproj and Shapely to calculate the surface area of the State of Colorado to a good approximation.

Let's say you have a GeoJSON representation of Colorado's boundary:

>>> co = {"type": "Polygon",
...  "coordinates": [[
...    [-102.05, 41.0],
...    [-102.05, 37.0],
...    [-109.05, 37.0],
...    [-109.05, 41.0]
...  ]]}

Note that Colorado is not exactly a rectangle in any coordinate system: the surveyed boundaries exhibit kinks on the borders with Utah and New Mexico. The method I'm going to show isn't invalidated by the boundary anomalies.

First, let's project the geographic coordinates onto a plane using a pyproj equal area projection centered on and bracketing the state.

>>> from pyproj import Proj
>>> projection = Proj(
...     "+proj=aea +lat_1=37.0 +lat_2=41.0 +lat_0=39.0 +lon_0=-105.55")

The projection is callable, but it wants unzipped sequences of longitudes and latitudes, not zipped GeoJSON coordinates. Handily, zip also unzips in conjunction with the * operator:

>>> lon, lat = zip(*co['coordinates'][0])
>>> lon
(-102.05, -102.05, -109.05, -109.05)
>>> lat
(41.0, 37.0, 37.0, 41.0)

Next, we project these values and zip them up again to make a projected GeoJSON representation.

>>> x, y = projection(lon, lat)
>>> co_projected = {"type": "Polygon", "coordinates": [zip(x, y)]}

Finally, make a Shapely geometric object from the projected feature and take its area.

>>> from shapely.geometry import shape
>>> shape(co_projected).area
268952044107.43542

This value is 0.3% smaller than the surveyed area of the state.

Use of zip replaces needless, slower Python code (the built-in is implemented in C) with an idiom that isn't alien to GIS. Once you start to think of coordinate pairs as "zipped" and coordinate arrays as "unzipped", it makes good sense.
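
For comparison, here's the unzipping loop you'd otherwise write by hand:

>>> lon, lat = [], []
>>> for pair in co['coordinates'][0]:
...     lon.append(pair[0])
...     lat.append(pair[1])
...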

The itertools module also provides an iterator version of this function: izip.
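
It behaves the same, but yields tuples lazily, which is kinder to memory when coordinate arrays get long (repr address elided):

>>> from itertools import izip
>>> izip(['x', 'y', 'z'], ['1', '2', '3'])
<itertools.izip object at 0x...>
>>> list(izip(['x', 'y', 'z'], ['1', '2', '3']))
[('x', '1'), ('y', '2'), ('z', '3')]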

Comments

Re: XYZ

Author: Jeremy

Dojo has an implementation of zip/unzip in javascript.

http://servicesbeta.esri.com/dojox.functional/zip.html

Eugene has other goodies on his blog. http://lazutkin.com/blog/2008/jan/12/functional-fun-javascript-dojo/

Re: XYZ

Author: Grant

I've recently begun using zip for staggered iteration over a list of coordinates, like so:

import math
coords = [(0,0), (0,1), (2,6), (1,0)]
for cur,next in zip(coords, coords[1:]):
    delta = math.sqrt((next[0]-cur[0])**2 + (next[1]-cur[1])**2)

Stumbled on the idea here: http://bit.ly/gOPACj

Ancient Toponym of the Week: Diarroia

No, really: http://pleiades.stoa.org/places/376769/diarroia.

It so happens that our reference is in Google Books: Purcaro Pagano 1976, 334. Pagano finds this name in a sailing itinerary and places it on the Syrtis Maior, the ancient Gulf of Sidra. Have we propagated the mistake of a traveler's annotated misfortune taken for an ancient place, or is this a true toponym used by ancient people to label a particularly infamous port of call?

Comments

Re: Ancient Toponym of the Week: Diarroia

Author: Leifuss

It's the real deal I'm afraid :-S

I came across it a while back, as Ptolemy mentions it as well. Diarrhoia means 'flow through', so perhaps it has (or had) a stream running through it? Pretty handy in the Gulf of Sidra, I'd imagine...

Re: Ancient Toponym of the Week: Diarroia

Author: Sean

Thanks for the comment, Leif. That's a rather mundane origin for the name; I wonder why we don't see more occurrences.

Looks like we should add a Ptolemy reference for this resource.

Re: Ancient Toponym of the Week: Diarroia

Author: Sean

I discovered today that the Barrington compilers made a separate entry for Ptolemy's "Diarroia", and we've published that one at http://pleiades.stoa.org/places/376770. Reconciling these possible duplicates is one of the goals of Pleiades.

For 2011

A few of my goals, not resolutions, for 2011:

  • Improve my Vim skills.
  • Brush up on JavaScript programming. Don't get left behind on the server side.
  • Make some headway on a framework for analysis and inference of fuzzy and relative locations. Apply to Pleiades in particular.
  • Gain more experience with NLTK, text corpora, and computational linguistics in general. Apply to Pleiades in particular.
  • Write more articles, particularly about the previous two goals.
  • Spend more time outside. I'm happier outside.
  • Introduce my kids to the Colorado Plateau.
  • Resume making beer at home. I've made almost 50 batches of beer, but none since 2004.
  • Maintain my French skills, mainly through music, movies, novels, bande dessinée; letters to friends, and to our French bank; and learning about tax refunds.
  • Dabble in charcuterie. Colorado is overflowing with good beer, but not so much with cured meat, unless you count beef jerky.
  • Get more involved in the Fort Collins community.

Comments

Re: For 2011

Author: Chris

Great goals, Sean - I particularly like the vim, javascript, and Colorado Plateau ones!

Have a good year. I'm sure we'll drink some beer at some point this year.

Chris

Re: For 2011

Author: Bill

Great goals. As for the last one, you'll find that kids have a way of leading you into that. Enjoy your upcoming year!

Bill

Re: For 2011

Author: Norman

As for charcuterie, if you need your fix of saucisson, then Longmont's Cheese Importers has the lot. I get my fix of English things from there (e.g. Plum Pudding) when we need to :) NLTK looks interesting.

Re: For 2011

Author: Gaurav

Hey

I keep visiting your blog as my research is close to what you are doing... I was happy to see that there is natural language processing (NLTK) involved in your work... I am also working with NLP tools, and there are a few good ones available: GATE and Stanford CoreNLP can also be used...

Best wishes and Good luck for your goals!

Gaurav

Re: For 2011

Author: Adrià

As for the BD side, I'm sure you may already know them, but let me recommend Christophe Blain (the last one is brilliant: http://goo.gl/5uhoO) and Joann Sfar (http://goo.gl/X07r9). I always enjoy their work.

Re: For 2011

Author: Sean

Thanks for the recommendations, Adrià. I recently read Sfar's "Chasseur-Cueilleur" and have been meaning to start on another series.