Data is All
Errors in the Gutenkarte picture of Pride and Prejudice remind me of a truth universally acknowledged, that data is everything in this game. Of all the works offered in the demo, I know this novel best. I read it last year, and watched half of the classic 1995 BBC miniseries this past weekend. Gutenkarte is confusing character names with place names: Wickham, Fitzwilliam, and Bingley are the most glaring errors of this class. Not only will Gutenkarte require an all-encompassing place name gazetteer, but also a database of characters, and the sense to exclude one or the other. Works of fiction will also be challenging to geocode. In which database can one find the location of Pemberley, Longbourn, or Netherfield? Until this data is in hand, Gutenkarte can't geocode fictional -- or ancient, for that matter -- works to satisfaction.
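To make the gazetteer-plus-character-list idea concrete, here is a minimal sketch in Python. It is not how Gutenkarte or MetaCarta actually work. The tiny gazetteer, the character set, the approximate coordinates, and the function name find_places are all made up for illustration.

```python
import re

# Hypothetical, hand-made gazetteer: name -> approximate (lat, lon).
# A real gazetteer would hold millions of entries.
GAZETTEER = {
    "London": (51.51, -0.13),
    "Brighton": (50.82, -0.14),
    "Wickham": (50.90, -1.19),   # a real village in Hampshire
    "Bingley": (53.85, -1.84),   # a real town in West Yorkshire
}

# Hypothetical character list for the novel.
CHARACTERS = {"Wickham", "Bingley", "Fitzwilliam", "Darcy"}


def find_places(text, gazetteer, characters=frozenset()):
    """Return gazetteer names found in text, in order of first appearance,
    skipping any name that is also a known character."""
    found = []
    # Naive candidate extraction: single capitalized words only.
    for token in re.findall(r"[A-Z][a-z]+", text):
        if token in gazetteer and token not in characters and token not in found:
            found.append(token)
    return found


sample = "Mr. Wickham left Brighton for London, and Bingley followed."
naive = find_places(sample, GAZETTEER)
filtered = find_places(sample, GAZETTEER, CHARACTERS)
print(naive)
print(filtered)
```

The naive call reports Wickham and Bingley as places; the filtered call keeps only Brighton and London. The catch is that the filter would also drop a genuine reference to the village of Wickham, which is why "the sense to exclude one or the other" is the hard part.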
By the way, I thought Bride and Prejudice [2004] was hilarious, and Pride and Prejudice [2005] pretty good as well.
Update: Jo also notes Gutenkarte's issue with names.
Comments
Re: Data is All
Author: Joel Lawhead
A database of characters would work but it would be a brute-force solution. The real problem is MetaCarta's "natural-language geoprocessor" is simply a keyword search rather than actually parsing sentence structure and trying to derive context and meaning. A true natural-language parser (in Python of course) looks more like this: http://web.media.mit.edu/~hugo/montylingua/
Re: Data is All
Author: Jo Walsh
Joel, MetaCarta's "geographic text search" does a lot more than keyword matching, but as it's a proprietary / patented codebase, we can't know how much more is really going on ;). FWIW, Schuyler says he set the "confidence threshold" for Gutenkarte very low, preferring to get false positives rather than false negatives... Another comparable service is GeoNames' GeoRSS geocoder, which I also heedlessly accused of a brute-force approach, then had to eat my words: http://mappinghacks.com/2006/06/19/there-will-if-necessary-be-a-grass-roots-remapping/#comments http://www.geonames.org/rss-to-georss-converter.html I'm guessing a core problem with this approach is that it's relatively easy to get 85% of the way, and staggeringly hard to get the other 15% right. And they claim we'll have strong AI by 2026 ... ;)