Itertools to the rescue
I'm waist-deep in a messy data wrangling project with two preliminary conclusions: that Inkscape and the Gimp are not the best platform for collecting ground control points for imagery, and that itertools has some great functional programming recipes.
I'm collating vertices of simple paths drawn in each application. Inkscape's SVG expresses the paths in the simplest fashion: M(ove to) X Y L(ine to) X Y .... Gimp exports them – pointlessly in this case – as curves:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
                  "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
    <svg xmlns="http://www.w3.org/2000/svg"
         width="109.417in" height="36.4722in"
         viewBox="0 0 7878 2626">
      <path id="0102"
            fill="none" stroke="black" stroke-width="1"
            d="M 349.00,512.00
               C 349.00,512.00 475.00,1245.00 475.00,1245.00
                 475.00,1245.00 481.00,2054.00 481.00,2054.00
                 481.00,2054.00 3792.00,2108.00 3792.00,2108.00
                 3792.00,2108.00 7555.00,2127.00 7555.00,2127.00
                 7555.00,2127.00 7618.00,1528.00 7618.00,1528.00
                 7618.00,1528.00 7738.00,588.00 7738.00,588.00
                 7738.00,588.00 3743.00,550.00 3743.00,550.00" />
    </svg>
It's full of duplicate vertices: couplets at the ends and triplets in the middle of the path. Itertools provides a handy recipe for getting just the unique vertices:
    from itertools import groupby, imap
    from operator import itemgetter

    def unique_justseen(iterable, key=None):
        "List unique elements, preserving order. Remember only the element just seen."
        # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
        # unique_justseen('ABBCcAD', str.lower) --> A B C A D
        return imap(next, imap(itemgetter(1), groupby(iterable, key)))
Without having to think hard at all, I can get a list of locally unique pairs of numbers like [(349.00, 512.00), (475.00, 1245.00), (481.00, 2054.00), ...]. If you're using Python 2.5 (or older), without the builtin next() function, you'll need to replace it with an equivalent function like lambda x: x.next(). The itemgetter function from the operator module recasts Python's sequence item accessor in functional form:
    >>> itemgetter(1)('ABCDEFG')
    'B'
    >>> itemgetter(1,3,5)('ABCDEFG')
    ('B', 'D', 'F')
    >>> itemgetter(slice(2,None))('ABCDEFG')
    'CDEFG'
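Putting the pieces together, here is a rough sketch of how the Gimp path data might be turned into that list of locally unique pairs. It is written for Python 3, where the built-in map is lazy and stands in for itertools.imap. The path_vertices helper and its regular expression are my own assumptions for illustration, and they only cope with plain decimal numbers like the ones in the export above.

```python
import re
from itertools import groupby
from operator import itemgetter


def unique_justseen(iterable, key=None):
    """Yield elements in order, skipping consecutive duplicates."""
    # Python 3's built-in map is lazy, so it replaces itertools.imap here.
    return map(next, map(itemgetter(1), groupby(iterable, key)))


def path_vertices(d):
    """Hypothetical helper: locally unique (x, y) pairs from a path's d attribute."""
    # Pull out every plain decimal number, ignoring the M and C command letters.
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", d)]
    # Pair consecutive values: even positions are x, odd positions are y.
    pairs = zip(numbers[0::2], numbers[1::2])
    return list(unique_justseen(pairs))


# The first few coordinates of the Gimp path shown earlier.
d = ("M 349.00,512.00 C 349.00,512.00 475.00,1245.00 475.00,1245.00 "
     "475.00,1245.00 481.00,2054.00 481.00,2054.00")
vertices = path_vertices(d)
print(vertices)
# [(349.0, 512.0), (475.0, 1245.0), (481.0, 2054.0)]
```

Because groupby only collapses runs of adjacent equal items, a vertex that the path revisits later would still show up again, which is the "locally unique" behaviour wanted here.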
The collated output I'm after looks like -gcp 349.0 512.0 723.90418 2348.0269 0.0 ... . You can probably see where I'm going with this.
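As a minimal sketch of that collation step, assuming the second pair of numbers in each -gcp entry comes from the matching vertex of the other application's path and the trailing 0.0 is a constant, something like the following would do. The gcp_args name and its parameters are hypothetical, and only the single pair quoted above is used as input.

```python
def gcp_args(gimp_points, other_points, elevation=0.0):
    """Hypothetical helper: pair two vertex lists up as -gcp arguments."""
    args = []
    # zip stops at the shorter list, so unmatched vertices are dropped silently.
    for (x1, y1), (x2, y2) in zip(gimp_points, other_points):
        args.append("-gcp %s %s %s %s %s" % (x1, y1, x2, y2, elevation))
    return " ".join(args)


# The one matched pair quoted in the text.
gimp_points = [(349.0, 512.0)]
other_points = [(723.90418, 2348.0269)]
result = gcp_args(gimp_points, other_points)
print(result)
# -gcp 349.0 512.0 723.90418 2348.0269 0.0
```

This relies on both paths having been drawn through the same features in the same order, which is an assumption worth checking by comparing the lengths of the two lists first.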