[personal profile] jducoeur
So as I've mentioned before, my current programming project is the OP Compiler: taking the existing Order of Precedence and taming it with code, so that it can get fed into a nice, neat, vastly easier-to-maintain database going forward. I figured it would be a meaty but reasonably straightforward project -- after all, Caitlin was inhumanly good with data, and so the old HTML files should be at least *reasonably* consistent, right?

I am beginning to realize that my assumptions were incorrect. Caitlin *was* fabulous with data, and the flat files are perhaps more consistent than any other person could possibly have managed. But even she was human, and dealing with data from a zillion sources, with nothing automated checking the details.

So now I'm up to the point where I am successfully "compiling" a fair number of chronological court-report files (around the past eight years' worth), and all of "A" in the alphas, and I'm finding just how impossible the job had been. Everything *looks* great, and I don't think one person in 100 would catch more than a tiny number of errors. But besides the structural irregularities that I've been pulling my hair out over (mind, those huge files are completely hand-edited HTML, and the format isn't even remotely as consistent as it looks on the screen), it turns out that there are tons of *tiny* data bugs.

Let's just take the King, for example. Now that I'm actually able to print out what the compiler thinks is going on, I find that "Kenric Burn of Northampton" has a Valiant Tyger; "Kenric of Warwick" was a Rattan Champion, has a King's Cypher, and was named Crown Prince; and "Kenrick of Warwick" was Queen's Champion a couple of times, and got the Shield of Chivalry a couple of times. And when I get to parsing the K's, I'm going to have to rewrite his entry so that it cross-references all of these properly.

Mind, none of this is to fault Caitlin -- by and large, she was typing in what she was given by the heralds, and I'm 99% certain that she screened out 90% of the errors that were handed to her. But it is all demonstrating that the job of Shepherd's Crook really *is* impossible to do by hand, and it's miraculous that she managed to make it work as well as she did for as long as she did.

Anyway, the end result of all of this is going to be quite a substantial chunk of code. I am likely to open-source it, more as a way for me to kick the tires of GitHub than because I expect it to ever be used a second time (yes, I'm spending a solid two months writing a program that will, in the end, be run exactly once). But if anyone wants to see a medium-sized body of decently structured and not *excessively* cryptic Scala code, just pipe up and I'll be happy to point you at it and discuss what's going on in it...

(no subject)

Date: 2012-08-23 08:44 pm (UTC)
From: [identity profile] eclecticmagpie.livejournal.com
Piping up, here!

(no subject)

Date: 2012-08-23 08:45 pm (UTC)
From: [identity profile] zevabe.livejournal.com
the job of Shepherd's Crook really *is* impossible to do by hand

What was done in the old days? Are there more courts now? More awards? (I'd imagine that the Society has grown a bunch since then, so there probably are in fact more award recipients.)

(no subject)

Date: 2012-08-23 09:43 pm (UTC)
From: [personal profile] dsrtao
Among the problems I'm sure you've encountered:

- precedence attaches to the person, not the persona
- people can and do change not only their personas, but also the preferred spelling of it...
- and their legal names can change, too

(no subject)

Date: 2012-08-23 10:39 pm (UTC)
From: [identity profile] metahacker.livejournal.com
Are you storing all the variants with each entity, possibly even with which source each variant came from? (e.g., "entity-12345 was named Kenric Burn of Northampton as found in document-1234 dated 11/22/1996"...)

It'd be interesting to watch that evolution too, in some sort of graphical format.
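
For anyone who wants to picture the "variants with each entity, plus where each one came from" idea, here is a minimal sketch. It is in Python for brevity, not the Scala the compiler is actually written in, and it is not a description of how the OP Compiler stores anything. The class names (`Sighting`, `Person`, `Registry`), the entity id, and the document ids are all hypothetical; only the name spellings and awards come from the post.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sighting:
    """One appearance of a name in a source document."""
    spelling: str  # the name exactly as written in the source
    source: str    # hypothetical document identifier
    award: str     # the award recorded in that entry


@dataclass
class Person:
    """One real person, however many spellings the sources use."""
    entity_id: str
    sightings: list = field(default_factory=list)

    def record(self, spelling, source, award):
        self.sightings.append(Sighting(spelling, source, award))

    def spellings(self):
        # Distinct spellings, in the order they were first seen.
        return list(dict.fromkeys(s.spelling for s in self.sightings))

    def awards(self):
        return [s.award for s in self.sightings]


class Registry:
    """Maps every known spelling to the single Person it belongs to."""

    def __init__(self):
        self.by_spelling = {}

    def add_alias(self, person, spelling):
        self.by_spelling[spelling] = person

    def lookup(self, spelling):
        return self.by_spelling.get(spelling)


# Hypothetical ids; the three spellings are the ones quoted in the post.
kenric = Person("entity-0001")
registry = Registry()
for name in ("Kenric Burn of Northampton",
             "Kenric of Warwick",
             "Kenrick of Warwick"):
    registry.add_alias(kenric, name)

# Each entry is filed under whatever spelling the source used,
# but lands on the same Person.
entries = [
    ("Kenric Burn of Northampton", "doc-A", "Valiant Tyger"),
    ("Kenric of Warwick", "doc-B", "King's Cypher"),
    ("Kenrick of Warwick", "doc-C", "Queen's Champion"),
]
for spelling, source, award in entries:
    registry.lookup(spelling).record(spelling, source, award)
```

The point of keeping the raw spelling and the source on every sighting is that nothing is thrown away when two names are merged: the cross-reference is just the alias table, and each award can still be traced back to the document that recorded it.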
Edited Date: 2012-08-23 10:39 pm (UTC)

(no subject)

Date: 2012-08-23 11:55 pm (UTC)
From: [identity profile] metahacker.livejournal.com
All excellent things to capture. Source Document is to me the other interesting bit. At some point Wikipedia became [citation required], and it became a series of pointers to the source for the info. Of course, the pointers become stale, but it's impractical to upload the entire internet; but at least you know where that factoid came from. (Theoretically.)

This is fascinating stuff when you're doing archaeology or genealogy or history. I guess it's why most genealogy apps let you paste in a scan of so-and-so's birth certificate right into their person-card. ("Well, his birthday was 1832 according to my mother, but according to this letter from my grandmother it was 1835, and his birth certificate says 1834 but the 4 is really smudgy. Let's record them all!")

tangentially

Date: 2012-08-24 06:52 pm (UTC)
From: [identity profile] alexx-kay.livejournal.com
You've probably all read this already, but just in case you haven't: Falsehoods Programmers Believe About Names

(no subject)

Date: 2012-08-23 09:35 pm (UTC)
From: [identity profile] dulcinbradbury.livejournal.com
My name has had three variations, and more misspellings than I care to count. So yah... I hear you on this one.

(no subject)

Date: 2012-08-24 01:23 am (UTC)
From: [identity profile] aishabintjamil.livejournal.com
I can attest personally to the fact that Caitlin made corrections on the fly when someone pointed them out. Otherwise it probably would have been *much* worse. There was more than one instance where I sent her corrections and she cleaned up the offending entry post haste. I don't think nearly enough people realized how much of a labor of love that OP was.

(no subject)

Date: 2012-08-24 01:50 pm (UTC)
From: [personal profile] laurion
Fascinating. I can tell you that working with student and faculty data, this is not an isolated problem. I'm constantly receiving communications and requests where names are misspelled, nicknames are used that aren't in official sources, unrecorded name changes have happened, or there are subtle name conflicts where multiple people share identical or near identical names. And with a bit of a fragmentary account system, this can be made worse when we sometimes end up with duplicate accounts....

(no subject)

Date: 2012-08-28 10:50 am (UTC)
From: [identity profile] crschmidt.livejournal.com
It's not just data about humans either; place names have a similar property. Working on a local search product has driven home how much variation there is in how humans both list and search for place names...

(no subject)

Date: 2012-08-28 10:46 am (UTC)
From: [identity profile] crschmidt.livejournal.com
For the record, I think that we spent about 6 months writing software that ended up being used exactly once at MetaCarta. Specifically, it was software for taking a giant software repo, pulling out all the bits that we either never should have had in the repo in the first place, or that we weren't intending to give over to the New Owners of the code, and producing an updated artifact with reasonably full history (including matching revision numbers so that commit comments didn't become wrong, iirc.)

In the end, all of the existing software was insufficient, and the end result was far too specific to be general purpose usable (though a lot of the bits got pushed back up into various related tools that we ended up using or repurposing).

Funnily enough, we did set up a script using off-the-shelf pieces originally -- which was going to take too long to run. "How long is too long" in this case was "Well, it took more than the 4 months we had before we had to shut down the server room it was running in and move it across town to our new location." (The final code ran in a couple days, I think, which was good, because it turned out we had to iterate on the final result several times; this wasn't a case of premature optimization, it was just 'off the shelf software won't solve this problem even if it runs for months'.)
