jducoeur: (Default)
I'm hitting the point of diminishing returns, so it's about time for me to give up on trying to polish the data; please forgive the duplications that make their way into the final online Order of Precedence, which will have to be merged by hand after it goes live. I've eliminated many thousands of duplicate records, but I'd be surprised if fewer than a few thousand make it in. (There are still about 9000 incomplete records -- more than the 6000 I was targeting, but I think we'll have to live with it.)

But the system continues to be disconcertingly smart. Today, it complained to me that we had duplicate alpha entries for "Elizabeth Vynehorn" and "Muirne ni Cormaic", which led me on a merry chase: I couldn't figure out *why* it had decided that they were synonyms (I have begun to regret not building a system that records the reasoning, which gets pretty subtle and obscure from run to run), but fortunately found her LJ -- I hadn't realized that Muirne had changed her name. So I've updated my copy of the old OP accordingly.

Oh, in case anybody is interested -- one artifact of this project will be my final master copy of the old HTML files. These are massively cleaned-up HTML, and have many errors and duplications of this sort fixed. Folks are welcome and encouraged to refer back to these files after the new system goes live, since they are the data that the new OP will be bootstrapped from. The "alpha", "awards" and "chrono" directories roughly correspond to the files on op.eastkingdom.org, but with a great deal of massaging.

And the record for longest "alternate names" field goes (no surprise) to Mistress Nataliia Anastasiia Evgenova Sviatoslavina vnuchka, whose name is so long, and *never* spelled quite correctly in the Court Reports, that she winds up with 515 characters of alternate name field so far. (Far more than the 255 allowed -- I had to introduce some trimming code to keep her entry from breaking the database. I think she'll survive without every single misspelling recorded for posterity in her record.)
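
For the programmers: here is a minimal sketch of what that kind of trimming could look like. It is not the actual code from the OP Compiler; the function name, the "; " separator, and the choice to drop whole names rather than cut one in half are all assumptions made for illustration. Only the 255-character limit comes from the paragraph above.

```python
MAX_ALT_NAMES_LEN = 255  # the field limit mentioned above


def trim_alternate_names(names, limit=MAX_ALT_NAMES_LEN, sep="; "):
    """Join alternate names in order, stopping before the limit is exceeded.

    Hypothetical helper: keeps whole names only, so a misspelling is either
    recorded completely or dropped, never cut off mid-word. If even the first
    name is longer than the limit, the result is an empty string.
    """
    kept = []
    length = 0
    for name in names:
        # The separator only costs characters once something is already kept.
        extra = len(name) + (len(sep) if kept else 0)
        if length + extra > limit:
            break
        kept.append(name)
        length += extra
    return sep.join(kept)
```

Stopping at the first name that doesn't fit keeps the earliest-seen spellings, which seemed like the least surprising policy; a real system might prefer to keep the most frequent ones instead.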

Anyway, continuing to plow through, and finish the current round of synonyms. When it is asking me whether Nathaniel Wyatt and Karrah the Mischevious are the same person, we're definitely running out of good guesses. (Yes, there was a reason -- they apparently were inducted into the White Oak the same day. Still, not exactly a high-quality guess...)
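
To illustrate why a guess like that can come up at all, here is a small sketch of a "same award, same day" heuristic. This is an assumption about the general idea, not a description of how the compiler actually generates its proposals; the function name and the record layout are hypothetical, and the date in the usage example is a placeholder, not the real one.

```python
from collections import defaultdict
from itertools import combinations


def same_day_candidates(records):
    """Propose name pairs that received the same award on the same date.

    `records` is an iterable of (name, award, date) tuples (a hypothetical
    layout). Returns a sorted list of (name_a, name_b) pairs, each pair in
    alphabetical order.
    """
    groups = defaultdict(set)
    for name, award, date in records:
        groups[(award, date)].add(name)

    pairs = set()
    for names in groups.values():
        # Every pair of distinct names in a group is a (weak) candidate.
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Placeholder date, for illustration only.
records = [
    ("Nathaniel Wyatt", "White Oak", "2000-01-01"),
    ("Karrah the Mischevious", "White Oak", "2000-01-01"),
    ("Nigel Tarragon", "AoA", "2000-01-01"),
]
candidates = same_day_candidates(records)
```

On its own this is exactly the low-quality signal described above: everybody inducted at the same court gets paired with everybody else, so it is only useful as a last resort, or combined with some stronger evidence.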
jducoeur: (Default)
[This one's gonna get pretty technical; be warned. It's kinda bragging (in the "look at the size of my brain" sense), but dammit, I have spent a *lot* of time on the bloody OP Compiler, and I need to get at least a little ego-boo out of it. Programmers may actually want to give it a deep read, since it wound up as an exercise in practical data-mining.]

I've talked before about the Order of Precedence Compiler project. I'm taking the old, flat-file OP, "compiling" it into a nice normalized database, and spitting it out into MySQL format for the database system that tpau picked out. That's mostly done: I'm reading in nearly all of the files successfully, writing them out, and we're able to at least mostly run the new system with the old data.

There's one huge snag, though: the data is *magnificently* inconsistent. This isn't really about typos -- while there are a few errors and inconsistencies here and there, the vast majority of the problem is much more pedestrian, with two main causes:
  • People change their names a lot.

  • A lot of SCAdians have hard-to-spell names that they don't use consistently.
Why is this a problem? You'll recall that we have three sets of data -- the Court Reports (in the form originally recorded, more or less), the Alphabetical Listing (which generally is indexed by folks' preferred form of their name), and the List by Award (for each award, everybody who has ever received it, in order). It is not unusual for a given award to be recorded under two or even three different names in these three lists.

And the hell of it is, the system has no idea that those are the same person. This is where a hand-maintained system is very different from a program: it might be horribly obvious to a human that (picking an example at random) "Nigell Tarragon" in the alpha list is the same person as "Nigel Tarragon" in the court report, but the program doesn't know that. The names are different, period. And often it's not just one character difference: there are entire words missing, first and last names switched, or the *extremely* common case where somebody changes their name after getting their AoA.
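
For the programmers, a tiny sketch of the problem. A plain string comparison says "Nigell Tarragon" and "Nigel Tarragon" are simply different, whereas a similarity score can at least flag them as worth a look. This uses Python's standard difflib and is only an illustration of the general idea, not the approach the OP Compiler actually takes; the function names are made up for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def name_similarity(a, b):
    """Return a score in [0, 1]; higher means the names look more alike.

    Hypothetical helper. Compares the names directly and also with their
    words sorted, so that switched first and last names still score highly.
    """
    a_n, b_n = normalize(a), normalize(b)
    direct = SequenceMatcher(None, a_n, b_n).ratio()
    sorted_a = " ".join(sorted(a_n.split()))
    sorted_b = " ".join(sorted(b_n.split()))
    reordered = SequenceMatcher(None, sorted_a, sorted_b).ratio()
    return max(direct, reordered)


exact_match = "Nigell Tarragon" == "Nigel Tarragon"          # False
close_score = name_similarity("Nigell Tarragon", "Nigel Tarragon")
swapped_score = name_similarity("Tarragon Nigel", "Nigel Tarragon")
```

A score like this only helps with misspellings and reordered words. It does nothing for the case where somebody changes their name outright after getting their AoA, which needs some other kind of evidence entirely.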

The result was that the new award system was staggeringly full of duplicate entries: multiple records of somebody getting an award with different names, when it should be a single record with alternate names. What to do?

My original reaction was that I was going to have to mount a massive effort, recruit dozens of people to scour the data meticulously and look for these duplications. But that was going to take dozens or hundreds of man-hours, and would still be hugely error-prone. So about three weeks ago, I paused and decided to step up and Be a Programmer.
( Now we get the really technical stuff )
So that's the state of the data-cleanup. I invite those who like Data to come take a look at the current state of the synonym file -- my gigantic list of alternate personae, misspellings, and so on. It's over 2000 entries so far, nearly all of them found by the program. Many are fairly trivial -- by and large, I've been pretty conservative, only accepting proposals when I'm reasonably confident that they are correct -- but it's done a nice job of spotting completely different alternate personae that I just happen to know are right.

Tell me if you find any errors in the synonym file -- it wouldn't surprise me if there are a few, and I'd rather catch them now than later. (Although fixing the occasional problem in the new system won't be hard.) In the meantime, I'm continuing to plug away, and improve the data as much as I can in the relatively little time we have left before the new system goes live...
