[This one's just for the programmers in the audience; the rest of you should skip ahead.]
Perhaps the nicest thing about the OP Compiler project is that it's giving me the chance to really get sharp on Scala -- to get a sense of how to program the language idiomatically, the way it's supposed to be used, instead of just being transliterated Java. Occasionally, I write a few lines and marvel at how right and tight they are. Here's an example, which illustrates several elements. (None of which will be surprising to experienced functional programmers, but this stuff is new to me.)
Let's deal with the following problem. The OP Alpha listing consists of tables of a persona name, followed by date/award pairs. The problem at hand is that the "award" part often contains an attribution in parentheses, which is essentially noise -- a comment from my POV. For example:
Queen's Honor of Distinction (Jana IV)
The bit in the parens is messing up parsing the award name, so I need to separate it out.
In many languages, that would take a fair number of lines, but in Scala it turns out to be essentially four (could be more concise, but this seems clearest):
import scala.util.matching.Regex
val awardCommentRegex = new Regex("""^(.*?) \((.*)\)$""", "name", "comment")
val commentMatch = awardCommentRegex.findFirstMatchIn(awardName)
val comment = commentMatch map (_.group("comment"))
val parsedAwardName = (commentMatch map (_.group("name"))).getOrElse(awardName)
Breaking that down:
- The input is "awardName" -- that's the field I'm trying to parse.
- The Regex is a fairly conventional regular expression; if it matches, it breaks the discovered groups into "name" and "comment".
- The assignment to commentMatch does the actual regular-expression matching. That returns Option[Match] -- that is to say, the result contains either "Some(m)", where m is the found match information, or "None". In general, idiomatic Scala uses Option frequently for cases like this, where a function might return a value and might not; it is much safer than returning nulls, and avoids the usual mess of inventing return codes.
- The assignment to comment does a "map", which basically keeps the exterior structure of a collection but changes the interior. In this case, it is transforming the Option[Match] to an Option[String], by extracting the matched comment if there was one. Again, if nothing was matched, it returns None.
- The assignment to parsedAwardName is similar, but this time I want to get a definite String out the end, not an Option[String]. So first I fetch an Option[String]. Then the getOrElse() method either fetches the guts of that Option -- the String itself -- or, if the value is None, returns the original awardName that I started with.
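For anyone who wants to run this, here is a minimal self-contained sketch of the same four lines wrapped in a function. The object name `AwardComments` and the helper `splitAward` are made up for illustration; they are not part of the OP Compiler.

```scala
import scala.util.matching.Regex

object AwardComments {
  // Same pattern as above: a lazily matched name, a space, then a
  // parenthesized comment that runs to the end of the string.
  private val awardCommentRegex =
    new Regex("""^(.*?) \((.*)\)$""", "name", "comment")

  // Hypothetical helper: returns the cleaned-up award name, plus the
  // comment if one was present.
  def splitAward(awardName: String): (String, Option[String]) = {
    // Option[Match]: Some(m) on a match, None otherwise.
    val commentMatch = awardCommentRegex.findFirstMatchIn(awardName)
    // map keeps the Option wrapper and swaps the Match for a String.
    val comment = commentMatch.map(_.group("comment"))
    // getOrElse unwraps the Option, falling back to the original input.
    val parsedAwardName = commentMatch.map(_.group("name")).getOrElse(awardName)
    (parsedAwardName, comment)
  }
}
```

Calling `AwardComments.splitAward("Queen's Honor of Distinction (Jana IV)")` gives the name without the parenthetical and `Some("Jana IV")` as the comment. An award with no parentheses comes back unchanged, with `None` for the comment.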
No deep message here -- I'm just enjoying the elegance of it. I've always thought that I would really like working in pure Scala, and so far that's proving correct. And bit by bit, I'm absorbing the functional-programming models, and coming to appreciate that they have evolved far enough to often be much more concise than the comparable imperative code...
(no subject)
Date: 2012-08-24 01:52 pm (UTC)
"^(.*?)( \(.*)?$"
then _.group("name") will be what you want, regardless of the presence of a comment. (I don't know how greedy Scala's regexes are by default; you might need to change the greediness to get this to work right.)
If you are using the comment for something, you can do the same regex as you are currently using, but change the end to:
... " ?(\((.*)\))?$"
(Again, possibly adjusting greediness.)
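A rough sketch of that second variant, assuming Scala's Regex behaves like the java.util.regex engine it wraps. The names `OptionalComment`, `splitAward`, and the extra "paren" group label are invented here for illustration. Because the parenthesized part is optional, the pattern matches every single-line input, and an absent comment shows up as a null group, which `Option(...)` turns into None.

```scala
import scala.util.matching.Regex

object OptionalComment {
  // Group 1 is the name, group 2 exists only to make the whole "(...)"
  // optional, and group 3 is the text inside the parentheses.
  private val awardRegex =
    new Regex("""^(.*?) ?(\((.*)\))?$""", "name", "paren", "comment")

  // Hypothetical helper: one regex, one match, both pieces.
  def splitAward(awardName: String): (String, Option[String]) =
    awardRegex.findFirstMatchIn(awardName) match {
      // group("comment") is null when the optional group did not take part.
      case Some(m) => (m.group("name"), Option(m.group("comment")))
      // Only reachable for odd input such as embedded newlines.
      case None    => (awardName, None)
    }
}
```

The lazy `(.*?)` does the work here: the engine grows the name one character at a time and stops at the first point where the rest of the string is either empty or a parenthesized tail, so no greediness adjustment is needed for the sample award.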
(no subject)
Date: 2012-08-24 01:59 pm (UTC)
So it's "noise" from the viewpoint of trying to normalize the data. But it's still data...
(no subject)
Date: 2012-08-24 02:17 pm (UTC)
(Hunh. I wonder what happens if you give different groups in the Regex the same string title? Does the Regex constructor throw, does the parsing fail if more than one of those fields is non-None, or does some data get hidden?)
(no subject)
Date: 2012-08-24 02:48 pm (UTC)
As for the sed script -- possibly, but not nearly as easy as it sounds. Mind, most of the code I've written so far *is* normalizing the data. That's a complex process, because there are so many irregularities at so many levels, ranging from badly-formed XHTML tags all the way up to the numerous ways that various award names get written. So I do a little pre-processing (TagSoup to turn the messy HTML into at least *legal* XHTML), but much of the normalization needs a lot of semantics so that the various syntactic structures can be handled appropriately in different places.
(For example, parens in the name at the top of an alpha listing are very semantically different from ones in an award listed inside it. I haven't even talked about the complex code to deal with all the different ways in which cross-references are described, which involves *optional* parentheses...)
(no subject)
Date: 2012-08-24 02:00 pm (UTC)
(no subject)
Date: 2012-08-24 02:04 pm (UTC)
Parse (Input, Name, ^option("(", Comment, ")"));
But a lot is hidden in that.
(no subject)
Date: 2012-08-24 02:05 pm (UTC)
(no subject)
Date: 2012-08-24 06:41 pm (UTC)
This is a feature of the Atlantian OP (and something I leveraged during my tenure as Clerk Signet) that the scribes sometimes use to work their way through backlogged awards for a given reign and could be useful for other statistical purposes.
(no subject)
Date: 2012-08-24 07:07 pm (UTC)
I'd actually planned to have a separate bestowal concept, but decided not to bother when I looked at the DB and couldn't find anywhere that the information could be stored. It's extra work to parse it out -- I need to distinguish between the cases "Queen's Cypher (Elspeth)" vs. "Perseus (Carolingia)", where the very common latter case is describing which *branch* the award comes from. So I'm only going to bother if I can figure out a productive way to represent it in the database. None of the fields jumped out at me as an appropriate place to record bestowal, though.
Mind, I am *also* deriving Court Reports from the existing OP, and the bestowal information is very nearly always implicit from that -- in general, the comment in the Alpha list is usually redundant with the Court Report. But I recognize that there are probably exceptions (especially for out-Kingdom awards), and I'm open to populating that if I can figure out where it's supposed to go...
(no subject)
Date: 2012-08-24 08:50 pm (UTC)
The foreign award concept was not part of my database, but it definitely is in the OP....
It's been a long time since I did any meaningful compiler work, but it looks like you are going to need to do a little bit of lexical analysis of the flat files in addition to your parsing. To me, that means you may have a fairly large "case" statement to filter the monarch/coronet-oriented awards versus a comment indicating a branch. Unfortunately, unless you can guarantee the syntax is consistent, you still are likely to have some manual cleanup to do before or after. :/
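As a very rough sketch of what that "case" filter might look like in Scala: everything below is hypothetical, including the `CommentKinds` object, the `Branch` and `Bestowal` types, and the one-entry branch table. It only shows the shape of the idea, assuming a list of known branch names is available from somewhere.

```scala
object CommentKinds {
  // Two possible readings of the parenthesized comment.
  sealed trait CommentKind
  final case class Branch(name: String) extends CommentKind
  final case class Bestowal(by: String) extends CommentKind

  // Hypothetical lookup table; a real one would have to be built from
  // the OP's own list of branches.
  private val knownBranches = Set("Carolingia")

  // Anything that is not a known branch is assumed to name the person
  // who bestowed the award. That assumption will be wrong for some
  // entries, which is where the manual cleanup comes in.
  def classify(comment: String): CommentKind = comment match {
    case c if knownBranches.contains(c) => Branch(c)
    case c                              => Bestowal(c)
  }
}
```

With that table, "Carolingia" from "Perseus (Carolingia)" is classified as a branch, while "Elspeth" from "Queen's Cypher (Elspeth)" falls through to the bestowal case.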
(no subject)
Date: 2012-08-24 08:28 pm (UTC)