[personal profile] jducoeur
[This one's just for the programmers in the audience; the rest of you should skip ahead.]

Perhaps the nicest thing about the OP Compiler project is that it's giving me the chance to really get sharp on Scala -- to get a sense of how to program the language idiomatically, the way it's supposed to be used, instead of just writing transliterated Java. Occasionally, I write a few lines and marvel at how right and tight they are. Here's an example, which illustrates several elements of that style. (None of which will be surprising to experienced functional programmers, but this stuff is new to me.)

Let's deal with the following problem. The OP Alpha listing consists of tables of a persona name, followed by date/award pairs. The problem at hand is that the "award" part often contains an attribution in parentheses, which is essentially noise -- a comment from my POV. For example:
Queen's Honor of Distinction (Jana IV)
The bit in the parens is messing up parsing the award name, so I need to separate it out.

In many languages, that would take a fair number of lines, but in Scala it turns out to be essentially four (could be more concise, but this seems clearest):
import scala.util.matching.Regex
// Lazy "name" group, then a trailing parenthesized "comment" group
val awardCommentRegex = new Regex("""^(.*?) \((.*)\)$""", "name", "comment")
val commentMatch = awardCommentRegex.findFirstMatchIn(awardName)
val comment = commentMatch map (_.group("comment"))
val parsedAwardName = (commentMatch map (_.group("name"))).getOrElse(awardName)
Breaking that down:
  • The input is "awardName" -- that's the field I'm trying to parse.

  • The Regex is a fairly conventional regular expression; if it matches, it breaks the discovered groups into "name" and "comment".

  • The assignment to commentMatch does the actual regular-expression matching. That returns Option[Match] -- that is to say, the result contains either "Some(m)", where m is the found match information, or "None". In general, idiomatic Scala uses Option frequently for cases like this, where a function might return a value and might not; it is much safer than returning nulls, and avoids the usual mess of inventing return codes.

  • The assignment to comment does a "map", which basically keeps the exterior structure of a collection but changes the interior. In this case, it is transforming the Option[Match] to an Option[String], by extracting the matched comment if there was one. Again, if nothing was matched, it returns None.

  • The assignment to parsedAwardName is similar, but this time I want to get a definite String out the end, not an Option[String]. So first I fetch an Option[String]. Then the getOrElse() method either fetches the guts of that Option -- the String itself -- or, if the value is None, returns the original awardName that I started with.
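Put together, the breakdown above can be sketched as one small self-contained function. This is only a sketch under a few assumptions: the "AwardComment" object and the "split" method are names made up for illustration, and it uses numbered groups (group(1), group(2)) instead of the named ones, so it does not depend on passing group names to the Regex constructor.

```scala
import scala.util.matching.Regex

// Hypothetical wrapper object; not part of the original project.
object AwardComment {
  // Same pattern as in the post: lazy name, then a parenthesized comment at the end.
  val awardCommentRegex: Regex = """^(.*?) \((.*)\)$""".r

  // Returns (parsedAwardName, comment); comment is None when there is no
  // trailing parenthesized part.
  def split(awardName: String): (String, Option[String]) = {
    val commentMatch = awardCommentRegex.findFirstMatchIn(awardName)
    // Option[Match] => Option[String]: None stays None.
    val comment = commentMatch.map(_.group(2))
    // Unwrap the name, falling back to the untouched input when nothing matched.
    val parsedAwardName = commentMatch.map(_.group(1)).getOrElse(awardName)
    (parsedAwardName, comment)
  }
}
```

Called on the example line from above, split should give back the bare award name plus Some("Jana IV"); an input with no parentheses comes back unchanged, paired with None.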
Mind, everything here is statically and strongly typed -- Scala insists on it throughout, so type errors get caught at compile time instead of surfacing at runtime. (Indeed, despite being newish to the language, I'm making very few runtime errors.) It's almost as concise as possible due to Scala's type inference -- while I'm not *declaring* object types above, that's because they are redundant, and Scala will simply figure them out for me. (The Eclipse plugin shows the inferred types when you hover over a name, so you can quickly check your work.)
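For comparison, here is roughly what the same values look like with the types spelled out by hand. It is a sketch, not the project's code: the "InferredTypes" object and the hard-coded input are assumptions for illustration, and it uses numbered groups to stay independent of the named-group setup. Every annotation here is one the compiler would otherwise infer.

```scala
import scala.util.matching.Regex

// Hypothetical holder object, used only so the snippet compiles on its own.
object InferredTypes {
  val awardName: String = "Queen's Honor of Distinction (Jana IV)"

  val awardCommentRegex: Regex = """^(.*?) \((.*)\)$""".r
  val commentMatch: Option[Regex.Match] = awardCommentRegex.findFirstMatchIn(awardName)
  val comment: Option[String] = commentMatch.map(_.group(2))
  val parsedAwardName: String = commentMatch.map(_.group(1)).getOrElse(awardName)

  // Uncommenting the next line is a compile-time type mismatch:
  // an Option[String] is not a String.
  // val wrong: String = comment
}
```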

No deep message here -- I'm just enjoying the elegance of it. I've always thought that I would really like working in pure Scala, and so far that's proving correct. And bit by bit, I'm absorbing the functional-programming models, and coming to appreciate that they have evolved far enough to often be much more concise than the comparable imperative code...

(no subject)

Date: 2012-08-24 01:52 pm (UTC)
From: [identity profile] marphod.livejournal.com
If the comment is just noise, why not use a regex like this:


"^(.*?)( \(.*)?$"


then _.group("name") will be what you want, regardless of the presence of a comment. (I don't know how greedy Scala's regexes are by default; you might need to change the greediness to get this to work right.)

If you are using the comment for something, you can do the same regex as you are currently using, but change the end to:

... " ?(\((.*)\))?$"

(Again, possibly adjusting greediness.)
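A rough sketch of that second variant, for anyone who wants to try it. Scala's Regex is built on java.util.regex, where quantifiers are greedy by default and a trailing "?" makes them lazy, so the lazy "(.*?)" from the original pattern is kept as-is. The "OptionalComment" object and "split" method are made-up names, and the groups are read by number: with the extra wrapping parentheses the comment text becomes group 3, not group 2.

```scala
// Hypothetical object name; the pattern is the original one with the suggested ending.
object OptionalComment {
  // group 1 = award name, group 2 = "(comment)" with parens, group 3 = comment text
  val awardRegex = """^(.*?) ?(\((.*)\))?$""".r

  def split(awardName: String): (String, Option[String]) =
    awardRegex.findFirstMatchIn(awardName) match {
      // An optional group that did not take part in the match comes back as null,
      // so Option(...) turns it into None.
      case Some(m) => (m.group(1), Option(m.group(3)))
      // Defensive fallback, e.g. for input containing line breaks.
      case None    => (awardName, None)
    }
}
```

With this shape the name always comes from the same group, so the getOrElse fallback on the name is no longer needed; the cost is one more group to keep track of.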

(no subject)

Date: 2012-08-24 02:17 pm (UTC)
From: [identity profile] marphod.livejournal.com
I guess my background is showing when I'd want to run a sed script to normalize the data before parsing it. Strip parens, add field demarcation between the Title and Comment, or after the Title on lines without a comment, etc. =)

(Hunh. I wonder what happens if you give different groups in the Regex the same string title? Does the Regex constructor throw, does the parsing fail if more than one of those fields is non-None, or does some data get hidden?)

(no subject)

Date: 2012-08-24 02:04 pm (UTC)
From: [identity profile] corwyn-ap.livejournal.com
In a former life I would have written:

Parse (Input, Name, ^option("(", Comment, ")"));

But a lot is hidden in that.

(no subject)

Date: 2012-08-24 06:41 pm (UTC)
From: [identity profile] dragonazure.livejournal.com
Don't discount the parenthetical information as "a comment" if it isn't already captured in some other way. Based on the limited information you've presented, I think it probably should be another field that indicates what monarch(s) or coronet(s) bestowed the award.

This is a feature of the Atlantian OP (and something I leveraged during my tenure as Clerk Signet) that the scribes sometimes use to work their way through backlogged awards for a given reign, and it could be useful for other statistical purposes.

(no subject)

Date: 2012-08-24 08:50 pm (UTC)
From: [identity profile] dragonazure.livejournal.com
Sorry, but it was an MS Access database when I was Clerk Signet. We had the monarchs/reign as a separate table and cross-referenced to the award record via a numerical index to the monarch/reign table. I didn't deal with Baronial awards other than the occasional courtesy if someone remembered to CC me with the court report. From outward appearances, it seems that Cassandra may have expanded that table to include Territorial Barons & Baronesses--but she may have developed some other scheme to manage the information. I haven't seen the new database schema, but I don't think it would be radically different from what I was using (as the Clerk Signet database was an extension of the original OP database design).

The foreign award concept was not part of my database, but it definitely is in the OP....

It's been a long time since I did any meaningful compiler work, but it looks like you are going to need to do a little bit of lexical analysis of the flat files in addition to your parsing. To me, that means you may have a fairly large "case" statement to filter the monarch/coronet-oriented awards versus a comment indicating a branch. Unfortunately, unless you can guarantee the syntax is consistent, you still are likely to have some manual cleanup to do before or after. :/
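In Scala, that kind of "case" statement might look something like the pattern match below. This is purely an illustrative sketch: the "Attribution" types, the "classify" function, and the idea of passing in sets of known reign and branch names are all assumptions, not anything taken from the actual OP data or schema.

```scala
// Hypothetical classification of the parenthetical text; all names are illustrative.
object CommentKind {
  sealed trait Attribution
  case class Reign(name: String) extends Attribution    // monarch/coronet who gave the award
  case class Branch(name: String) extends Attribution   // a branch mentioned in the comment
  case class Unknown(text: String) extends Attribution  // left for manual cleanup

  // knownReigns and knownBranches are assumed lookup sets supplied by the caller.
  def classify(comment: String,
               knownReigns: Set[String],
               knownBranches: Set[String]): Attribution =
    comment.trim match {
      case c if knownReigns.contains(c)   => Reign(c)
      case c if knownBranches.contains(c) => Branch(c)
      case c                              => Unknown(c)
    }
}
```

Anything that lands in Unknown is exactly the inconsistent-syntax residue that would still need a manual pass.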

(no subject)

Date: 2012-08-24 08:28 pm (UTC)
From: [personal profile] laurion
Yup. Good code that follows the 'does what it says on the box' principle, i.e., code that perfectly matches well-reasoned logic. It may let you skip the pseudocode steps frequently needed in going from logic to code.
