No clue, although it seems like a bad idea. (You can, of course, access groups by index as well -- the names are mainly a convenience, far as I can tell.)
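To illustrate the point about indexes versus names, here is a minimal sketch in Python (not the language or code of the actual project; the pattern and sample string are made up for illustration). It just shows that a named group can still be read by its position:

```python
import re

# Hypothetical pattern: a name followed by a parenthesized date.
# Both groups are named, but they still get positional indexes 1 and 2.
pattern = re.compile(r"(?P<name>[^()]+?)\s*\((?P<date>[^()]*)\)")

m = pattern.match("John Smith (2012-08-24)")

# Access by name (the convenient way)...
by_name = (m.group("name"), m.group("date"))
# ...or by index (always available, named or not).
by_index = (m.group(1), m.group(2))
```

Both tuples hold the same values; the names mostly save you from recounting parentheses when the pattern changes.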
As for the sed script -- possibly, but not nearly as easy as it sounds. Mind, most of the code I've written so far *is* normalizing the data. That's a complex process, because there are so many irregularities at so many levels, ranging from badly-formed XHTML tags all the way up to the numerous ways that various award names get written. So I do a little pre-processing (TagSoup to turn the messy HTML into at least *legal* XHTML), but much of the normalization needs a lot of semantics so that the various syntactic structures can be handled appropriately in different places.
(For example, parens in the name at the top of an alpha listing are very semantically different from ones in an award listed inside it. I haven't even talked about the complex code to deal with all the different ways in which cross-references are described, which involves *optional* parentheses...)
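As a rough sketch of why the *optional* parentheses are awkward, here is a small Python example. Everything in it is an assumption for illustration: the "see ..." wording, the function name `parse_xref`, and the sample strings are hypothetical, and the real data clearly has many more variants than this handles.

```python
import re
from typing import Optional

# Hypothetical cross-reference formats: "(see Some Name)" or "see Some Name".
# The two alternatives are kept separate so that parens are only stripped
# when they wrap the whole reference, not when they belong to the name.
XREF = re.compile(
    r"^(?:\(\s*see\s+(?P<wrapped>.+?)\s*\)|see\s+(?P<bare>.+?))$",
    re.IGNORECASE,
)

def parse_xref(text: str) -> Optional[str]:
    """Return the cross-reference target, or None if text is not one."""
    m = XREF.match(text.strip())
    if m is None:
        return None
    # Exactly one of the two named groups participates in a match.
    return m.group("wrapped") or m.group("bare")
```

Note that `parse_xref("see Jane (the Elder)")` keeps the inner parentheses as part of the name, which is the sort of context-dependent treatment a line-oriented sed script has trouble expressing.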
(no subject)
Date: 2012-08-24 02:48 pm (UTC)