jducoeur

Here's a question for the Perl and/or regexp experts in the audience; all help is solicited.

ProWiki has a query language built in. Simplifying greatly, the syntax looks like this:

{? [query terms] : [display results] ?}

This translates roughly as "for each page that matches the given query terms, show the display results, interpolating the properties of the page". That all works nicely, and is at the heart of what ProWiki does.

The problem is, I'd really like to be able to do this recursively. That is, I'd like to be able to construct a query like (to take today's example, one of many):

{?~Faction : %%Name%% -- {?~Character && Faction==%%PAGENAME%% : %%Name%% ?} ?}

That would translate as something like, "For each Faction, display the Faction's Name, and then for each Character in that Faction, display the Character's Name". Basically, nested foreach loops.

That's conceptually straightforward, but I'm stuck on how to parse it. ProWiki, being based on UseMod, uses Perl regex for its parsing. That mostly works fine, but I can't figure out how to get it to work recursively. I need to find the *matching* {? ?} pairs, extracting as plaintext any pairs that might be contained inside them. (The Perl code itself will then deal with the recursion into the plaintext subexpression.)

Can this be done straightforwardly in regex? It seems like a fairly common problem -- it's basically a fancy variant of parenthesis matching -- but I'm not hip enough to regex to see the answer. It's not simply a matter of matching first and last delimiters in the string, since a given page might contain several unrelated top-level expressions; therefore, I need to find the genuinely *matching* delimiters.

I know there are a bunch of Perl gurus out there, so if you can outline the solution to me (even the solution to the basic parenthesis-matching problem would probably show me how to do it), I'd be grateful. Thanks...

Flat | Top-Level Comments Only

From:

siderea

I am not studly enough to help you with your regexp question. :(

But I wanted to make a syntax suggestion based on my running "PHP vs. CF, Which Sucks More" log.

I highly recommend having a flavor of loop control which is foreachish, in that it runs once for each member of a set, but when nested, by default, inherits. A sort of loop, in which you won't have to write "&& Faction==%%PAGENAME%%" because that's the default.

Because in a web-results language, you do that all. The. Time.

I find I want to express your second concept in something VB-ish:

%%factions{.name .characters{.name}}%%

That probably wouldn't work with arbitrary objects, because English puralizes poorly.

From:

jducoeur

Innnnteresting. Good observations. I hadn't thought about that because I haven't used that construction before -- but of course, the only reason I haven't used it is that I can't parse it yet. I'm not going to change things for the older ProWiki system (which is nearing the end of its life), but this is worth thinking about for the next-gen Quirki system.

On the theoretical level, I twitch slightly, because there is no *formal* relationship between the Faction objects and the Faction property of the Character objects that happen to point to them -- the two type names match only by convention. That said, it's likely to be a common construction, and one worth encouraging for purposes of making clear and obvious databases, so providing neat defaults that DWIM if you follow that model makes a certain amount of sense.

Very interesting syntax suggestion; I'll give that some serious consideration. I've been tending to think in terms of a more explicit and detailed syntax (along the lines of typical programming-language expressions) because it's easy to see how to fully generalize it, but your suggested model is nicely clear. I need to think about whether there's a good middle ground, that provides both the clarity of yours and the power of mine. (This also plays into the question of how much people are going to be writing the expressions themselves, and how much using a WYSIWYG editor, which I'm still mulling.)

As for pluralization, it's an interesting point in context, since Rails (my likely platform for the next generation) *does* have a built-in pluralization mechanism. But I doubt I want to rely on that for anything -- I'd be willing to bet that it's quirky and often inaccurate...

From:

siderea

Glad you enjoyed that. Are you unfamiliar with CF? If so, at some point I should explain the CF CFOUTPUT syntax and usage to you. Not that I think you necessarily want to adopt them, but because having that paradigm floating around in your brain colliding with other stuff will probably be usefully inspiring.

there is no *formal* relationship between the Faction objects and the Faction property of the Character objects that happen to point to them

Why not? I would think that asserting all object/property names within a namespace/wikispace must be unique (except some special built-in keywords like "name" and "number" and "length" and "type", perhaps), so that a reference to an class as a property means a the property is a foreign key to the object. I'm trying to imagine a case where that is bad, and not coming up with something, though that may be an insufficiency of caffeine on my part.

From:

jducoeur

I'm only slightly familiar with CF -- what it is and does, but no details. (It's fairly unusual for me to be working with an application that relates closely enough to the DB for CF to be appropriate.)

I would think that asserting all object/property names within a namespace/wikispace must be unique

I'll think about it. The language geek in me somewhat rebels at the notion of having type relationships be implicit in the name like that, but I can't deny that there's a strong ease-of-use argument for it, and I'm not sure there are any compelling use cases against it...

From:

metahacker.livejournal.com

I hacked at this for a while, but it looks like one answer is "wait a few weeks":
http://dev.perl.org/perl6/doc/design/exe/E05.html
search for "Match nested parentheses maintainably"

Given that this is asking a regexp to act as a PDA it's not actually that simple...

From:

metahacker.livejournal.com

BTW, if all you want is non-recursive out to, say, 2 levels of nested parens, that's much easier. for the above example:

/\{\?~Faction : (\w+) -- (?:\{\?~Character && Faction==(\w+) : \1 \?\})* \?\}/
or something like that. The (?: foo) construct groups without returning as a match; the \1 harks back to the faction name or should.

{? and ?} are, of course, pathological in perl patterns and have to be escaped...who picked those delimiters? ;)

From:

jducoeur

Well, I can't do such an explicit match -- things like "Faction" and "Character" are all specific type names, and have nothing to do with the underlying parser. (They live at totally different level: the parser is part of the platform, and those types live in a particular database built *on* that platform.)

Still, it's an interesting point. For my current purposes, one level of nesting probably suffices. OTOH, I suspect it's more compatible with my code to try to do this truly recursively, diving into the subexpression with a new context.

{? and ?} are, of course, pathological in perl patterns and have to be escaped...who picked those delimiters? ;)

Yeah, yeah. I refuse to let my implementation language dictate my own query language -- that just lets me in for more annoyance in the long run.

(Although, truth to tell, the syntax Just Grew; there was never much design to it. Hence, I'm going to rethink it for Quirki, and decide whether I want to keep it or not...)

From:

metahacker.livejournal.com

You can construct patterns on the fly using variables, but that requires actual perl code; or use alternation plus grouping to return the key as well as the value:
/((faction|organization|company): (\w) )*/
will return things like "faction", "spark", "company", "heterodyne foundation", ...

From:

jducoeur

No, no -- this level knows nothing whatsoever of the names of these things, and indeed can't. They're *all* defined at the DB level: that's the point. It would defeat the purpose for the Perl code to have to know anything at all about either the values *or* the keys; moreover, the possible patterns are deliberately arbitrary. Remember, this expression is defining a database query *language*, not a specific kind of query; it's very general.

The actual expression is complex enough that trying to express it as a regex is pretty horrible, so here's a fast cut at a rough BNF (off the top of my head, and using curly braces instead of angle brackets so I don't have to escape all the HTML):

{page query} ::= "{?" {page descriptor} [{operator} {page descriptor}]* ":" {result display} "?}"
{page descriptor} ::= {property descriptor} | {inheritance descriptor}
{property descriptor} ::= {property name} ["==" {property value}]
{inheritance descriptor} ::= "~" {object name}
{operator} ::= "&&" | "||" | "&!"
{property name}, {property value}, {object name} 
         ::= semi-arbitrary phrases, including words and spaces

So it's already a pretty complex parse to begin with, but I have it working. The current challenge comes in the {result display}, which is completely arbitrary HTML with other types of queries and variable instantiations in it: I want to be able to embed nested {page query}s inside of it...

From:

jducoeur

Hmm. Not actually an option to use the new stuff, I'm afraid: all of this is running on my ISP, which tends to take its time about adopting new releases of Perl, so I have no idea when they'll pick it up. And I really want the feature *now*, for immediate work -- it would make it easier for me to create composite data-analysis pages, to see what's going on at a glance. Still, useful to know that this is coming down the pike; hopefully it'll propagate directly into Ruby.

Thanks for the pointer. That "match nested parens" section looks like just what I was looking for. Now I just have to sit down and understand exactly what it's doing.

Given that this is asking a regexp to act as a PDA it's not actually that simple...

Oh, absolutely. Indeed, the only reason I assumed that it was possible is that I keep hearing folks bragging on Perl's parsing capabilities, and this is *such* a common problem I figured there had to be a solution...

From:

baavgai.livejournal.com

Not a direct perl guru answer, but rather an observation that may help. Your question looks a whole lot like XSLT processing.

The methodology is that select="//X" returns a node set of all X elements. Further select="//X[@foo]" returns a node set that has attribute foo, and select="//X[@foo='bar']" returns a set where foo exists and equals the scalar bar.

This is a matching syntax and has nothing to do with the display syntax. The display is to work on elements X and show what you would of them. I don't know if this helps in the slightest, but it's what your post made me think of.

From:

jducoeur

Not directly, but it seems to be cognate with the correct perl answer, which I think

metahacker pointed to above...

From:

mindways

I *know* that you can't do what you want to do using VB's RegEx object. You can write a regex that will handle a specified depth of nesting, but it is not possible to handle the general case. (I had to deal with this at work.)

I *believe* that current Perl regex syntax is similarly constrained. (Haven't had a chance to follow and review the link posted above re: what's coming in a few weeks.)

However, if you're just looking for chunks enclosed by simple begin-delimiter/end-delimiter pairs, you don't necessarily need regexes - unless I'm grievously misremembering, you can find all such matching pairs and their contents (building up a parse tree of whatever sort you want) in a single beginning-to-end pass through the string. Not as elegant as a one-line regex, but not hideously complex, either, and it'll get the job done whereas the regex (currently) won't.

(Or you could use Lisp. Excessive refactoring to solve a small problem, ahoy! ;)

From:

jducoeur

Oh, I can definitely do my own recursive-descent parser -- for the next-gen system, I may well build one, although I haven't decided that yet. But for now, I don't want to invest a lot of effort in the older system (which is nearing the end of its life), so I don't particularly want to rewrite the parser from scratch if I can avoid it. Realistically, if it comes to that, I'll just live with the limitations for this game: there's nothing in GG that desperately *needs* this feature, it would just be handy.

So the question is mainly what I can do within the current architecture (inherited from the original code), which is regex-based. If I can find a relatively straightforward enhancement that gets me the nesting, I'll do that...

From:

asim.livejournal.com

This below assumes you can install modules...you've already seen (and figured out yourself) that the "pure" regex solution is hairy, at best.
I know you don't want to re-write the parser, but...have you looked at the Parse::RecDescent (http://search.cpan.org/~dconway/Parse-RecDescent-1.94/lib/Parse/RecDescent.pod) module?
Otherwise, since you don't want to store names (and you could write a routine that pulls the names for the patterns by just querying the DB), you might want to consider -- if I understand the problem alright -- storing the position, and "rewinding" manually as you go. Regex::Iterator (http://search.cpan.org/~simonw/Regex-Iterator-0.3/lib/Regex/Iterator.pm) has a "rewind" function that can help with that.

You might want to also look into the Text::Balanced (http://search.cpan.org/~dconway/Text-Balanced-v2.0.0/lib/Text/Balanced.pm) module.

Hope this helps!

From:

jducoeur

I had noticed RecDescent in passing, and was wondering about it. I suspect it's not worth that deep a rewrite right now (more precisely, I don't think I have *time* for that much of a rewrite right now), and I'm not 100% certain about how much module control I have, but it sounds like it's worth looking into. If nothing else, these might be useful libraries for the Querki rewrite. Thanks for the recommendations!

From:

http://users.livejournal.com/merle_/

In normal regular expressions, I don't believe you can. It's possible that some languages provide recursion, and Perl's /e modifier goes well beyond what a "normal" regexp can do. As

metahacker noted, if you know the number of levels, you can hardcode it, but it won't do an infinite number of levels of recursion.

I'm surprised, though, that Perl5 can match strings of validly nested parentheses. That was the canonical example in my coursework of "something a regexp cannot do, because a FSA cannot do it".

From:

learnedax.livejournal.com

The paren-matching in perl5 is basically the same mechanism as hardcoding a finite number of levels, it just makes clever use of inserting a variable to be expanded lazily as a regex, where that variable can be the main pattern itself.

Note,

jducoeur, that matching multi-character delimiters also complicates things a fair amount, as per the classic problem of matching C comments with regex.

From:

jducoeur

Yeah, but single-character really isn't a practical option. We're defining what amounts to a markup language here, so I need something odd enough to not generally appear in the text being marked up.

Really, the big question for the next generation (which is when all of this gets seriously reconsidered) is whether to stick with something like this (a C-ish symbolic language), or go over to the verbose side of the force and do it with XML. I go back and forth on that...

From:

learnedax.livejournal.com

Oh, I wasn't saying you should use a single-character delimiter, just providing another reason why the regex could be complicated to maintain - whereas a relatively simple (even actually iterative with pushdown) recursive descent would work fine. OTOH, it does look like you now have enough to push off a final decision on the parsing model.

Personally, I think verbosity would kill adoption. I'm also not sure that XML is a good metaphorical match for a query-and-render blob. Certainly the qualities you need to balance are simplicity and brevity.

From:

jducoeur

OTOH, it does look like you now have enough to push off a final decision on the parsing model.

Oh, none of this discussion had anything to do with what the *right* way to do it might be -- the plan is to scrap the whole parse stack for the Querki rewrite. This was mainly a question of whether there was a convenient enough short-term solution to get the functionality I'd like to have right now. (If there hadn't been, I simply would have done without.)

I think it's *very* likely that Querki will be based on a proper recursive-descent parser. But that'll be written from scratch, rather than hacked up from an older wiki.

Personally, I think verbosity would kill adoption.

Could be. The only reason I'm willing to contemplate it is that the intent is to do most query editing with a context-sensitive GUI. But I'll admit that the XML option doesn't thrill me: I'm mainly leaving it on the table to consider the options fairly until I have to make a decision. And it may be unwise to put too much weight behind the GUI idea until that's been proven, which argues for something easier to type.

(Really, the only problem with the current language is that it's not exactly a model of clarity, and neither are most of the similarly-concise options I've come up with...)

From:

mneme

the /e modifier isn't really relevant, becuase it's a modifier to the right-hand side of a substitution, not the regular expression.

This is actually a good textbook example of "things that are problematic in perl regular expressions. That said, perl regexpes -do- go far beyond textbook regexps, due to experimental constructs like (?{ code }) (embed perl code in a regexp) and more functionally, (??{ re }), etc (see the perlre manpage).

(??{re}) is probably a usable answer -- this worked well in my testcase, though it might need some tweaking:

my $re;
$re = qr/({\?\s*([^:]+)\s*:\s*((?:(??{$re})|.)*?)\s*\?})/;

($1 is the full matched expression, $2 is the left hand side, $3 is the right hand side).

From:

http://users.livejournal.com/merle_/

No, the /e modifier doesn't alter much in terms of the original problem. It does extend regular expressions way past what an FSA can do -- even a HMM can't simulate it.

I clearly need to read more about Perl5 regexps. Although perhaps not, because then I would be upset when the same functionality does not work in my editors...

From:

mneme

Only if you consider a substitute pattern a regular expression (as opposed to a match).

At least for me, the perl extended regular expressions (anything starting with (? ) avoid contaminating my memory, and I can use them without wanting to use them in emacs or whatnot. (But then, the big difference between perl-style regular expressions and many others -- that the magical meanings of parentheses and pipses are default, to be turned off by escaping them, not turned on, does help avoid confusion).

From:

http://users.livejournal.com/merle_/

Oh, I have very un-fond memories of the days of yore, when sed, grep, egrep, fgrep, and awk were all highly incompatible. *shudder*

From:

jducoeur

And it sounds like they're reconsidering some of the details for Perl 6...

From:

http://users.livejournal.com/merle_/

*shakes head* No wonder Perl 6 has been "in the wings" for several years now.

From:

jducoeur

Yep, that seems to be where it's all converging. Everything I've seen has indicated that the (??) operator is essential to making this work within the regex framework.

Thanks for providing a worked-out example, though -- I was preparing to spend the next couple of hours figuring that out myself, so this should speed up the process considerably...

From:

jducoeur

Yep -- plugged it in, and it worked right off the bat. (Revealed an unrelated bug that took me an hour to track down, but that had nothing to do with the regex.)

Thanks much!

From:

mneme

Excellent.

I'll note that while it's a pretty solid regexp, which -should- work in all or most cases, that I'd be very tempted to tighten it up were I to try to use it in production use, ie, something like:


  qr{ ( {\?\s*( [^:]+ ) \s* : \s* 
      ( (?: (??{$re} ) | [^?]+ | \?(?!}}+ })*)\s*\?}) 
  }x;

(which is to say, the alternation is "things matching this token, or things that can't close this token"). The problem with the original regexp is that it may be able to exhibit fairly bad worst-case alternation behavior, and regardless, isn't nearly as predictable as this one is in how it will match.

Flat | Top-Level Comments Only

S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Delimiter matching in Perl?

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

Profile

June 2025

Most Popular Tags

Page Summary

Active Entries

Style Credit

Expand Cut Tags