jducoeur: (Default)
[personal profile] jducoeur
Here's a question for the Perl and/or regexp experts in the audience; all help is solicited.

ProWiki has a query language built in. Simplifying greatly, the syntax looks like this:
{? [query terms] : [display results] ?}
This translates roughly as "for each page that matches the given query terms, show the display results, interpolating the properties of the page". That all works nicely, and is at the heart of what ProWiki does.

The problem is, I'd really like to be able to do this recursively. That is, I'd like to be able to construct a query like (to take today's example, one of many):
{?~Faction : %%Name%% -- {?~Character && Faction==%%PAGENAME%% : %%Name%% ?} ?}
That would translate as something like, "For each Faction, display the Faction's Name, and then for each Character in that Faction, display the Character's Name". Basically, nested foreach loops.

That's conceptually straightforward, but I'm stuck on how to parse it. ProWiki, being based on UseMod, uses Perl regex for its parsing. That mostly works fine, but I can't figure out how to get it to work recursively. I need to find the *matching* {? ?} pairs, extracting as plaintext any pairs that might be contained inside them. (The Perl code itself will then deal with the recursion into the plaintext subexpression.)

Can this be done straightforwardly in regex? It seems like a fairly common problem -- it's basically a fancy variant of parenthesis matching -- but I'm not hip enough to regex to see the answer. It's not simply a matter of matching first and last delimiters in the string, since a given page might contain several unrelated top-level expressions; therefore, I need to find the genuinely *matching* delimiters.

I know there are a bunch of Perl gurus out there, so if you can outline the solution to me (even the solution to the basic parenthesis-matching problem would probably show me how to do it), I'd be grateful. Thanks...

(no subject)

Date: 2007-01-02 06:32 pm (UTC)
mneme: (Default)
From: [personal profile] mneme
the /e modifier isn't really relevant, becuase it's a modifier to the right-hand side of a substitution, not the regular expression.

This is actually a good textbook example of "things that are problematic in perl regular expressions. That said, perl regexpes -do- go far beyond textbook regexps, due to experimental constructs like (?{ code }) (embed perl code in a regexp) and more functionally, (??{ re }), etc (see the perlre manpage).

(??{re}) is probably a usable answer -- this worked well in my testcase, though it might need some tweaking:

my $re;
$re = qr/({\?\s*([^:]+)\s*:\s*((?:(??{$re})|.)*?)\s*\?})/;

($1 is the full matched expression, $2 is the left hand side, $3 is the right hand side).


(no subject)

Date: 2007-01-02 06:45 pm (UTC)
From: [identity profile] http://users.livejournal.com/merle_/
No, the /e modifier doesn't alter much in terms of the original problem. It does extend regular expressions way past what an FSA can do -- even a HMM can't simulate it.

I clearly need to read more about Perl5 regexps. Although perhaps not, because then I would be upset when the same functionality does not work in my editors...

(no subject)

Date: 2007-01-02 08:03 pm (UTC)
mneme: (Default)
From: [personal profile] mneme
Only if you consider a substitute pattern a regular expression (as opposed to a match).

At least for me, the perl extended regular expressions (anything starting with (? ) avoid contaminating my memory, and I can use them without wanting to use them in emacs or whatnot. (But then, the big difference between perl-style regular expressions and many others -- that the magical meanings of parentheses and pipses are default, to be turned off by escaping them, not turned on, does help avoid confusion).

(no subject)

Date: 2007-01-02 09:09 pm (UTC)
From: [identity profile] http://users.livejournal.com/merle_/
Oh, I have very un-fond memories of the days of yore, when sed, grep, egrep, fgrep, and awk were all highly incompatible. *shudder*

(no subject)

Date: 2007-01-03 01:10 am (UTC)
From: [identity profile] http://users.livejournal.com/merle_/
*shakes head* No wonder Perl 6 has been "in the wings" for several years now.

(no subject)

Date: 2007-01-03 02:22 pm (UTC)
mneme: (Default)
From: [personal profile] mneme
Excellent.

I'll note that while it's a pretty solid regexp, which -should- work in all or most cases, that I'd be very tempted to tighten it up were I to try to use it in production use, ie, something like:


  qr{ ( {\?\s*( [^:]+ ) \s* : \s* 
      ( (?: (??{$re} ) | [^?]+ | \?(?!}}+ })*)\s*\?}) 
  }x;



(which is to say, the alternation is "things matching this token, or things that can't close this token"). The problem with the original regexp is that it may be able to exhibit fairly bad worst-case alternation behavior, and regardless, isn't nearly as predictable as this one is in how it will match.

Profile

jducoeur: (Default)
jducoeur

July 2025

S M T W T F S
  12345
6789101112
13141516171819
20212223242526
27 28293031  

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags

No cut tags