jducoeur

Here's a question for the Perl and/or regexp experts in the audience; all help is solicited.

ProWiki has a query language built in. Simplifying greatly, the syntax looks like this:

{? [query terms] : [display results] ?}

This translates roughly as "for each page that matches the given query terms, show the display results, interpolating the properties of the page". That all works nicely, and is at the heart of what ProWiki does.

The problem is, I'd really like to be able to do this recursively. That is, I'd like to be able to construct a query like (to take today's example, one of many):

{?~Faction : %%Name%% -- {?~Character && Faction==%%PAGENAME%% : %%Name%% ?} ?}

That would translate as something like, "For each Faction, display the Faction's Name, and then for each Character in that Faction, display the Character's Name". Basically, nested foreach loops.

That's conceptually straightforward, but I'm stuck on how to parse it. ProWiki, being based on UseMod, uses Perl regex for its parsing. That mostly works fine, but I can't figure out how to get it to work recursively. I need to find the *matching* {? ?} pairs, extracting as plaintext any pairs that might be contained inside them. (The Perl code itself will then deal with the recursion into the plaintext subexpression.)

Can this be done straightforwardly in regex? It seems like a fairly common problem -- it's basically a fancy variant of parenthesis matching -- but I'm not hip enough to regex to see the answer. It's not simply a matter of matching first and last delimiters in the string, since a given page might contain several unrelated top-level expressions; therefore, I need to find the genuinely *matching* delimiters.

I know there are a bunch of Perl gurus out there, so if you can outline the solution to me (even the solution to the basic parenthesis-matching problem would probably show me how to do it), I'd be grateful. Thanks...


From: http://users.livejournal.com/merle_/

In normal regular expressions, I don't believe you can. It's possible that some languages provide recursion, and Perl's /e modifier goes well beyond what a "normal" regexp can do. As metahacker noted, if you know the number of levels, you can hardcode it, but it won't do an infinite number of levels of recursion.
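
For illustration, here is a minimal sketch of the hardcoded approach. It assumes plain single-character parentheses rather than ProWiki's {? ?} delimiters, and `nested_paren_pattern` is a hypothetical helper, not something from the thread. Each level of nesting is written into the pattern explicitly, so anything deeper than the chosen depth fails to match.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern for parentheses nested at most
# $depth levels deep by writing each level out explicitly.
sub nested_paren_pattern {
    my ($depth) = @_;
    my $pat = qr/\([^()]*\)/;                         # depth 1: no inner parentheses
    $pat = qr/\((?:[^()]|$pat)*\)/ for 2 .. $depth;   # wrap one more level each time
    return $pat;
}

my $two   = nested_paren_pattern(2);
my $three = nested_paren_pattern(3);

my $deep = "(a(b(c)))";                               # three levels of nesting
print "depth 2: ", ($deep =~ /\A$two\z/   ? "match" : "no match"), "\n";
print "depth 3: ", ($deep =~ /\A$three\z/ ? "match" : "no match"), "\n";
```

The depth-2 pattern rejects the three-level string and the depth-3 pattern accepts it, which is the limitation described above: the pattern only ever handles the number of levels it was built for.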

I'm surprised, though, that Perl5 can match strings of validly nested parentheses. That was the canonical example in my coursework of "something a regexp cannot do, because an FSA cannot do it".

learnedax.livejournal.com

The paren-matching in perl5 is basically the same mechanism as hardcoding a finite number of levels; it just makes clever use of inserting a variable to be expanded lazily as a regex, where that variable can be the main pattern itself.
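
A sketch of that trick, following the balanced-parentheses example in the perlre manpage; the variable name `$balanced` is just illustrative. The (??{ ... }) construct is only expanded when the match actually reaches it, so the pattern can refer to the variable it is being assigned to.

```perl
use strict;
use warnings;

my $balanced;                    # declared first so the pattern can refer to itself
$balanced = qr/
    \(                           # an opening parenthesis
    (?:
        (?> [^()]+ )             # a run of non-parenthesis text, without backtracking
      |
        (??{ $balanced })        # or a nested group, expanded only when reached
    )*
    \)                           # the matching closing parenthesis
/x;

my @groups = "f(a(b(c))) g(d) h(e" =~ /($balanced)/g;
print "$_\n" for @groups;        # the unterminated "(e" is not reported
```

Perl 5.10 and later also provide recursive patterns such as (?R) and (?1), which express the same idea without embedding code in the regex.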

Note, jducoeur, that matching multi-character delimiters also complicates things a fair amount, as per the classic problem of matching C comments with regex.
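
To make the C-comment problem concrete, here is a small sketch; the variable names are illustrative only. Because the closer "*/" is two characters, a negated character class cannot say "anything except the closer", so the naive pattern breaks on a comment that contains a lone "*". The usual fixes are a lazy quantifier or an "unrolled" pattern.

```perl
use strict;
use warnings;

my $naive    = qr{/\*[^*]*\*/};                        # wrong: forbids any "*" in the body
my $lazy     = qr{/\*.*?\*/}s;                         # shortest run up to the first "*/"
my $unrolled = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/};     # same result without lazy matching

my $src = "x /* one */ y /* a*b */ z /* two **/";

my $naive_ok     = "/* a*b */" =~ $naive ? 1 : 0;      # the lone "*" defeats this one
my @lazy_hits    = $src =~ /($lazy)/g;
my @unrolled_hits = $src =~ /($unrolled)/g;

print "naive matched: $naive_ok\n";
print "$_\n" for @unrolled_hits;
```

The same issue applies to {? and ?}: a "?" or "}" on its own is ordinary text, and only the two-character sequence counts as a delimiter.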

jducoeur

Yeah, but single-character really isn't a practical option. We're defining what amounts to a markup language here, so I need something odd enough to not generally appear in the text being marked up.

Really, the big question for the next generation (which is when all of this gets seriously reconsidered) is whether to stick with something like this (a C-ish symbolic language), or go over to the verbose side of the force and do it with XML. I go back and forth on that...

Oh, I wasn't saying you should use a single-character delimiter, just providing another reason why the regex could be complicated to maintain - whereas a relatively simple (even actually iterative with pushdown) recursive descent would work fine. OTOH, it does look like you now have enough to push off a final decision on the parsing model.
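
As a rough sketch of the iterative, pushdown-style alternative: the code below assumes only that {? opens and ?} closes an expression, and `top_level_queries` is a hypothetical helper rather than anything in ProWiki. It scans for delimiters and keeps a stack of open positions, so no recursive regex is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the top-level {? ... ?} expressions in $text,
# tracking open delimiters on an explicit stack.
sub top_level_queries {
    my ($text) = @_;
    my (@found, @starts);
    while ($text =~ /(\{\?|\?\})/g) {
        my $end = pos($text);                 # offset just past this delimiter
        if ($1 eq '{?') {
            push @starts, $end - 2;           # remember where this one opened
        }
        elsif (@starts) {
            my $start = pop @starts;
            # Only report an expression once its outermost opener is closed.
            push @found, substr($text, $start, $end - $start) unless @starts;
        }
        # A stray closer with nothing open is simply ignored.
    }
    return @found;
}

my @queries = top_level_queries("x {?~A : a {?~B : b ?} ?} y {?~C : c ?}");
print "$_\n" for @queries;
```

Each top-level expression comes back as plain text with its nested expressions intact, ready to be fed back through the same function.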

Personally, I think verbosity would kill adoption. I'm also not sure that XML is a good metaphorical match for a query-and-render blob. Certainly the qualities you need to balance are simplicity and brevity.

OTOH, it does look like you now have enough to push off a final decision on the parsing model.

Oh, none of this discussion had anything to do with what the *right* way to do it might be -- the plan is to scrap the whole parse stack for the Querki rewrite. This was mainly a question of whether there was a convenient enough short-term solution to get the functionality I'd like to have right now. (If there hadn't been, I simply would have done without.)

I think it's *very* likely that Querki will be based on a proper recursive-descent parser. But that'll be written from scratch, rather than hacked up from an older wiki.

Personally, I think verbosity would kill adoption.

Could be. The only reason I'm willing to contemplate it is that the intent is to do most query editing with a context-sensitive GUI. But I'll admit that the XML option doesn't thrill me: I'm mainly leaving it on the table to consider the options fairly until I have to make a decision. And it may be unwise to put too much weight behind the GUI idea until that's been proven, which argues for something easier to type.

(Really, the only problem with the current language is that it's not exactly a model of clarity, and neither are most of the similarly-concise options I've come up with...)

mneme

The /e modifier isn't really relevant, because it's a modifier to the right-hand side of a substitution, not the regular expression.
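
A tiny illustration of that point, using made-up sample text: the match side is identical in both substitutions below, and /e only changes how the replacement is produced.

```perl
use strict;
use warnings;

my $text = "3 apples and 12 pears";

# Without /e the replacement is an interpolated string.
(my $plain = $text) =~ s/(\d+)/$1 * 2/g;

# With /e the replacement is evaluated as Perl code for each match.
(my $evaluated = $text) =~ s/(\d+)/$1 * 2/ge;

print "$plain\n";
print "$evaluated\n";
```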

This is actually a good textbook example of "things that are problematic in perl regular expressions". That said, perl regexps -do- go far beyond textbook regexps, due to experimental constructs like (?{ code }) (embed perl code in a regexp) and, more functionally, (??{ re }), etc. (see the perlre manpage).

(??{re}) is probably a usable answer -- this worked well in my testcase, though it might need some tweaking:

my $re;  # declared first so the pattern can refer to itself via (??{$re})
$re = qr/(\{\?\s*([^:]+)\s*:\s*((?:(??{$re})|.)*?)\s*\?\})/;

($1 is the full matched expression, $2 is the left hand side, $3 is the right hand side).
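
Here is a sketch of how this pattern might drive the recursion into subexpressions, using the example query from the original post. The literal braces are escaped here as an editorial precaution, and `outline` is a hypothetical helper that only lists each query's terms by nesting depth; real rendering code would do its page lookups where the comments indicate.

```perl
use strict;
use warnings;

my $re;
$re = qr/(\{\?\s*([^:]+)\s*:\s*((?:(??{$re})|.)*?)\s*\?\})/;

# Hypothetical helper: list each query's terms, indented by nesting depth.
sub outline {
    my ($text, $depth) = @_;
    my @lines;
    while ($text =~ /$re/g) {
        my ($lhs, $rhs) = ($2, $3);   # copy before the recursive call reuses $2 and $3
        $lhs =~ s/\s+$//;             # [^:]+ also swallows the space before the colon
        # A real implementation would run the query in $lhs here, then
        # render $rhs once per matching page.
        push @lines, ('  ' x $depth) . $lhs, outline($rhs, $depth + 1);
    }
    return @lines;
}

my $page = 'Intro {?~Faction : %%Name%% -- '
         . '{?~Character && Faction==%%PAGENAME%% : %%Name%% ?} ?} outro';
my @outline = outline($page, 0);
print "$_\n" for @outline;
```

Note that "." does not match a newline by default, so a right-hand side that spans several lines would need the /s modifier.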

No, the /e modifier doesn't alter much in terms of the original problem. It does extend regular expressions way past what an FSA can do -- even an HMM can't simulate it.

I clearly need to read more about Perl5 regexps. Although perhaps not, because then I would be upset when the same functionality does not work in my editors...

Only if you consider a substitute pattern a regular expression (as opposed to a match).

At least for me, the perl extended regular expressions (anything starting with "(?") avoid contaminating my memory, and I can use them without wanting to use them in emacs or whatnot. (But then, the big difference between perl-style regular expressions and many others -- that the magical meanings of parentheses and pipes are on by default, to be turned off by escaping them, not turned on -- does help avoid confusion.)

Oh, I have very un-fond memories of the days of yore, when sed, grep, egrep, fgrep, and awk were all highly incompatible. *shudder*

And it sounds like they're reconsidering some of the details for Perl 6...

*shakes head* No wonder Perl 6 has been "in the wings" for several years now.

Yep, that seems to be where it's all converging. Everything I've seen has indicated that the (??) operator is essential to making this work within the regex framework.

Thanks for providing a worked-out example, though -- I was preparing to spend the next couple of hours figuring that out myself, so this should speed up the process considerably...

Yep -- plugged it in, and it worked right off the bat. (Revealed an unrelated bug that took me an hour to track down, but that had nothing to do with the regex.)

Thanks much!

Excellent.

I'll note that while it's a pretty solid regexp, which -should- work in all or most cases, I'd be very tempted to tighten it up were I to use it in production, i.e., something like:

  qr{ ( \{\? \s* ( [^:]+ ) \s* : \s*
      ( (?: (??{ $re }) | [^?\{]+ | \{(?!\?) | \?(?!\}) )* ) \s* \?\} )
  }x;

(which is to say, the alternation is "things matching this token, or things that can neither open nor close this token"). The problem with the original regexp is that it may exhibit fairly bad worst-case alternation behavior, and regardless, isn't nearly as predictable as this one is in how it will match.

Delimiter matching in Perl?

