javascript - For a regex pattern, how to determine the length of longest string that matchs the pattern? -
having regex pattern regexpattern
, how can determine length of longest string matches regexpattern
.
the imaginary int longestlength(string pattern)
should work this:
assert.equals(longestlength("[abc]"), 1); assert.equals(longestlength("(a|b)c?"), 2); assert.equals(longestlength("d*"), int.maxvalue); // or throws nolongestlengthexception
although question in c#, both c# , javascript answers good.
it's pretty straightforward proper regex using operators ?
, *
, +
, |
, plus parentheses, character classes , of course ordinary characters. in fact \1
-style backreferences (which aren't part of formal definition of regex, , complicate questions regexes) can handled without problems.
a regex compact encoding of tree structure (similar how mathematical formulas compact encodings of tree structures describing arithmetic). between every adjacent pair of characters there implicit "follows" operator corresponds node 2 children, 1 being subregex left, , other being entire rest of regex; sequence of subregexes separated |
characters corresponds single "alt" node many children there alternatives (i.e., 1 more number of |
characters), while every other operator has single child (namely subregex operates on), , every ordinary character or character class has no children @ all. calculate maximum-length matching string, can bottom-up traversal of tree structure, @ each node greedily assigning length of longest string match node, given knowledge of longest strings match children.
the rules deciding length of longest string matches given node in tree are:
- follows(x, y) (
xy
): maxlen(x) + maxlen(y) - alt(a, b, ..., z) (
a|b|...|z
): max(maxlen(a), maxlen(b), ..., maxlen(z)) - maybe(x) (
x?
): maxlen(x) - rep(x) (
x*
) or posrep(x) (x+
): infinity - any other single character or character class (
[...]
): 1 \1
-style backreferences: maxlen corresponding parenthesised expression
one consequence presence of *
or +
anywhere (unless escaped or part of character class, obviously) cause infinity propagate tree until hits root.
examples
regex: abcd "function call syntax": follows(a, follows(b, follows(c, d))) tree: follows / \ follows / \ b follows / \ c d
a second example:
regex: (a|b|de)c? "function call" syntax: follows(alt(a, b, follows(d, e)), maybe(c)) tree: follows / \ alt maybe / | \ \ b follows c / \ d e
for second regex/tree, bottom-up traversal assign maxlen of 1 leaf nodes a, b, d, e , c; maxlen bottom follows() node 1 + 1 = 2; maxlen alt() node max(1, 1, 2) = 2; maxlen maybe node 1; maxlen topmost follows() node, , entire regex, 2 + 1 = 3.
if mean regexes allow other perl-style enhanced features, might more complicated, because locally optimal choice of length may lead globally suboptimal one. (i had thought might possible perl-style extensions make regexes turing complete, meaning in general impossible decide whether there any matching string -- the discussion here seems indicate not case, unless of course allow in ?{...}
construct.)
Comments
Post a Comment