javascript - For a regex pattern, how to determine the length of longest string that matchs the pattern? -


having regex pattern regexpattern, how can determine length of longest string matches regexpattern.

the imaginary int longestlength(string pattern) should work this:

assert.equals(longestlength("[abc]"), 1); assert.equals(longestlength("(a|b)c?"), 2);  assert.equals(longestlength("d*"), int.maxvalue); // or throws nolongestlengthexception 

although question in c#, both c# , javascript answers good.

it's pretty straightforward proper regex using operators ?, * , + , |, plus parentheses, character classes , of course ordinary characters. in fact \1-style backreferences (which aren't part of formal definition of regex, , complicate questions regexes) can handled without problems.

a regex compact encoding of tree structure (similar how mathematical formulas compact encodings of tree structures describing arithmetic). between every adjacent pair of characters there implicit "follows" operator corresponds node 2 children, 1 being subregex left, , other being entire rest of regex; sequence of subregexes separated | characters corresponds single "alt" node many children there alternatives (i.e., 1 more number of | characters), while every other operator has single child (namely subregex operates on), , every ordinary character or character class has no children @ all. calculate maximum-length matching string, can bottom-up traversal of tree structure, @ each node greedily assigning length of longest string match node, given knowledge of longest strings match children.

the rules deciding length of longest string matches given node in tree are:

  • follows(x, y) (xy): maxlen(x) + maxlen(y)
  • alt(a, b, ..., z) (a|b|...|z): max(maxlen(a), maxlen(b), ..., maxlen(z))
  • maybe(x) (x?): maxlen(x)
  • rep(x) (x*) or posrep(x) (x+): infinity
  • any other single character or character class ([...]): 1
  • \1-style backreferences: maxlen corresponding parenthesised expression

one consequence presence of * or + anywhere (unless escaped or part of character class, obviously) cause infinity propagate tree until hits root.

examples

regex: abcd "function call syntax": follows(a, follows(b, follows(c, d))) tree:  follows  /  \  follows     /  \    b  follows        /  \       c    d 

a second example:

regex: (a|b|de)c? "function call" syntax: follows(alt(a, b, follows(d, e)), maybe(c)) tree:       follows      /     \    alt     maybe   / | \        \   b follows   c        /  \       d    e 

for second regex/tree, bottom-up traversal assign maxlen of 1 leaf nodes a, b, d, e , c; maxlen bottom follows() node 1 + 1 = 2; maxlen alt() node max(1, 1, 2) = 2; maxlen maybe node 1; maxlen topmost follows() node, , entire regex, 2 + 1 = 3.

if mean regexes allow other perl-style enhanced features, might more complicated, because locally optimal choice of length may lead globally suboptimal one. (i had thought might possible perl-style extensions make regexes turing complete, meaning in general impossible decide whether there any matching string -- the discussion here seems indicate not case, unless of course allow in ?{...} construct.)


Comments

Popular posts from this blog

OpenCV OpenCL: Convert Mat to Bitmap in JNI Layer for Android -

android - org.xmlpull.v1.XmlPullParserException: expected: START_TAG {http://schemas.xmlsoap.org/soap/envelope/}Envelope -

python - How to remove the Xframe Options header in django? -