Reputation: 23
Is the regex of google DiffMatchPatch in form (A*|B*|C*|D*)* ?
private static final String TAG_REGEX = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>"; //in case the texts being compared contain XML or HTML tags
If I understand correctly, it should for some input strings result in stack overflow, as described here http://www.regular-expressions.info/catastrophic.html , Regex gone wild: java.util.regex.Pattern matcher goes into high CPU loop ?
Can someone provide the example that will blow this regex ?
Upvotes: 1
Views: 102
Reputation: 56819
The regex won't run into catastrophic backtracking, since the regex is non-ambiguous.
However, if you are using Oracle JVM or OpenJDK, then the code above subjects to StackOverflowError
when there are a large number of characters outside single or double quoted strings, especially HTML with unquoted attributes.
For example:
String input = "<a href=https://www.google.com/search?tbs=sbi:AMhZZisZE7GxcvzYV20Dr9-lbMIKPcIj1un1cusubzWm8vKUA4pt3wWZ_1T8F5DynP15tfGTUTz5nqFoQWqWtde44qYBjMwpISXp2yCcuxHNiOBQX96Ol_1Ipf1w-BNNwEteGZUxl6HX4niNUAvVJzMqyfs3p0CVyF9Vf5_1NnYssdzVEsCBpJDYE-g759Icjtlm8vZXWvq3RUGTLS7FOrkyTWQ1Ym9COR1jtJzAbLMhJT_1bMHM5i6SfUpspG7nW_10-XiHnNlW_10D7ApaO10mLaAFdRPm1qVjLBP0-R9Fz0wgeyEad28JXxKY8xBuAoOM5oW5SwLibZYFdMv_1OwgsHSjXZI6xcW7O7vgpcDUA-r5PjCFOm4mxBie2I0QylHAE4fqWELskrpg6MRCBo4lcS9mAceC4PPGWnBQi95s4Bg2qidUHOZ6FzXoGTFPSBJ6C65vVpijAS48qwUdT0LnyGyR4HgVoyXztnSVd18yHsNOMyhYp7SO3-jYYSZFOEoE8cVtEbBUlPEBLr6VsKEBf8pqCGHUS4T7dy5tlOIMXdT93r4MjilMOLhQq286l8oZhp9rNC1AxKuHvxQE4sh5kTzEs41LhF0sMm2PYPRB-vxTTD3RnPt47AZAkTsVJQAtgv-3_1optn69_1IDz0GHn0dRGR3Sd5Ekcx4SHb5XHGFBDJ-ejoCZ3IWzSo_1KMadJrWNOGojHkeX0lVkVeCa1N_111oUWxCqJ8ejUZfwMi1t7AgU7gdMmPC58oUlnNhwYqWxkOFN6YujVlFA7jDerSSIRL-TKdIrlc1egyiMYVZ-vSvpmKF2N7Qm4hcUEdwr9VD87Q27R2EfhsczHflOBZ7ensbnVYTzy6TgVJu8Wd5BortukJ860HfyexenbUNM9PtdwEDzfiNblJmUcahpgOPvfQlZjPZniLoGRO2KhMCrglu3yKD8ndRZu_1_1u_1ID_1xtsoJwCziNPTvaE5n2-bypPlY_1hFcMHB8d0dlOqF2D-YJoZu2qmNZzKYm0XkAgdX9rUEo6JtJdytvS5b46JQGvfGcvKFfWxeqbxc3rvE3uUcW1yK-4l5iz8dzQ42PoWxENXV5J0lAP2apcTgS068tKxaw8OSw5W1AWMLNY0WSeI7YeLD59xUlMXAyBSR25FtQBmV86taHP_1IxeNTwNF8pz20LLxmDT6CsG9LoXEEHGUPzr5rM2DMcgUw3_16_1Rp0zw_1h_1Ju-lgJ5bMP2KDxrSEKZ7neLdqPAfejMiAUV3_1miOk_1cOs8Gu0QKizmWt9yX7_14O9yrE1YQ6193rVnUqv7HIZ--WNzOxiVKOYaE_1Hy4gseyS0vh32oGlQBJslmk_1jXCi346Ffa-F2_1e6zfyjo-rMYw3d_14SYFZqXZL8dtIXJNQliBtsfn7dCInzkegkrhYRFcyLhppMWmZDurwMNhKQYbEZLC9lh8OabruxSTH1gkaDi0LlW0SpzJsXSwXSwgaRzFVTVNk2op4vPtVFs7NwD_187w2fDNJcIkkaIQ3cXQeQtzmhdvhLqhrudx9D7nAwdNwFM0gD2PVV2kUmc6YTJ6_1UQo2jS4ENycmXgg&btnG=Search%20by%20image&hl=en>";
The cause of the issue is analyzed in this answer of mine.
One way to fix this issue is to make the outer repetition possessive:
"<(\"[^\"]*\"|'[^']*'|[^'\">])*+>"
^^
The possessive quantifier is implemented with a loop, so there will be no StackOverflowError
.
Another way to fix this issue is to match unquoted characters [^'\">]
as many as possible in a single iteration of outer repetition:
"<(\"[^\"]*\"|'[^']*'|[^'\">]++)*>"
^^
Possessive quantifier is mandatory in this case, since the regex will have catastrophic backtracking otherwise.
Upvotes: 1