Reputation: 127
For testing purposes on a project I'm working on, I have a need to, if given a regular expression, randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:
^[abcd]d+
Then I should be able to generate strings such as:
hnbbad
uduebbaef
9f8;djfew
skjcc98332f
...each of which does NOT match the regex, but NOT generate:
addr32
bdfd09usdj
cdddddd-9fdssee
...each of which DO. In other words, I want something like an anti-Xeger.
Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.
Upvotes: 7
Views: 2195
Reputation: 1
Can we reduce the infinite number of possibilities, by restricting to generate strings from a give character set.
For example, I can define the character set, [QWERTYUIOP!@#$%^%^&*))_]
and all the strings I generate randomly should be born from this set. That way we can reduce the infinite nature of this problem?
In fact even I am looking for a utility like this, preferably in Python.
Upvotes: 0
Reputation: 34395
No this is impossible. There are an infinite number of regexes that match every string in the known universe. For example:
/^/
/.*/
/[^"\\]*(\\.[^"\\]*)*$/
etc.
This is because all these regexes can match nothing at all (which is something all strings have!)
Upvotes: 2
Reputation: 838706
My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.
For example, proving whether a number is prime is believed to be a hard to solve mathematical problem. The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:
(?!(..+)\1+$).{10000}
I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007
will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.
I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.
But having said that if you have a magical library that generates text that matches for any arbitrary regular expression then all you need to do is generate a regular expression that matches all the strings that don't match your original expression.
Luckily this is possible using a negative lookahead:
^(?![\s\S]*(?:^[abcd]d+))
If you are willing to change the requirements to only allow a limited subset of regular expressions then you can negate the regular expression by using boolean logic. For example if ^[abcd]d+
becomes ^[^abcd]|^[abcd][^d]
. It is then possible to find a valid input for this regular expression in reasonable time.
Upvotes: 6
Reputation: 2423
I would do a loop, generating random combinations of random length, and test if matches the regexp. Repeat the loop until a not-match situation is reached.
Obviously, this would be inefficient. Are you sure you cannot invert the regexp and generate a match on the inverted regexp?
Upvotes: 3