user3601725
user3601725

Reputation: 515

Can a regex be 'reverse' engineered to output each of it's patterns into a list or table?

I have some complex regex patterns that are used by a php app that utilizes them in turn for some complex logic output. I'd like to do some testing of the constituent parts so I can identify what fails, in other words is it a regex and if so which part of the regex as I've found some work and some don't

So can I take for instance:

and have it processed to return a list/table such as:

So I can then run a test on each output against the regex to see which returns a value? Clearly in the example all would work but the regex's I have have many 'pipes' and branches and I'd just like to get the entire list and be able to test to make sure ALL of them work.

Upvotes: 3

Views: 1512

Answers (3)

jgmjgm
jgmjgm

Reputation: 4695

You can generate possible matches but it's not a good way to test regex. Whether you can generate all possible matches depends on the regex. I don't believe it's the problem people make it out to be though as you can limit output in a variety of ways. A regex pattern can easily be infinite or have a huge number of permutations. This means you need to do something for those cases.

I debug my regular expressions by reading more carefully, checking the manual, running tests on the command line and breaking them up into parts if I have to.

I don't know of any libraries for generating possible regex matches given a pattern. It's not very useful for testing unless you want something such as to create the first and last possible match. That or unless you have one of those giant single line huge nesting regexes such as the one that parses HTML or the even more complicated one for parsing email addresses. A lot of times people try to pack things into one regex that are better spread out for readability and reusability.

If you generate the first possible match, that might be useful for testing. You can also potentially extend that and not normally get too many results. It should not have problems with infinite patterns.

An infinite pattern will happen with +, *, {n,}. You can still have very large patterns without using any of those. {hugenumber,} is an obvious example. [^something] can also create a lot of permutations as well as [a-z]{8} which is 26^8. Most people before the regex do a count of the length so you can and should use that. It is difficult though as the length is distributed across each infinite operator and makes what's below a lot harder (unless you short circuit at the max length but you have to go far down a tree sometimes or have to traverse it first to work out those bounds if you do it like that). If you have a regex like this:

 a+b+c+

With a maximum length of 5 it expands to:

 a{3}b{1}c{1}
 a{1}b{3}c{1}
 a{1}b{1}c{3}
 a{2}b{2}c{1}
 a{2}b{1}c{2}

That'll increase exponentially as the max length increases.

If you have +, *, {n,} etc a cheap work around is to change + to {1,max}, * to {0,max} and {n,} to {n,max} where max is something reasonable. In reality, it's pretty rare you would expect an infinite string, there's got to be some reasonable limit, even if really high. It's not to max it so you can print your entire range but so that you at least have an upper bound to work with that's workable. You can still work somewhat with inf but it's not worth it.

I don't do this to test regex. Instead it works better as a generator for sequences. This can be used for password cracking, code generation, boiler plate generation, automation, etc. Care has to be taken to make sure the use is justified and not creating too much bloat. As in if you think you need to create a huge sequence for something, you might be using this as a way to achieve that level of bloat when there might be another approach to the problem.

Regex is under utilised in my opinion for generating ranges. For example, not many languages would let you express a range of all padded numbers of a given length with as little as:

 \d{6}

This will generate 000000 to 999999. That's a million entries times 6 characters so 6MB at least. You can see with this that you'll rapidly create something that consumes a lot of memory. Wrapping the regex processor in an iterator is a good idea, especially one that can skip positions which should be possible with regex. To handle infinite sequences or huge, it's best to have a way to configure an implicit limit or maximum.

To start you'll want to parse your regex. You can do this yourself but if you want it for testing you'll want to use the parser for your specific platform and that is likely to be difficult as the internals are unlikely to be exposed to higher level languages. A generic parser can still be of use some as long as you're not doing anything too fancy.

If you want to make something that will support all kinds of complex possibilities in regex such as backreferences, etc then I would consider abandoning hope now.

Once you process it, the handing is quite simple:

 (Bob|Robert) (&|and) (Sally|Jane)

 [Bob,Robert] * [ ] * [&,and] * [ ] * [Sally,Jane]

It's the Cartesian products of the unions. It's a bit more complex when nested further.

 abc?[zy]((xy|yz)(ab|ba)|xx)
 [ [ab, abc] * [z, y] * [ [ [xy, yz] * [ab, ba] ], [xx] ] ]
 ab * [, c] * [z, y] * [ [xy, yz] * [ab, ba], xx ]

The second alternative is if you explicitly encode the empty set which might be recommended to avoid bloat. You can represent the empty set a few ways. [, c] would break if [] but I'm not sure that [] is possible except for an empty regex or strange cases of malformed regex. You might also want to watch out for ambiguous regexes.

First match is just take the first of each range:

 [ [ab] * [z] * [ [xy] * [ab] ] ]
 [ [ab] * [z] * [xy] * [ab] ]
 [ ab * z * xy * ab ]
 [ ab, z, xy, ab ]
 abzxyab

Reduce all down to one element, you can also do other indexes like random, as long as it reduces down to one. When all sets are length 1, the cartesian product operation becomes an append operation. I've used * instead of x for cartesian product as it is less confusing mixed with alpha characters.

You can calculate the number of combinations as such:

 1 * 2 * 2 * [ 2 * 2 + 1 ]
 2 * 2 * [ 2 * 2 + 1 ]
 4 * 5
 20

That is, you reduce each range to it's length. At this point that * operator becomes a real multiplication operator rather than cartesian product. A separator becomes a plus. You can see with this multiplication that the number of combinations can grow very quickly.

In your case it's 2*2*2 (8).

However, if you use the i flag for case insensitivity, it becomes:

 (2^3 + 2^6) * (1 + 2^3) * (2^5 + 2^4)
 31104

You can see that an AST for basic regex can be quite easy to build. It's nested arrays (arbitrary dimensions) with arbitrary length.

If you don't have infinite matches or have an implicit limit then you can convert everything into iterators. You'll want range iterators for things such as a-z.

The easiest way is to have aggregate iterators using an iterator interface. In your high level language you should have an easy way to wrap arrays into an iterator interface if they don't automatically implement one. Your high level language will likely provide an aggregate iterator as well.

You can recursively wrap each node in the tree like that (depth first is easiest) and then have a single iterator at the top that will represent 0 to the total number of permutations in your regex.

That's the simplest approach but if you dive in further and make more complex abstraction it's also possible dereference any part or create masks, etc. At this point it's not as simply. You might want to try variants in a certain order, etc.

A trick to work against an existing parser is to brute force. It's not a perfect art but can work in simply cases. If you want to resolve something like [a-zA-Z] you can try against your regex engine with all characters to see which match. This is not good at all when it comes to different character sets though. Only ascii regex. Character set is something important to consider. In unicode two dots alone can lead to millions of combinations.

Contrary to the impression being given here, while a regex expression can represent a very large search space, you do not have to explore all of it or generate all of it. You can map the regex to its parameters and jump to any specific part of it. You do not have to generate the entire search space to explore it.

Much of this you should be able to do in your head, as it is required to read and write regex that you understand. Your problem here is likely your white space. Too many people ignore whitespace as unimportant and fall into that trap. I recommend you set your IDE to subtly display whitespace.

Example from script I use that utilise the same concepts as above:

 AST (simplified, ors flattened, single child scopes flattened parent, adjacent literals merged, consecutive ors into ranges):
 ["scope",
     ["literal", "x"],
     ["repeat", [0, 3], ["literal",  "y"]],
     ["literal", "zfff"],
     ["range", "a", "d"]
     ["repeat",[4, 5], ["range", "0", "9"]],
     ["literal", "n"]
 ]
 Expression:                 xy{,3}z(fff)(a|b|c|d)\d{4,5}n
 Combinations:               1760000
 Position 0:                 xzfffa0000n
 Position 1:                 xzfffa0001n
 Position 2:                 xzfffa0002n
 Position Middle -1:         xyzfffd99999n
 Position Middle:            xyyzfffa0000n
 Position Middle + 1:        xyyzfffa0001n
 Position Last:              xyyyzfffd99999n

Not that after a while regex patterns for matching diverges from regex for pattern generation. You'll not need many of the complex features in regex but you will want to add complex features not found in it.

This one demonstrates that you don't need to traverse the whole search space (it would be impossible to run on anything but the most powerful super computers otherwise):

AST (full):
["scope",
    ["repeat", [9],
        ["or",
            ["scope", ["range", "a", "z"]],
            ["scope", ["range", "A", "Z"]]
        ]
    ]
]
Expression:                  (\l|\L){9}
Combinations:                2779905883635712
Position 0:                  aaaaaaaaa
Position 1:                  aaaaaaaab
Position 2:                  aaaaaaaac
Position Middle -1:          zZZZZZZZZ
Position Middle:             Aaaaaaaaa
Position Middle + 1:         Aaaaaaaab
Position Last:               ZZZZZZZZZ

Note that without arbitrary precision integers much larger than this will break.

You can map your AST to objects to make life easier, where the first parameter is the object type and the others are the initialisation parameters. Your AST is almost like a configuration for dependency injection.

Here is the result for your regex:

 AST (full):
 ["scope",
     ["or", ["scope", ["literal", "Bob"]], ["scope", ["literal", "Robert"]]],
     ["literal", " "],
     ["or", ["scope", ["literal", "&"]], ["scope", ["literal", "and"]]],
     ["literal", " "],
     ["or", ["scope", ["literal", "Sally"]], ["scope", ["literal", "Jane"]]]
 ]
 Expression:                 (Bob|Robert) (&|and) (Sally|Jane)
 Combinations:               8
 Position 0:                 Bob & Sally
 Position 1:                 Bob & Jane
 Position 2:                 Bob and Sally
 Position Middle -1:         Bob and Jane
 Position Middle:            Robert & Sally
 Position Middle + 1:        Robert & Jane
 Position Last:              Robert and Jane

There a lot of other things you can do like give it the start of a regex and have it reverse that to an initial starting index.

Be wary of naysayers.

Upvotes: 0

briantist
briantist

Reputation: 47792

I think Ed Cottrell answered your question well as stated.

My interpretation is that your ultimate goal is to make it easier to read/interpret/debug your expression.

To that end, maybe you could use free spacing mode.

That would let you separate out groupings and sub expressions onto separate lines, complete with comments.

Upvotes: 1

elixenide
elixenide

Reputation: 44833

Don't do this. For one thing, it's nearly impossible. It's literally impossible for an open-ended expression like .+ or \d+. Bob.+Jane, for example, matches infinitely many strings. So do many more subtle examples, like Hello World!+ or even \d{3,} (three or more digits).

In other words, identifying all possibilities is going to be extremely difficult or downright impossible for anything other than a trivial regex. For example, generating all possibilities for .{16} (that is, all 16-character strings) would result in 3.4 * 10^38 possibilities to check. And that's if you limit yourself to ASCII characters. If you check 1,000,000 strings per second, it will take 1.07 * 10^25 years, or about 13,820,512,820 times the estimated age of the universe. Also, good luck finding a hard drive to hold that much data. You would need to convert a large chunk of the earth into binary storage.

A better solution is to generate a bunch of realistic strings that you might actually encounter, then create unit tests using them. As you go forward, you may find additional cases that should work but don't. So, write a new test, then revise the regex until all tests (old and new) pass.

Upvotes: 3

Related Questions