Reputation: 4991
Given a huge list (over a thousand entry of random numbers from 00001 to 60031) what's the best way to match every single numbers and only those numbers.
A sample of the list 00001,00002,00003,00004,00006,00010,00016,00030,00039,00177,00187,00219,00239,00240,00241,00242,00245,00248,00250,00258,00260,....,20065,20069,20070,...,27005,27007,28006,29000,29400,30100,......
I know I could brute force it with a super long | statements and maybe a few tricks like range 0000[1-6]
but is there anything smarter? Yes I'm aware I could stick it in a script like python or like a database and match it accordingly but I have to do this in regex since the system is built that way.
What I have so far: (0000[1-46])|0001[06]|0003[09]|001[78]7|002[136]9|0024[0-58]|....
Upvotes: 0
Views: 254
Reputation:
Get this app http://www.regexformat.com and from the tools menu, run
Strings to Regex - Ternary tree
.
Drop your entire list (thousands) into the input box, select comma as delimiter
then generate.
It will create a regex that takes less than 5 steps to find any number
in the list and is extremely fast.
Screenshot and example strings 175,000 word Dictionary
The numbers are no more than the words you are searching for.
Regex from the sample you provided:
(?<!\d)(?:00(?:0(?:0(?:1|2|3|4|6)|1(?:0|6)|3(?:0|9))|1(?:77|87)|2(?:19|39|4(?:0|1|2|5|8)|5(?:0|8)|60))|2(?:00(?:6(?:5|9)|70)|700(?:5|7)|8006|9(?:000|400))|30100)(?!\d)
Expanded
(?<! \d )
(?:
00
(?:
0
(?:
0
(?: 1 | 2 | 3 | 4 | 6 )
| 1
(?: 0 | 6 )
| 3
(?: 0 | 9 )
)
| 1
(?: 77 | 87 )
| 2
(?:
19
| 39
| 4
(?: 0 | 1 | 2 | 5 | 8 )
| 5
(?: 0 | 8 )
| 60
)
)
| 2
(?:
00
(?:
6
(?: 5 | 9 )
| 70
)
| 700
(?: 5 | 7 )
| 8006
| 9
(?: 000 | 400 )
)
| 30100
)
(?! \d )
edit:
Adding a Benchmark sample to test performance.
Showing the results of using this regex with/without boundaries:
Note that almost all full trie regex generated using a ternary tree
will bench about the same, doesn't matter the content too much.
Regex1: (?<!\d)(?:00(?:0(?:0(?:1|2|3|4|6)|1(?:0|6)|3(?:0|9))|1(?:77|87)|2(?:19|39|4(?:0|1|2|5|8)|5(?:0|8)|60))|2(?:00(?:6(?:5|9)|70)|700(?:5|7)|8006|9(?:000|400))|30100)(?!\d)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 30
Elapsed Time: 1.75 s, 1747.23 ms, 1747235 µs
Regex2: (?:00(?:0(?:0(?:1|2|3|4|6)|1(?:0|6)|3(?:0|9))|1(?:77|87)|2(?:19|39|4(?:0|1|2|5|8)|5(?:0|8)|60))|2(?:00(?:6(?:5|9)|70)|700(?:5|7)|8006|9(?:000|400))|30100)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 30
Elapsed Time: 1.13 s, 1129.65 ms, 1129650 µs
(50,000 iter) * (30 matches/iter) = 1,500,000 matches
-------------------------------------------------------------------------------------
Regex1 with boundary: 1,500,000 matches / 1.75 seconds = 857,143 matches/second
Regex2 no boundary: 1,500,000 matches / 1.13 seconds = 1,327,434 matches/second
Upvotes: 1
Reputation: 542
Try this ...
^(?:[0-5][0-9]{4}|600[0-2][0-9]|6003[01])(?<!00000)$
The first alternate in the non-capturing group catches most numbers by matching exactly 5 digits requiring the first digit to be 5 or smaller (note that this matches 00000). The next two alternates deal with 60000-60031. The second group is a negative lookbehind and is there to exclude 00000 which would otherwise match.
Depending on the technology you are using the negative lookbehind might not be an option. PHP and python I believe support it and Java definitely does ... JavaScript I think does not but perhaps you could check that one value with a secondary check?
Upvotes: 0