Reputation: 2089
I'm trying to find a way to determine whether a string contains at least n characters in a specific order.
I am processing an enormous amount of hand-typed data, and the number of typos is pretty crazy.
I need to find text parts in a large string looking something like:
irrelevant text MONKEY, CHIMP: more irrelevant text
I need to find MONKEY, CHIMP:
The ways this gets mistyped are pretty crazy. Here is an extra weird example:
MonKEY , CHIMp :
I've got to a point in my regex where I'm able to find all of these occurrences. Probably not the nicest solution, but here it is:
(m|M)(o|O)(n|N)(k|K)(e|E)(y|Y)\s*,?\s*(c|C)(h|H)(i|I)(m|M)(p|P)\s*:
Looks a bit weird but it works.
Unfortunately the weirdness does not stop here. I need to amend this regex so that it also allows for one missing letter in each word.
That is, it should also match something like:
MonKEY , CIMp :
onKEY , ChIMp :
onKEY , CIMp :
I would think there should be a way to tell the regex that at least (word length - 1) of each word's characters must match, in order.
Is there a simple way to do this?
I've been looking into {4,} but I'm not sure this is the right direction or whether it could be applied here.
Thanks in advance, Peter
Upvotes: 0
Views: 2067
Reputation: 28305
With pure regex, the best you could do is something like this (whitespace added for readability):
/
^
(
monkey\s*,?\s*chimp\s*:
|
onkey\s*,?\s*chimp\s*:
|
mnkey\s*,?\s*chimp\s*:
|
...
)
$
/ix
However, this is a very long-winded approach and still won't account for all sorts of other fuzzy matches like "Monkey, Chinp:" or "Monkey; Chimp:".
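If you do go the alternation route, the variants don't have to be typed by hand. Here is a minimal sketch that builds them, assuming Java (the question doesn't name a language, but another answer links the Java Pattern docs). The class and method names are made up for illustration, and it only covers the "one missing letter per word" case, not substitutions like "Chinp":

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

final class OneDeletionPattern {

    // Returns the word itself plus every variant with exactly one letter removed.
    static Set<String> variants(String word) {
        Set<String> out = new LinkedHashSet<>();
        out.add(word);
        for (int i = 0; i < word.length(); i++) {
            out.add(word.substring(0, i) + word.substring(i + 1));
        }
        return out;
    }

    // Joins the variants into a non-capturing alternation, e.g. (?:monkey|onkey|...).
    static String alternation(String word) {
        return "(?:" + String.join("|", variants(word)) + ")";
    }

    // Builds the full label pattern. Assumes the words contain only letters,
    // so no regex escaping is applied.
    static Pattern labelPattern(String first, String second) {
        String regex = alternation(first) + "\\s*,?\\s*" + alternation(second) + "\\s*:";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = labelPattern("monkey", "chimp");
        for (String s : new String[] {"MonKEY , CIMp :", "onKEY , ChIMp :", "mon c:"}) {
            System.out.println(s + " -> " + p.matcher(s).find());
        }
    }
}
```

Note that find() will also hit inside longer words (for example "donkey, chimp:"), so you may want to anchor the pattern or add a word boundary depending on your data.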
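An alternative approach you could take is to first check the length of the candidate string. Note that \w would reject the comma, the colon, and any spaces, so match any character instead and adjust the bounds to the amount of whitespace you expect:
/^.{10,15}$/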
and then perform some very-fuzzy match on it:
/m?o?n?k?e?y?\s*,?\s*c?h?i?m?p?\s*:/i
However, you'd need to be careful here since there may be some bizarre results included in the match list, such as:
"mon c:"
I would recommend taking a different, non-regex approach: using a Levenshtein distance library. This will allow you to set generic boundaries on how closely the string needs to match "Monkey, Chimp:".
Upvotes: 2
Reputation: 54148
You can use a regex like this. It is not very pretty, but your example is unusual too.
First, use the case-insensitive flag: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE
I don't know of a solution that does it in one pass, but you can first check for "m?o?n?k?e?y?\s*,?\s*c?h?i?m?p?\s*:"
and then check the length in a separate test, which is easy.
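One way to sketch the two-step idea in Java: put each word in its own capture group, then check the captured length afterwards. Because the optional letters can only match in order, a group that is at most one character shorter than the word means at most one letter is missing. The class and method names are hypothetical, and I use \s* here so that whitespace is optional everywhere:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepMatch {

    // Step 1: every letter is optional; each word is captured in its own group.
    private static final Pattern LOOSE = Pattern.compile(
            "(m?o?n?k?e?y?)\\s*,?\\s*(c?h?i?m?p?)\\s*:",
            Pattern.CASE_INSENSITIVE);

    // Step 2: keep a match only if each word lost at most one letter.
    static List<String> findLabels(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = LOOSE.matcher(text);
        while (m.find()) {
            boolean firstOk = m.group(1).length() >= "monkey".length() - 1;
            boolean secondOk = m.group(2).length() >= "chimp".length() - 1;
            if (firstOk && secondOk) {
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findLabels("irrelevant text MonKEY , CIMp : more irrelevant text"));
        System.out.println(findLabels("mon c:"));
    }
}
```

The loop matters: the loose pattern happily matches junk such as a lone "e:" earlier in the text, so every match has to be checked rather than just the first one.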
Upvotes: 0
Reputation: 315
^\w{10,10}$ # allows words of exactly 10 characters. Set the count to the word length - 1, then make each of the characters optional.
Just {10} works as well; it is equivalent to {10,10}.
Upvotes: 0