Peter Jaloveczki

Reputation: 2089

Regex with allowing missing characters

I'm trying to find a way to determine whether a string contains at least n characters in a specific order.

I am processing an enormous amount of data written by hand, and the number of typos is pretty crazy.

I need to find text parts in a large string looking something like:

irrelevant text MONKEY, CHIMP: more irrelevant text

I need to find MONKEY, CHIMP:

The ways this gets mistyped are pretty crazy. Here is an extra weird example:

MonKEY , CHIMp :

I've got to a point where my regex is able to find all of these occurrences. Probably not the nicest solution, but here it is:

 (m|M)(o|O)(n|N)(k|K)(e|E)(y|Y),?\s+(c|C)(h|H)(i|I)(m|M)(p|P)(\s+)?:

Looks a bit weird but it works.

Unfortunately the weirdness does not stop here. I need to amend this regex so that it also allows for 1 missing letter in each word.

So I would need to amend this regex so it would also work for something like:

MonKEY , CIMp :

onKEY , ChIMp :

onKEY , CIMp :

I would think that there should be a way to tell the regex that it should require only wordlength-1 of the characters to match.

Is there a simple way to do this?

I've been looking into {4, } but I'm not sure this is the right direction or if it could be applied here.

Thanks in advance, Peter

Upvotes: 0

Views: 2067

Answers (3)

Tom Lord

Reputation: 28305

With pure regex, the best you could do is something like (whitespace added for readability):

/
  ^
  (
    monkey\s*,?\s*chimp\s*:
  |
    onkey\s*,?\s*chimp\s*:
  |
    mnkey\s*,?\s*chimp\s*:
  |
    ...
  )
  $
/ix

However, this is a very long-winded approach and still won't account for all sorts of other fuzzy-matches like "Monkey, Chinp:" or "Monkey; Chimp:".
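The alternation itself does not have to be typed out by hand. Here is a minimal sketch, assuming Java and assuming the words contain only plain letters (the class and method names are made up for illustration). It builds each word's alternatives by deleting one letter at a time, and uses `find()` rather than `^...$` anchors because the target sits inside a larger string:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

class OneMissingLetter {
    // Builds "(?:monkey|onkey|mnkey|...)": the word itself plus every
    // variant with exactly one letter removed. Assumes plain letters only,
    // so no regex escaping is done.
    static String variants(String word) {
        Set<String> alts = new LinkedHashSet<>();
        alts.add(word);
        for (int i = 0; i < word.length(); i++) {
            alts.add(word.substring(0, i) + word.substring(i + 1));
        }
        return "(?:" + String.join("|", alts) + ")";
    }

    static final Pattern PATTERN = Pattern.compile(
        variants("monkey") + "\\s*,?\\s*" + variants("chimp") + "\\s*:",
        Pattern.CASE_INSENSITIVE);

    // True if the text contains the target with at most one letter
    // missing from each word.
    static boolean contains(String text) {
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("irrelevant text MonKEY , CIMp : more text")); // true
        System.out.println(contains("onKEY , CIMp :"));                            // true
        System.out.println(contains("mon      c:"));                               // false
    }
}
```

This only covers missing letters. Substitutions such as "Chinp" or a semicolon in place of the comma still fall through.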


An alternative approach you could take is to first check the length of the string:

/^.{10,15}$/

and then perform some very-fuzzy match on it:

/m?o?n?k?e?y?\s*,?\s*c?h?i?m?p?\s*:/i

However, you'd need to be careful here since there may be some bizarre results included in the match list, such as:

"mon      c:"

I would recommend taking a different, non-regex approach and using a Levenshtein distance library. This will allow you to set generic boundaries on how closely the string needs to match "Monkey, Chimp:".
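To give an idea of what that looks like, here is a rough sketch in Java with a hand-written edit-distance function standing in for a library. The class name, the `normalize` step (lower-casing and stripping whitespace), and the threshold of 2 are all assumptions for illustration, and it presumes the candidate substring has already been cut out of the larger text:

```java
import java.util.Locale;

class FuzzyMatch {
    // Classic dynamic-programming Levenshtein distance: the minimum number
    // of single-character insertions, deletions, and substitutions needed
    // to turn a into b. Keeps only two rows of the table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Ignore case and whitespace so that only the letters and
    // punctuation count towards the distance.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
    }

    static boolean isClose(String candidate, String target, int maxDistance) {
        return levenshtein(normalize(candidate), normalize(target)) <= maxDistance;
    }

    public static void main(String[] args) {
        String target = "monkey, chimp:";
        System.out.println(isClose("onKEY , CIMp :", target, 2)); // true
        System.out.println(isClose("Monkey; Chimp:", target, 2)); // true
        System.out.println(isClose("mon      c:", target, 2));    // false
    }
}
```

Unlike the regex variants, this also tolerates substituted characters, at the cost of having to decide which substrings to test.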

Upvotes: 2

azro

Reputation: 54148

You can use a regex like this. It's not very pretty, but your example is strange too.

First, make the match case-insensitive: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE
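For instance, a small sketch (the class name is made up, and the separator is loosened to `\s*,?\s*` so that a space before the comma is accepted too) where the flag replaces all the `(m|M)(o|O)...` groups:

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // The CASE_INSENSITIVE flag does the upper/lower-case work, so the
    // words can be written once in lower case.
    static final Pattern PATTERN =
        Pattern.compile("monkey\\s*,?\\s*chimp\\s*:", Pattern.CASE_INSENSITIVE);

    static boolean contains(String text) {
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("irrelevant text MonKEY , CHIMp : more irrelevant text")); // true
        System.out.println(contains("MonKEY , CIMp :")); // false, a letter is missing
    }
}
```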

I don't know of a solution that does it in one pass, but you can first check for "m?o?n?k?e?y?\s*,?\s*c?h?i?m?p?\s*:" and then check the length in a separate test, which is easy.
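Roughly like this, as a sketch only: the class name is hypothetical, and checking the length of each captured word (rather than the whole match) is one possible reading of the length test, chosen here so that whitespace does not count towards the total:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalLetters {
    // Every letter is optional; the two words are captured so their
    // lengths can be checked afterwards.
    static final Pattern LOOSE = Pattern.compile(
        "(m?o?n?k?e?y?)\\s*,?\\s*(c?h?i?m?p?)\\s*:", Pattern.CASE_INSENSITIVE);

    // Step 1: loose regex match. Step 2: require at least wordlength-1
    // letters in each word (5 of "monkey", 4 of "chimp").
    static boolean contains(String text) {
        Matcher m = LOOSE.matcher(text);
        while (m.find()) {
            if (m.group(1).length() >= 5 && m.group(2).length() >= 4) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains("onKEY , CIMp :")); // true
        System.out.println(contains("mon      c:"));    // false
    }
}
```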

Upvotes: 0

Gilrich

Reputation: 315

^\w{10,10}$ matches a word of exactly 10 characters. Set the count to the word length minus 1, then make each of the characters optional.

I think just {10} works as well.

Upvotes: 0
