Partially match a regular expression

Question

I'm trying to search for certain words in a sentence (using PHP). these words might be split up with spaces, for whatever reason. (for example 'alpha betical' instead of 'alphabetical'). I'm comparing each group of characters divided by spaces in that sentence to a certain regular expression separately, for reasons. therefore, I cannot match 'alpha betical' to 'alphabetical' because it would try to match 'alpha' and 'betical' separately. 'alpha' does match the regular expression ('alphabetical') partially, though; if 'betical' would be added, it would match.

I need something like Java's Matcher.hitEnd(). (Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher. When this method returns true, then it is possible that more input would have changed the result of the last search.) This question asks the same thing, plus a little more, but has no appropriate answer. I found this question which was answered, but only gives a solution that works for Java (mentioned in the start of this paragraph), and not PHP.

basically, if I'm matching 'alpha' to '/alphabetical/', I want something to tell me that it at least matches a part of the regular expression. (I am aware that in this case, I could switch them around and match alphabetical with '/^alpha/', but as I use it, the regular expression '/alphabetical/' would be a little more complex and therefore not suitable for the switch.. imagine something like '/[Aa]lpha-?betical(ly)?|[Ll]exicographical(ly)?/')

I know that regular expressions don't work partially, there's only matches or no matches. Is there a way to get what I want or do I have to go about my problem in an entirely different way?

mrjoltcola · Accepted Answer

A regex either matches, or it doesnt. It is a finite automata that completes or not. Now there are surely automata out there that can exit the graph at any node and return a "score", but they are non-standard.

You can add boolean logic by matching multiple regexes. Or by adding lookahead or lookbehind.

Why not just write your regex to make whitespace optional?

  /a\s*l\s*p\s*h\s*a\s*b\s*e\s*t\s*i\s*c\s*a\s*l/

matches all sorts of combinations:

  alpha betical
  al p habet i cal

If you are familiar with wildcard / prefix matching (such as SQL's LIKE function), it is pretty easy to implement. Would that suffice?

Consider a simple implementation of a string scan algorithm that doesn't use regex at all, but searches out and returns matches sorted by score, where score is the length of the match, and you can even specify a minimum score.

Example:

FindLike(haystack: s, needle: "alphabetical", minlen:5);

Should be straightforward to write a case-insensitive function to scan a string in an iterative fashion, using a search string as a prefix match, once you match the initial character, iterate both string indexes until one ends or mismatches, then return, or add the substring to results list, and continue.

That said, you might be interested in fuzzy logic or fuzzy matching or approximate matching.

http://laurikari.net/tre/about/

Fuzzy Regular Expressions

Partially match a regular expression

Answers (2)

Related Questions