Reputation: 3
I'm trying to search for certain words in a sentence (using PHP). these words might be split up with spaces, for whatever reason. (for example 'alpha betical' instead of 'alphabetical'). I'm comparing each group of characters divided by spaces in that sentence to a certain regular expression separately, for reasons. therefore, I cannot match 'alpha betical' to 'alphabetical' because it would try to match 'alpha' and 'betical' separately. 'alpha' does match the regular expression ('alphabetical') partially, though; if 'betical' would be added, it would match.
I need something like Java's Matcher.hitEnd(). (Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher. When this method returns true, then it is possible that more input would have changed the result of the last search.) This question asks the same thing, plus a little more, but has no appropriate answer. I found this question which was answered, but only gives a solution that works for Java (mentioned in the start of this paragraph), and not PHP.
basically, if I'm matching 'alpha'
to '/alphabetical/'
, I want something to tell me that it at least matches a part of the regular expression. (I am aware that in this case, I could switch them around and match alphabetical
with '/^alpha/'
, but as I use it, the regular expression '/alphabetical/'
would be a little more complex and therefore not suitable for the switch.. imagine something like '/[Aa]lpha-?betical(ly)?|[Ll]exicographical(ly)?/'
)
I know that regular expressions don't work partially, there's only matches or no matches. Is there a way to get what I want or do I have to go about my problem in an entirely different way?
Upvotes: 0
Views: 463
Reputation: 20842
A regex either matches, or it doesnt. It is a finite automata that completes or not. Now there are surely automata out there that can exit the graph at any node and return a "score", but they are non-standard.
You can add boolean logic by matching multiple regexes. Or by adding lookahead or lookbehind.
Why not just write your regex to make whitespace optional?
/a\s*l\s*p\s*h\s*a\s*b\s*e\s*t\s*i\s*c\s*a\s*l/
matches all sorts of combinations:
alpha betical
al p habet i cal
If you are familiar with wildcard / prefix matching (such as SQL's LIKE function), it is pretty easy to implement. Would that suffice?
Consider a simple implementation of a string scan algorithm that doesn't use regex at all, but searches out and returns matches sorted by score, where score is the length of the match, and you can even specify a minimum score.
Example:
FindLike(haystack: s, needle: "alphabetical", minlen:5);
Should be straightforward to write a case-insensitive function to scan a string in an iterative fashion, using a search string as a prefix match, once you match the initial character, iterate both string indexes until one ends or mismatches, then return, or add the substring to results list, and continue.
That said, you might be interested in fuzzy logic or fuzzy matching or approximate matching.
http://laurikari.net/tre/about/
Upvotes: 3
Reputation: 41838
Your question is vast, and this answer focuses on this part:
If I'm matching 'alpha' to '/alphabetical/', I want something to tell me that it at least matches a part of the regular expression.
Two Options
There are several ways to do this. Whichever way you choose, you will need to build the patterns programmatically.
A General Option
Here is a general way that I like to use because it is straighforward. It is a series of optional lookaheads that look further and further down the string. Inside each lookahead is a capturing group.
^(?=(a))?(?=(al))?(?=(alp))?(?=(alph))?(?=(alpha))?(?=(alphab))?(?=(alphabe))?(?=(alphabet))?(?=(alphabeti))?(?=(alphabetic))?(?=(alphabetica))?(?=(alphabetical))?(?=(alphabetical$))?
The highest capture group that is set tells you how far we matched. For instance, for alpha
, (?=(alpha))
would succeed, and Group 5 would be set (as well as groups 1, 2, 3, 4, 5).
This works in PCRE. In some engines you would need to wrap the lookarounds like so: (?:(?=(a)))?
And in some engines it wouldn't work at all.
An Option for Mutually-Exclusive Tokens
Here is another way suggested by @CasimirEtHippolyte elsewhere, and that is beautifully compact. It works when tokens cannot "eat up" text that would have been matched by following tokens, which is the case here.
^(a(l(p(h(a(b(e(t(i(c(a(l($)?)?)?)?)?)?)?)?)?)?)?)?)?
You inspect which capture groups were set. The largest capture group that was set tells you how many letters were matched.
Upvotes: 3