Reputation: 7599
I'm trying to figure out how to detect extra characters within a spam word like:
pha.rmacy
or vi*agra
Any ideas?
Upvotes: 5
Views: 247
Reputation: 185862
That depends on how broadly you want to match. The following allows any run of characters that are neither whitespace nor word characters (typically punctuation and symbols) between those letters:
/p[^\s\w]*h[^\s\w]*a[^\s\w]*r[^\s\w]*m[^\s\w]*a[^\s\w]*c[^\s\w]*y/
You can build this regex in code. E.g., in Perl:
# Join the letters with a "zero or more non-space, non-word characters" filler.
my $re = join('[^\s\w]*', split(//, 'pharmacy'));
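For comparison, here is a minimal Python sketch of the same idea. The helper name `obfuscation_pattern` is made up for illustration, and the sample strings are assumptions, not real spam data.

```python
import re

def obfuscation_pattern(word):
    """Compile a regex that tolerates punctuation runs between the letters of `word`."""
    # Filler: zero or more characters that are neither whitespace nor word characters.
    filler = r"[^\s\w]*"
    return re.compile(filler.join(re.escape(ch) for ch in word))

pharmacy_re = obfuscation_pattern("pharmacy")
print(bool(pharmacy_re.search("cheap pha.rmacy online")))  # True
print(bool(pharmacy_re.search("ph_armacy")))               # False: "_" counts as a word character
```

Note that letters, digits and underscores used as filler slip through, because `\w` covers them.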
Ultimately, regexes probably won't satisfy all your requirements, though.
Upvotes: 1
Reputation: 41222
Regular expressions do not seem like the right tool for this. But since the question is interesting, a simple approach would be something like this:
/v.?i.?a.?g.?r.?a/
It allows zero or one arbitrary character between each pair of adjacent letters.
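A quick way to see what that pattern accepts and rejects, sketched in Python (the test strings are illustrative assumptions):

```python
import re

# Same pattern as above: at most one extra character between adjacent letters.
viagra_re = re.compile(r"v.?i.?a.?g.?r.?a")

for text in ["viagra", "vi*agra", "viZagra", "vi**agra"]:
    print(text, bool(viagra_re.search(text)))
# viagra True
# vi*agra True
# viZagra True
# vi**agra False  (two extra characters in one gap)
```

Because `.` accepts letters as well as punctuation, this catches viZagra, but it also gives up as soon as two characters are inserted in the same place.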
Upvotes: 2
Reputation: 91329
You could use a (dis)similarity metric, such as edit distance. For instance, the edit distance between vi.agra and viagra is 1.
You then treat a given word as a match for the spam word if the edit distance between them is below some threshold, say 2.
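As a rough sketch of that approach, assuming Levenshtein distance as the edit distance and a hypothetical helper `looks_like` for the threshold check:

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a matching character
        prev = curr
    return prev[-1]

def looks_like(word, spam_word, threshold=2):
    """True if `word` is within fewer than `threshold` edits of `spam_word`."""
    return edit_distance(word, spam_word) < threshold

print(edit_distance("vi.agra", "viagra"))  # 1
print(looks_like("viZagra", "viagra"))     # True
print(looks_like("via", "viagra"))         # False
```

The threshold is a judgment call: a larger value catches heavier obfuscation but also starts matching unrelated words that happen to be a few edits away.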
But if you really want to use a regex, you can delete everything matching something like /[^a-zA-Z0-9-\s]/ to strip punctuation from the word (hyphens and whitespace are kept). Then again, that fails to identify something like viZagra as the same word as viagra.
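A small Python sketch of the strip-then-compare idea and its blind spot. The helper name `strip_punctuation` is hypothetical, and the hyphen is moved to the end of the character class, which should match the same characters as the regex above:

```python
import re

# Anything that is not a letter, digit, whitespace or hyphen counts as punctuation here.
PUNCTUATION = re.compile(r"[^a-zA-Z0-9\s-]")

def strip_punctuation(word):
    """Remove every character matched by PUNCTUATION."""
    return PUNCTUATION.sub("", word)

print(strip_punctuation("vi*agra") == "viagra")  # True
print(strip_punctuation("viZagra") == "viagra")  # False: the inserted "Z" is a letter, so it stays
```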
Upvotes: 3