Fuxi

Reputation: 7599

Regex: ignore extra characters

I'm trying to figure out how to detect extra characters within a spam word like:

pha.rmacy or vi*agra

Any ideas?

Upvotes: 5

Views: 247

Answers (3)

Marcelo Cantos

Reputation: 185862

That depends on how broadly you want to match. The following matches the word even when runs of characters that are neither whitespace nor word characters (letters, digits, underscore) are interspersed among its letters:

/p[^\s\w]*h[^\s\w]*a[^\s\w]*r[^\s\w]*m[^\s\w]*a[^\s\w]*c[^\s\w]*y/

You can build this regex in code. E.g., in Perl:

my $re = join("[^\\s\\w]*", split(//, "pharmacy"));
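For comparison, here is a rough Python sketch of the same construction. The function name `build_obfuscation_pattern` is made up for illustration, and `re.escape` is an added precaution in case the word list ever contains regex metacharacters.

```python
import re

def build_obfuscation_pattern(word):
    # Between every pair of adjacent letters, allow any run (possibly empty)
    # of characters that are neither whitespace nor word characters.
    filler = r"[^\s\w]*"
    return re.compile(filler.join(re.escape(ch) for ch in word))

pharmacy = build_obfuscation_pattern("pharmacy")
print(bool(pharmacy.search("cheap pha.rmacy online")))  # True
print(bool(pharmacy.search("cheap pha rmacy online")))  # False: whitespace is not accepted as filler
print(bool(pharmacy.search("cheap pha_rmacy online")))  # False: "_" counts as a word character
```

The last two lines show where this pattern stops matching, which may or may not be what you want.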

Ultimately, regexes probably won't satisfy all your requirements, though.

Upvotes: 1

Mark Wilkins

Reputation: 41222

Regular expressions do not seem like the right tool for this. Still, since the question is interesting, a simple approach would be something like this:

/v.?i.?a.?g.?r.?a/

It allows zero or one arbitrary character between each pair of adjacent letters.
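A quick way to see what this pattern does and does not catch is to try it on a few inputs. The Python sketch below uses sample strings chosen purely for illustration.

```python
import re

# "." matches any single character except a newline, and "?" makes it optional,
# so at most one extra character is tolerated between adjacent letters.
viagra = re.compile(r"v.?i.?a.?g.?r.?a")

for text in ["viagra", "vi*agra", "v.i.a.g.r.a", "viZagra", "vi**agra"]:
    print(text, bool(viagra.search(text)))
# viagra True
# vi*agra True
# v.i.a.g.r.a True
# viZagra True
# vi**agra False
```

Because `.` also matches letters, an inserted letter is tolerated, but two extra characters in a row are not.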

Upvotes: 2

João Silva

Reputation: 91329

You could use a (dis)similarity metric, such as edit distance. For instance, the edit distance between vi.agra and viagra is 1.

You then treat a given word as the spam word if the edit distance between them is below some threshold, say 2.
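As a minimal sketch of this idea, assuming the plain Levenshtein distance (insertions, deletions, and substitutions each cost 1), the check could look like this in Python. The names `edit_distance` and `looks_like` are hypothetical, and "below the threshold" is read as strictly less than.

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance, keeping only one previous row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def looks_like(word, spam_word, threshold=2):
    # Flag the word when its distance to the spam word is below the threshold.
    return edit_distance(word, spam_word) < threshold

print(edit_distance("vi.agra", "viagra"))  # 1
print(looks_like("viZagra", "viagra"))     # True
print(looks_like("vi**agra", "viagra"))    # False (distance is 2)
```

The threshold needs tuning: for short spam words, a small distance can also be reached by unrelated legitimate words.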

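But if you really want to use a regex, you can use something like /[^a-zA-Z0-9-\s]/ to strip punctuation (everything that is not a letter, digit, hyphen, or whitespace) from the word and then compare the result with the spam word. Then again, this fails to identify something like viZagra as the same word as viagra, because the inserted character is a letter.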
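A short Python sketch of the strip-then-compare approach illustrates both the idea and its limitation. The helper name `strip_punctuation` is made up here, and the character class is written with the hyphen last, which denotes the same set of characters.

```python
import re

def strip_punctuation(word):
    # Remove everything except letters, digits, whitespace, and hyphens.
    return re.sub(r"[^a-zA-Z0-9\s-]", "", word)

print(strip_punctuation("vi*agra") == "viagra")      # True
print(strip_punctuation("pha.rmacy") == "pharmacy")  # True
print(strip_punctuation("viZagra") == "viagra")      # False: the extra "Z" is a letter
print(strip_punctuation("vi-agra") == "viagra")      # False: hyphens are kept by this class
```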

Upvotes: 3
