Julian

Reputation: 187

Regex help specific to SpamAssassin

I'm trying to create a filter for social security numbers and have the following regex:

\b(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b

The problem is that in SpamAssassin the regex also matches inside strings like the following, and I haven't been able to prevent it:

18-007-08-9056-1462-2205

I would like it to match only if the SSN string is on its own. Examples:

18 007-08-9056 1462-2205
007-08-9056
xyz 007-08-9056
007-08-9056 xyz

Upvotes: 0

Views: 387

Answers (3)

Adam Katz

Reputation: 16138

\b(?<![.-])(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b(?![.-])

This is the same as your regex, but it also excludes surrounding dashes and dots (feel free to add to those character classes, but ensure that the dash (-) is always at the end or else it'll create a range).

\b matches a word boundary. You probably know this, but that means one side of it must be a word character (a letter, digit, or underscore) and the other side must not be (it may be a non-word character, a line break, or nothing at all at the beginning or end of the string). You want this, but you want to exclude a few more things too. Therefore:

\b(?<![.-]) means that after the word break, check the previous character (if any). It must not match [.-] (a single character that is either dot or dash).

\b(?![.-]) means that after the word break, the next character (if any) must not match [.-].

When I say "if any" I am referring to the possibility that there is a line break, start of file, or end of file instead. Those will all satisfy these negative lookarounds.
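To make the difference concrete, here is a minimal sketch in Python rather than Perl (both regexes are also valid for Python's `re` module). The names `SSN_CORE`, `ORIGINAL`, `GUARDED`, and `find_ssns` are made up for this illustration, and behavior inside SpamAssassin itself may differ depending on how the rule is applied to the message.

```python
import re

# The shared SSN body from the question, without any boundary checks.
SSN_CORE = r"(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}"

# The question's pattern: word boundaries only.
ORIGINAL = re.compile(r"\b" + SSN_CORE + r"\b")

# The pattern from this answer: word boundaries plus "no dot or dash next to it".
GUARDED = re.compile(r"\b(?<![.-])" + SSN_CORE + r"\b(?![.-])")


def find_ssns(pattern, text):
    """Return every substring of text that the compiled pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


samples = [
    "18-007-08-9056-1462-2205",  # should NOT match: SSN is buried in dashes
    "18 007-08-9056 1462-2205",
    "007-08-9056",
    "xyz 007-08-9056",
    "007-08-9056 xyz",
]

for s in samples:
    print(repr(s), find_ssns(ORIGINAL, s), find_ssns(GUARDED, s))
```

For the first sample the original pattern still finds 007-08-9056, because \b is satisfied between a hyphen and a digit, while the guarded pattern finds nothing. The other four samples match under both.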

See also this full regex explanation, with examples, at regex101

Upvotes: 1

Ashton Wiersdorf

Reputation: 2010

Your problem is that \b matches at any word boundary, and - is a non-word character, so there is a word boundary between a digit and a hyphen. You can try something like this:

(?:^|[^-\d])((?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4})(?:$|[^-\d])

The match will then be available in $1. You might be able to find a more elegant solution based on your specific kind of input strings. (E.g. will the SSN always have whitespace around it? If so, you can use \s, etc.)
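As a rough illustration of why the capture group matters, here is a small Python sketch (Python's `group(1)` plays the role of Perl's $1). `CAPTURING` and `extract_ssn` are names invented for this example, not anything from SpamAssassin.

```python
import re

# Same pattern as above, split across lines for readability.
CAPTURING = re.compile(
    r"(?:^|[^-\d])"                                                # start, or a char that is not dash/digit
    r"((?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4})"  # group 1: the SSN itself
    r"(?:$|[^-\d])"                                                # end, or a char that is not dash/digit
)


def extract_ssn(text):
    """Return the first standalone SSN in text, or None if there is none."""
    m = CAPTURING.search(text)
    return m.group(1) if m else None


print(extract_ssn("xyz 007-08-9056"))           # group 1 excludes the leading space
print(extract_ssn("18-007-08-9056-1462-2205"))  # no standalone SSN here
```

One thing to be aware of: the delimiter characters are consumed by the match, so the whole match (group 0) includes them, and two SSNs separated by a single space will only yield the first one when scanning repeatedly (e.g. `CAPTURING.findall("007-08-9056 123-45-6789")` returns just the first). Lookarounds avoid that because they don't consume anything.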

Upvotes: 3

Grinnz

Reputation: 9231

The \b assertion is a word boundary - it matches any location that transitions from a word character to a non-word character, or vice versa. Digits are word characters, and hyphens are not, so \b matches between a digit and a hyphen. To specify a whitespace boundary, you can use lookarounds:

(?<!\S)(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}(?!\S)

This specifies that there is no non-space character before the pattern, and no non-space character after. The lookaround allows you to specify this while still matching at the beginning or end of the string.
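Here is a quick way to check this against the examples from the question, sketched in Python (the lookarounds behave the same way as in Perl for this pattern). `WS_BOUNDED` and `find_standalone_ssns` are hypothetical names used only for this illustration.

```python
import re

# SSN that must be delimited by whitespace or the string edges on both sides.
WS_BOUNDED = re.compile(
    r"(?<!\S)"                                                   # no non-space char right before
    r"(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}"  # the SSN itself
    r"(?!\S)"                                                    # no non-space char right after
)


def find_standalone_ssns(text):
    """Return all SSNs in text that stand alone between whitespace/string edges."""
    # The pattern has no capture groups, so findall returns the full matches.
    return WS_BOUNDED.findall(text)


for s in [
    "18-007-08-9056-1462-2205",
    "18 007-08-9056 1462-2205",
    "007-08-9056",
    "xyz 007-08-9056",
    "007-08-9056 xyz",
]:
    print(repr(s), find_standalone_ssns(s))
```

Note the trade-off: because any non-space neighbour blocks the match, an SSN directly followed by punctuation (such as "007-08-9056.") or preceded by a label without a space (such as "SSN:007-08-9056") will not match either. Whether that is acceptable depends on your mail.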

Upvotes: 3
