David Champion

Reputation: 275

Mask all SSNs with only a partial mask from a file with multiple SSNs

Let me start by admitting that I am terrible with regular expressions. I want to find every instance of a Social Security number in a string and mask everything except the dashes (-) and the last 4 digits of the SSN.

Example

String someStrWithSSN = "This is an SSN,123-31-4321, and here is another 987-65-8765";
Pattern formattedPattern = Pattern.compile("^\\d{9}|^\\d{3}-\\d{2}-\\d{4}$");
Matcher formattedMatcher = formattedPattern.matcher(someStrWithSSN);

while (formattedMatcher.find()) {
    // Here is my first issue: the pattern is never found
}

// My next issue is that my String should end up looking like this:
//     "This is an SSN,XXX-XX-4321, and here is another XXX-XX-8765"

The expected result is to find each SSN and replace it. The code above should produce the string "This is an SSN,XXX-XX-4321, and here is another XXX-XX-8765".

Upvotes: 2

Views: 1437

Answers (1)

Avi

Reputation: 2641

You can simplify this by doing something like the following:

String initial = "This is an SSN,123-31-4321, and here is another 987-65-8765";
String processed = initial.replaceAll("\\d{3}\\-\\d{2}(?=\\-\\d{4})","XXX-XX");
System.out.println(initial);
System.out.println(processed);

Output:

This is an SSN,123-31-4321, and here is another 987-65-8765
This is an SSN,XXX-XX-4321, and here is another XXX-XX-8765

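The regex \d{3}\-\d{2}(?=\-\d{4}) matches three digits, a dash, and two digits, but only when they are followed by a dash and four digits. That trailing part is a lookahead: it is checked but not consumed, so it is not part of the match. Using replaceAll with this regex therefore replaces only the first five digits and leaves the dash and the last four digits untouched, which gives the desired masking effect.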
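For comparison, here is a minimal sketch of the Matcher-based approach from the question. The pattern in the question starts both alternatives with ^ and ends with $, so it can only match an SSN at the very start of the input, which is why find() never succeeds on a sentence. Dropping the anchors lets find() locate each SSN, and a capturing group is another way to keep the last four digits. The class and method names (SsnFindDemo, anchoredFinds, countMatches, maskWithGroup) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SsnFindDemo {
    // The pattern from the question: both alternatives are anchored with ^.
    static final Pattern ANCHORED =
        Pattern.compile("^\\d{9}|^\\d{3}-\\d{2}-\\d{4}$");

    // Unanchored pattern: it can match anywhere in the input.
    // Group 1 holds the last four digits.
    static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-(\\d{4})");

    // Returns true if the anchored pattern from the question finds anything.
    static boolean anchoredFinds(String text) {
        return ANCHORED.matcher(text).find();
    }

    // Counts how many dashed SSNs the unanchored pattern finds.
    static int countMatches(String text) {
        Matcher m = SSN.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Masks with a capturing group: $1 re-inserts the last four digits.
    static String maskWithGroup(String text) {
        return SSN.matcher(text).replaceAll("XXX-XX-$1");
    }

    public static void main(String[] args) {
        String s = "This is an SSN,123-31-4321, and here is another 987-65-8765";
        System.out.println(anchoredFinds(s));
        System.out.println(countMatches(s));
        System.out.println(maskWithGroup(s));
    }
}
```

The lookahead version and the capturing-group version should give the same result on this input; which one to use is mostly a matter of taste.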

Edit:

If you also want 9 consecutive digits to be targeted by this replacement, you can do the following:

String initial = "This is an SSN,123-31-4321, and here is another 987658765";
String processed = initial.replaceAll("\\d{3}\\-\\d{2}(?=\\-\\d{4})","XXX-XX")
                       .replaceAll("\\d{5}(?=\\d{4})","XXXXX");
System.out.println(initial);
System.out.println(processed);

Output:

This is an SSN,123-31-4321, and here is another 987658765
This is an SSN,XXX-XX-4321, and here is another XXXXX8765

The regex \d{5}(?=\d{4}) matches five digits, but only when they are followed by four more digits. As before, the trailing part is a lookahead and is not consumed. The second replaceAll call targets these undashed sequences and masks their first five digits.
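One caveat with these two simple patterns is that nothing constrains what comes before or after the match, so longer digit runs and similar shapes get partially masked too. The small sketch below (the class name SimpleMaskPitfall and method simpleMask are hypothetical) applies the same two replacements to inputs that are not SSNs.

```java
class SimpleMaskPitfall {
    // The two simple replacements from above, applied in sequence.
    static String simpleMask(String text) {
        return text.replaceAll("\\d{3}-\\d{2}(?=-\\d{4})", "XXX-XX")
                   .replaceAll("\\d{5}(?=\\d{4})", "XXXXX");
    }

    public static void main(String[] args) {
        // A 10-digit number: the first five digits are masked anyway.
        System.out.println(simpleMask("1234567890"));   // XXXXX67890
        // A 4-2-4 shape: the regex matches starting at the second digit.
        System.out.println(simpleMask("1234-56-7890")); // 1XXX-XX-7890
    }
}
```

If inputs like these can appear in the data, the patterns need to check the characters around the match as well.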

Edit: Here's a more robust version of the previous regex, and a longer demonstration of how the new regex works:

String initial = "123-45-6789 is a SSN that starts at the beginning of the string, "
    + "and still matches. This is an SSN, 123-31-4321, and here is another 987658765. These "
    + "have 10+ digits, so they don't match: 123-31-43214, and 98765876545. "
    + "This (123-31-4321-blah) has 9 digits, but is followed by a dash, so it doesn't match. "
    + "-123-31-4321 is preceded by a dash, so it doesn't match as well. :123-31-4321 is "
    + "preceded by a non-colon/digit, so it does match. Here's a 4-2-4 non-SSN that would've "
    + "tricked the initial regex: 1234-56-7890. Here's two SSNs in parentheses: (777777777) "
    + "(777-77-7777), and here's four invalid SSNs in parentheses: (7777777778) (777-77-77778) "
    + "(777-778-7777) (7778-77-7777). At the end of the string is a matching SSN: "
    + "998-76-4321";
String processed = initial.replaceAll("(?<=^|[^-\\d])\\d{3}\\-\\d{2}(?=\\-\\d{4}([^-\\d]|$))","XXX-XX")
                       .replaceAll("(?<=^|[^-\\d])\\d{5}(?=\\d{4}($|\\D))","XXXXX");
System.out.println(initial);
System.out.println(processed);

Output:

123-45-6789 is a SSN that starts at the beginning of the string, and still matches. This is an SSN, 123-31-4321, and here is another 987658765. These have 10+ digits, so they don't match: 123-31-43214, and 98765876545. This (123-31-4321-blah) has 9 digits, but is followed by a dash, so it doesn't match. -123-31-4321 is preceded by a dash, so it doesn't match as well. :123-31-4321 is preceded by a non-colon/digit, so it does match. Here's a 4-2-4 non-SSN that would've tricked the initial regex: 1234-56-7890. Here's two SSNs in parentheses: (777777777) (777-77-7777), and here's four invalid SSNs in parentheses: (7777777778) (777-77-77778) (777-778-7777) (7778-77-7777). At the end of the string is a matching SSN: 998-76-4321

XXX-XX-6789 is a SSN that starts at the beginning of the string, and still matches. This is an SSN, XXX-XX-4321, and here is another XXXXX8765. These have 10+ digits, so they don't match: 123-31-43214, and 98765876545. This (123-31-4321-blah) has 9 digits, but is followed by a dash, so it doesn't match. -123-31-4321 is preceded by a dash, so it doesn't match as well. :XXX-XX-4321 is preceded by a non-colon/digit, so it does match. Here's a 4-2-4 non-SSN that would've tricked the initial regex: 1234-56-7890. Here's two SSNs in parentheses: (XXXXX7777) (XXX-XX-7777), and here's four invalid SSNs in parentheses: (7777777778) (777-77-77778) (777-778-7777) (7778-77-7777). At the end of the string is a matching SSN: XXX-XX-4321
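If the masking runs over many strings (for example, every line of a file), it may be worth compiling the two patterns once and wrapping them in a helper. This is only a sketch using the same two regexes as above, without the redundant dash escapes; the class name SsnMasker and the method mask are invented for the example.

```java
import java.util.regex.Pattern;

class SsnMasker {
    // Dashed form: must not be preceded or followed by a dash or a digit.
    private static final Pattern DASHED =
        Pattern.compile("(?<=^|[^-\\d])\\d{3}-\\d{2}(?=-\\d{4}([^-\\d]|$))");

    // Undashed form: exactly nine digits, not preceded by a dash or a digit.
    private static final Pattern PLAIN =
        Pattern.compile("(?<=^|[^-\\d])\\d{5}(?=\\d{4}($|\\D))");

    // Masks the first five digits of every SSN found in the text.
    static String mask(String text) {
        String dashedMasked = DASHED.matcher(text).replaceAll("XXX-XX");
        return PLAIN.matcher(dashedMasked).replaceAll("XXXXX");
    }

    public static void main(String[] args) {
        System.out.println(mask("(777777777) (777-77-7777)"));
        System.out.println(mask("123-31-43214 and 1234-56-7890"));
    }
}
```

Calling mask on each line keeps the regex compilation cost out of the loop, since String.replaceAll compiles its pattern on every call.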

Upvotes: 2
