Ashok Kumar

Reputation: 403

Regex to remove initials from full name

I have names like "D John Livingston" and "S. Jennifer Adstan", and I want to remove only the initials from them: "D" in the first name and "S." in the second. How can I do this with a Java regex?

Upvotes: 1

Views: 1893

Answers (2)

Tim Biegeleisen

Reputation: 521093

The following code snippet seems to be working well:

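// No initials here: the "O" in O'Connel is followed by an apostrophe, so it must survive.
String input = "John O'Connel";
// Remove capital-letter runs that end in a period, whitespace, or the end of the string.
input = input.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
System.out.println(input);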

John O'Connel

Your question is full of edge cases: an initial could be more than one letter, and it could appear at the start, middle, or end of the name. I replaced using the pattern \b[A-Z]+(?:\.|\s+|$), which matches a run of capital letters at a word boundary followed by a period, whitespace, or the end of the string, and which seems to at least cover your examples. I also call String#trim() to clean up whitespace left behind by initials at the very beginning or end.
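As a quick check, the same replaceAll call can be run against the two names from the question. This is only a minimal sketch: the class name InitialsDemo and the helper stripInitials are made up for illustration and are not part of the original answer.

```java
class InitialsDemo {
    // Hypothetical helper: removes runs of capital letters that are followed
    // by a period, whitespace, or the end of the string, then trims the result.
    static String stripInitials(String name) {
        return name.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripInitials("D John Livingston"));  // John Livingston
        System.out.println(stripInitials("S. Jennifer Adstan")); // Jennifer Adstan
        System.out.println(stripInitials("John O'Connel"));      // John O'Connel
    }
}
```

One thing to watch: a middle initial with a period, as in "John F. Kennedy", leaves two spaces behind with this pattern, so an extra whitespace cleanup may be needed for such names.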


Upvotes: 5

Patrick Parker

Reputation: 4959

For this I would consider using String replaceAll().

So how do we design the regex?

Basically there are three cases you need to consider:

  • A. a single letter at the beginning of the name (optional period), followed by one space
  • B. a single letter at the end of the name (optional period), preceded by one space
  • C. a single letter in the middle of the name (optional period), surrounded by two spaces

For the first two cases, you need to leave no spaces. So you would match one space and replace it with zero spaces.

For the last case, you need to leave one space. However, rather than handling this case explicitly, you may treat it as either A or B, since those will replace only one of the two spaces, leaving you with the desired number of spaces: 1.

So how do we combine case A and case B? With the alternation operator |.

To avoid grabbing a single letter out of a longer run of letters, use the word boundary marker \b on the side that is not delimited by a space character. (For cases A and B alone, I would normally have used ^ and $ to match the beginning and end of the string explicitly. However, since we also need to handle case C in the middle of the string, we should use the word boundary marker instead.)

And how do we represent the optional period? Since the period is a special character, it must be escaped: \. It is then marked as optional with a question mark: \.? However, there is still a problem: the A. in the middle of a name might be matched as just A, because the position between the letter and the period also counts as a word boundary. To prevent this, we make the optional period possessive: \.?+ A possessive quantifier never gives back what it has consumed, so the regex cannot backtrack to a match that leaves the period behind.
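The effect of the possessive quantifier can be seen in isolation by comparing two versions of the case B alternative on a name with a dotted middle initial. This is a sketch under assumptions: the class name PossessiveDemo and the constants GREEDY_B and POSSESSIVE_B are hypothetical names chosen for the illustration.

```java
class PossessiveDemo {
    // Case B with a greedy optional period: the engine may backtrack and drop the period.
    static final String GREEDY_B = " [A-Z]\\.?\\b";
    // Case B with a possessive optional period: a consumed period is never given back.
    static final String POSSESSIVE_B = " [A-Z]\\.?+\\b";

    static String strip(String name, String pattern) {
        return name.replaceAll(pattern, "");
    }

    public static void main(String[] args) {
        String name = "Jennifer A. Adstan";
        // Greedy: matches " A" only and strands the period.
        System.out.println(strip(name, GREEDY_B));     // Jennifer. Adstan
        // Possessive: refuses the partial match and leaves the name untouched.
        System.out.println(strip(name, POSSESSIVE_B)); // Jennifer A. Adstan
    }
}
```

With the possessive form, case B declines the middle initial entirely, which leaves it available for case A to remove together with its period and trailing space.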

Putting all of this together, our regex is: (\b[A-Z]\.?+ )|( [A-Z]\.?+\b). In a Java string literal, each backslash must itself be escaped, so every \ appears as \\.

Example code:

// Case A (initial followed by a space) | case B (initial preceded by a space)
String pattern = "(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)";
String input1 = "MC Hammer I Smash U";
String input2 = "S. Jennifer A. Adstan JR.";
System.out.println(input1.replaceAll(pattern, ""));
System.out.println(input2.replaceAll(pattern, ""));

Output:

MC Hammer Smash

Jennifer Adstan JR.
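The three cases listed earlier can also be checked one at a time, including the names from the question. This is a minimal sketch; the class name ThreeCasesDemo and the helper strip are hypothetical and only wrap the replaceAll call shown above.

```java
class ThreeCasesDemo {
    // Case A (initial + space) | case B (space + initial), possessive optional period.
    static final String PATTERN = "(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)";

    static String strip(String name) {
        return name.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("D John Livingston"));  // case A: John Livingston
        System.out.println(strip("S. Jennifer Adstan")); // case A: Jennifer Adstan
        System.out.println(strip("John Livingston D"));  // case B: John Livingston
        System.out.println(strip("John F. Kennedy"));    // case C via A: John Kennedy
        System.out.println(strip("John F Kennedy"));     // case C via B: John Kennedy
    }
}
```

In both case C inputs exactly one of the two surrounding spaces is removed, so a single space remains between the first and last name.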

Upvotes: 1
