Chris Tei

Reputation: 316

Java Regex - Remove Non-Alphanumeric characters except line breaks

I'm trying to remove all the non-alphanumeric characters from a String in Java but keep the carriage returns. I have the following regular expression, but it keeps joining words before and after a line break.

[^\\p{Alnum}\\s]

How would I be able to preserve the line breaks or convert them into spaces so that I don't have words joining?

An example of this issue is shown below:

Original Text

and refreshingly direct
when compared with the hand-waving of Swinburne.

After Replacement:

 and refreshingly directwhen compared with the hand-waving of Swinburne.

Upvotes: 0

Views: 2164

Answers (4)

Bunarro

Reputation: 1810

That's a perfect case for Guava's CharMatcher:

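import com.google.common.base.CharMatcher;
String input = "and refreshingly direct\n\rwhen compared with the hand-waving of Swinburne.";
// Keep letters, digits, and whitespace (including line breaks); drop everything else.
String output = CharMatcher.javaLetterOrDigit().or(CharMatcher.whitespace()).retainFrom(input);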

Output will be:

and refreshingly direct
when compared with the handwaving of Swinburne

Upvotes: 0

Chris Tei

Reputation: 316

I made a mistake with my code. I was reading in a file line by line and building the String, but didn't add a space at the end of each line. Therefore there were no actual line breaks to replace.
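For anyone hitting the same thing, here is a minimal sketch of the mistake and the fix. It assumes the text is read with BufferedReader.readLine(), which strips the line terminator. The class name LineJoinDemo and the helper readAll are made up for illustration, and a StringReader stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineJoinDemo {
    // Hypothetical helper: rebuilds the text line by line, appending the given
    // separator after each line because readLine() drops the line terminator.
    static String readAll(BufferedReader reader, String separator) {
        StringBuilder sb = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(separator);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "and refreshingly direct\nwhen compared with the hand-waving of Swinburne.";

        // Empty separator reproduces the problem: the words are already joined
        // before the regex ever runs.
        String broken = readAll(new BufferedReader(new StringReader(text)), "");
        System.out.println(broken.replaceAll("[^\\p{Alnum}\\s]", ""));

        // Re-adding a line break (or a space) keeps the words apart.
        String fixed = readAll(new BufferedReader(new StringReader(text)), "\n");
        System.out.println(fixed.replaceAll("[^\\p{Alnum}\\s]", ""));
    }
}
```

With the separator in place, the original regex [^\\p{Alnum}\\s] already behaves as intended, since \s keeps the line breaks.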

Upvotes: 0

Youcef LAIDANI

Reputation: 59960

You can use the regex [^a-zA-Z0-9\\n\\r], for example:

String result = str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");

Example

Input

aaze03.aze1654aze987  */-a*azeaze\n hello *-*/zeaze+64\nqsdoi

Output

aaze03aze1654aze987aazeaze
hellozeaze64
qsdoi
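Note that the output above also loses the spaces inside each line, because a space is not in the character class. A minimal sketch of the example as a runnable class, plus a variant that keeps spaces, might look like this. The class and method names are made up for illustration; the second regex simply adds a literal space to the class.

```java
class AsciiAlnumDemo {
    // Removes everything except ASCII letters, digits, LF and CR.
    static String strip(String str) {
        return str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");
    }

    // Variant (an assumption, not part of the answer above): also keeps spaces
    // so that words on the same line are not joined.
    static String stripKeepingSpaces(String str) {
        return str.replaceAll("[^a-zA-Z0-9 \\n\\r]", "");
    }

    public static void main(String[] args) {
        String input = "aaze03.aze1654aze987  */-a*azeaze\n hello *-*/zeaze+64\nqsdoi";
        System.out.println(strip(input));
        System.out.println(stripKeepingSpaces("hello, world\nfoo-bar"));
    }
}
```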

Upvotes: 0

Wiktor Stribiżew

Reputation: 626748

You may add the line break chars themselves to the negated character class instead of \s, as \s matches any whitespace:

String reg = "[^\\p{Alnum}\n\r]";

Or, you may use character class subtraction:

String reg = "[\\P{Alnum}&&[^\n\r]]";

Here, \P{Alnum} matches any non-alphanumeric and &&[^\n\r] prevents a LF and CR from matching.
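As a quick check that the two patterns behave the same, here is a small sketch that runs both on one input. The class name and constant names are made up for illustration.

```java
class ClassIntersectionDemo {
    // Negated class: anything that is not alphanumeric, LF or CR.
    static final String NEGATED = "[^\\p{Alnum}\n\r]";

    // Intersection form: non-alphanumeric AND not LF/CR.
    static final String INTERSECTED = "[\\P{Alnum}&&[^\n\r]]";

    public static void main(String[] args) {
        String s = "&&& Text\r\nNew line";
        String a = s.replaceAll(NEGATED, "");
        String b = s.replaceAll(INTERSECTED, "");
        // Both patterns remove the same characters from this input.
        System.out.println(a.equals(b)); // prints "true"
    }
}
```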

A Java test:

String s = "&&& Text\r\nNew line".replaceAll("[^\\p{Alnum}\n\r]+", "");
System.out.println(s);
// => Text
//    Newline

Note that there are more line break chars than LF and CR. In Java 8, the \R construct matches any style of line break; it is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].

So, to exclude matching any line breaks, you may use

String reg = "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+";
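The question also asks about converting line breaks into spaces. A minimal sketch of both options, assuming Java 8 or later for \R, could look like this. The class and method names are made up for illustration, and collapsing each run of line breaks into a single space is a design choice, not something the question specifies.

```java
class LineBreakDemo {
    // Option 1: turn each run of line breaks into one space, then drop
    // everything that is not alphanumeric or a space.
    static String breaksToSpaces(String s) {
        return s.replaceAll("\\R+", " ").replaceAll("[^\\p{Alnum} ]", "");
    }

    // Option 2: keep every line break char from the list above and drop all
    // other non-alphanumeric chars (this removes ordinary spaces too).
    static String keepAllBreaks(String s) {
        return s.replaceAll(
            "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+", "");
    }

    public static void main(String[] args) {
        String text = "and refreshingly direct\r\nwhen compared with the hand-waving of Swinburne.";
        System.out.println(breaksToSpaces(text));
        // prints: and refreshingly direct when compared with the handwaving of Swinburne
        System.out.println(keepAllBreaks(text));
    }
}
```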

Upvotes: 3
