Reputation: 2039

Java String Filter out unwanted characters

I have a string like this:

−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou "výborné" <značky>?!.

And I want to end up with this:

häagen-dazs a le & co jsou výborné značky.

Unlike in "How to filter string for unwanted characters using regex?", I want to keep accented characters (diacritics) in the string.

I use the following replaceAll call:

str.replaceAll("[¨%=;\\:\\(\\)\\$\\[\\]\\{\\}\\<\\>\\+\\*\\−\\@\\#\\~\\?\\!\\^\\'\\\"\\|\\/]", "");

Is this the correct approach?
Is there a simpler way to keep only alphanumeric characters (including accented ones), spaces, and the symbols & . - ?

Upvotes: 2

Answers (3)

cнŝdk

Reputation: 32145

You can loop over the characters of the input String and keep each one that matches a regex of allowed characters. Use the regex [a-zA-Z& \\-_\\.ýčéèêàâùû] to test each character individually.

This is the code you need:

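    String input = "−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou výborné <značky>?!";
    StringBuffer sb = new StringBuffer();
    // Test each character on its own against the whitelist of allowed characters.
    for (char c : input.toCharArray()) {
        // Lower-casing first lets upper-case accented letters match the listed lower-case ones.
        if ((Character.toString(c).toLowerCase()).matches("[a-zA-Z& \\-_\\.ýčéèêàâùû]")) {
            sb.append(c);
        }
    }
    System.out.println(sb.toString());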

Demo:

Here is a working demo that uses this code and gives the following output:

-hagen-dazs a le & co jsou výborné značky

Note:

It uses input.toCharArray() to get an array of chars and loop over it.
It uses (Character.toString(c).toLowerCase()).matches("[a-zA-Z& \\-_\\.ýčéèêàâùû]") to test if the iterated char matches the allowed characters Regex.
It uses a StringBuffer to construct a new String with only the allowed characters.
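
The character class above lists accented letters one by one, so any letter missing from the list (such as ä) is dropped. As a rough sketch of an alternative, and not part of the code above, the same loop can ask Character.isLetterOrDigit whether a code point is a letter or digit instead of matching a regex. The class name KeepAllowedChars and the method filter are hypothetical names chosen for illustration.

```java
class KeepAllowedChars {
    // Hypothetical helper (not from the answer above): keeps every Unicode
    // letter or digit, plus the space, '&', '.' and '-' characters.
    static String filter(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> Character.isLetterOrDigit(cp)
                     || cp == ' ' || cp == '&' || cp == '.' || cp == '-')
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes so the file
        // compiles the same way under any source encoding.
        String input = "[h\u00e4agen-dazs;:] a (le & co') jsou v\u00fdborn\u00e9 <zna\u010dky>?!";
        System.out.println(filter(input));
    }
}
```

For the shortened sample in main, this should print "häagen-dazs a le & co jsou výborné značky", assuming the accented letters are precomposed characters. Combining marks on their own are not letters and would be removed.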

Upvotes: 1

Wiktor Stribiżew

Reputation: 626893

You need to use

String res = input.replaceAll("(?U)[^\\p{L}\\p{N}\\s&.-]+", "");

The regex matches one or more characters (due to the + quantifier) other than the following (because [^...] is a negated character class):

\p{L} - any Unicode letter
\p{N} - any Unicode number (decimal digits as well as other numeric characters)
\s - any Unicode whitespace (\s becomes Unicode-aware due to (?U), the inline version of the Pattern.UNICODE_CHARACTER_CLASS modifier)
& - a literal &
. - a literal .
- - a literal hyphen (as it is placed at the end of the character class)

Java demo:

class Rextester
{
    public static void main(String[] args)
    {
        String input = "−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou výborné <značky>?!";
        // Remove every run of characters that is not a letter, number, whitespace, '&', '.' or '-'.
        input = input.replaceAll("(?U)[^\\p{L}\\p{N}\\s&.-]+", "");
        System.out.println(input);
    }
}

Output: -häagen-dazs a le & co jsou výborné značky
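
If the filter runs many times, one possible variation is to compile the pattern once and pass Pattern.UNICODE_CHARACTER_CLASS as a flag instead of the inline (?U). This is only a sketch: the class name UnicodeFilter, the constant UNWANTED, and the method clean are hypothetical names, not part of the demo above.

```java
import java.util.regex.Pattern;

class UnicodeFilter {
    // Same negated class as in the demo, with the flag passed to compile()
    // instead of the inline (?U) prefix.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\p{L}\\p{N}\\s&.-]+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: removes every run of characters outside the allowed set.
    static String clean(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes so the file
        // compiles the same way under any source encoding.
        System.out.println(clean("(le & co') jsou v\u00fdborn\u00e9 <zna\u010dky>?!"));
    }
}
```

String.replaceAll compiles its regex on every call, so holding a compiled Pattern in a constant avoids that repeated work while matching the same characters.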

Upvotes: 1

Maurizio Ricci

Reputation: 472

Try this:

str.replaceAll("[\\\\/:%!\\[\\](){}?^*+\"'#@$;¨=<>~−]", "");

Your regex had a syntax problem. I suggest building your regex step by step, so that you find out immediately when there is a mistake.
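
One pitfall that step-by-step testing can catch: inside a character class, a hyphen placed between two characters forms a range, not a literal hyphen. The sketch below illustrates this with a hypothetical helper named removes; it is an illustration only, not code from the question.

```java
class HyphenRangeCheck {
    // Hypothetical helper: true if replaceAll with the given character class
    // removes the whole of the given text.
    static boolean removes(String charClass, String text) {
        return text.replaceAll(charClass, "").isEmpty();
    }

    public static void main(String[] args) {
        // '>' is U+003E and '~' is U+007E, so [>-~] is a range that covers all ASCII letters.
        System.out.println(removes("[>-~]", "a"));  // true: 'a' lies inside the range
        // With the hyphen last, the class holds just the three literals '>', '~' and '-'.
        System.out.println(removes("[>~-]", "a"));  // false
        System.out.println(removes("[>~-]", "-"));  // true
    }
}
```

Placing the hyphen first or last in the class, or escaping it, keeps it literal.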

This site is very good for testing a regex in real time:

https://regex101.com/

Upvotes: 0
