Reputation: 2830
I’m using Java 6 (not an option to upgrade at this time). I have a Java string that contains the following value:
My Product Edition 2014©
The last symbol is a copyright symbol (©). When this string outputs to my terminal (using bash on Mac 10.9.5), the copyright symbol renders as a question mark.
I’d like to know how to remove all characters from my string that will render as question marks on my terminal.
Upvotes: 0
Views: 2984
Reputation: 1216
You can strip all characters other than printable ASCII characters using a regex and replaceAll():
public static String asciiOnly(String unicodeString)
{
String asciiString = unicodeString.replaceAll("[^\\x20-\\x7E]", "");
return asciiString;
}
Here is the explanation of the regular expression "[^\\x20-\\x7E]":
^ - Not
\\x20 - Hex value representing space, which is the first printable ASCII character
- - Represents "to", i.e. \\x20 to \\x7E
\\x7E - Hex value representing ~, which is the last printable ASCII character
Upvotes: 1
Reputation: 48874
The "right" thing to do here is to fix your terminal, so it doesn't print question marks. See "How do you echo a 4-digit Unicode character in Bash?" and try just echoing Unicode characters directly in your terminal. It may be as simple as ensuring your LANG environment variable is set to a UTF-8 locale (on my Mac, $LANG is en_US.UTF-8). You might also consider using a more full-featured terminal, like iTerm2.
If you really want to strip non-ASCII characters in Java instead, there are a number of equally reasonable ways to do so, but my preference is Guava's CharMatcher, e.g.:
String stripped = CharMatcher.ASCII.retainFrom(original);
You could use a Pattern to strip undesirable characters, but (as demonstrated by the confusion here) it's more hassle than using Guava's out-of-the-box solution.
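For comparison, here is a minimal sketch of the Pattern approach using only the JDK. The class and method names are made up for illustration, and the character class assumes "ASCII" means the full 7-bit range (U+0000 to U+007F), so control characters such as tabs and newlines are kept:

```java
import java.util.regex.Pattern;

public class AsciiPatternStripper {
    // Hypothetical helper: matches any char outside the 7-bit ASCII range.
    // Compiled once so the regex is not re-parsed on every call.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    public static String stripNonAscii(String input) {
        return NON_ASCII.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright sign; the escape avoids depending on the
        // encoding of the source file.
        System.out.println(stripNonAscii("My Product Edition 2014\u00A9"));
    }
}
```

This compiles on Java 6. The only real difference from calling String.replaceAll() directly is that the pattern is compiled once and reused.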
Upvotes: 3
Reputation: 36339
You had better adopt the notion that there is no such thing as a "special character". However, there are a couple of reasons why some characters are not shown correctly.
Java keeps all strings in UTF-16 encoding internally. When you print a string, the characters are converted to the encoding of the corresponding output stream or writer. Unfortunately, the Java runtime tries to be smart and uses what is called the "default" encoding unless you explicitly request a specific one.
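As a rough sketch of what requesting a specific encoding can look like, the snippet below wraps an output stream in a PrintStream that always encodes as UTF-8. The class and method names are invented for this example, and it only helps if the terminal itself is set up to interpret UTF-8:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Output {
    // Hypothetical helper: returns an auto-flushing PrintStream that encodes
    // text as UTF-8 regardless of the platform default encoding.
    public static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        out.println("My Product Edition 2014\u00A9");
    }
}
```

The three-argument PrintStream constructor is available in Java 6, so no upgrade is needed for this.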
This especially hurts Windows users, where the default encoding often turns out to be some archaic Microsoft "code page". I have yet to find out where I can tell Windows that I don't want its CP 850 (which is the default whenever you have a German keyboard).
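If you want to see which default encoding the runtime picked, and drop only the characters that encoding cannot represent (those are typically the ones that come out as "?"), something along these lines may work. It is a sketch, not a definitive solution: the class and method names are hypothetical, and it assumes System.out really uses Charset.defaultCharset(), which is not guaranteed in every environment.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodableFilter {
    // Hypothetical helper: keeps only the code points that the given
    // charset is able to encode, dropping everything else.
    public static String keepEncodable(String input, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder result = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            int length = Character.charCount(codePoint);
            // Test whole code points so surrogate pairs are not split.
            String piece = input.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                result.append(piece);
            }
            i += length;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("Default charset: " + defaultCharset.name());
        System.out.println(keepEncodable("My Product Edition 2014\u00A9", defaultCharset));
    }
}
```

With a US-ASCII default, the copyright sign is removed; with a UTF-8 default, the string passes through unchanged, and any remaining display problem lies with the terminal rather than with Java.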
In the long run, you'll fare best when you make the following a habit:
chcp 65001
to set the encoding of the cmd window to UTF-8, and use a font that can render the Unicode characters.
Upvotes: 2
Reputation: 3679
If you want to remove special characters, you could do something like this:
String s = "My Product Edition 2014©";
s = s.replaceAll("[^\\w\\s]", "");
System.out.println(s);
Output:
My Product Edition 2014
Upvotes: 1