Dave A

Reputation: 2830

How do I remove copyright and other non-ASCII characters from my Java string?

I’m using Java 6 (not an option to upgrade at this time). I have a Java string that contains the following value:

My Product Edition 2014©

The last symbol is a copyright symbol (©). When this string is printed to my terminal (bash on Mac OS X 10.9.5), the copyright symbol renders as a question mark.

I’d like to know how to remove all characters from my string that will render as question marks on my terminal.

Upvotes: 0

Views: 2984

Answers (4)

padippist

Reputation: 1216

You can strip every character outside the printable ASCII range using a regular expression and replaceAll():

public static String asciiOnly(String unicodeString)
{
    String asciiString = unicodeString.replaceAll("[^\\x20-\\x7E]", "");
    return asciiString;
}

Here is an explanation of the regular expression "[^\\x20-\\x7E]":

  • ^ - Negates the character class ("not").
  • \\x20 - Hex value of the space character, which is the first printable ASCII character.
  • - - Denotes a range, i.e. x20 through x7E.
  • \\x7E - Hex value of ~, which is the last printable ASCII character.
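As a rough usage sketch of the method above (the class name AsciiOnlyDemo and the sample inputs are made up for illustration), note that the range x20 to x7E also excludes control characters, so tabs and newlines are removed along with the copyright symbol:

```java
// Hypothetical demo class wrapping the asciiOnly method from this answer.
final class AsciiOnlyDemo {
    // Removes every character outside the printable ASCII range 0x20-0x7E.
    static String asciiOnly(String unicodeString) {
        return unicodeString.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright symbol; it is outside the range and is dropped.
        System.out.println(asciiOnly("My Product Edition 2014\u00A9"));
        // Tab (0x09) and newline (0x0A) are below 0x20, so they are dropped too.
        System.out.println(asciiOnly("line1\nline2\tend"));
    }
}
```

If line breaks or tabs need to survive, the character class would have to be widened, for example to "[^\\x20-\\x7E\\t\\n\\r]".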



Upvotes: 1

dimo414

Reputation: 48874

The "right" thing to do here is to fix your terminal, so it doesn't print question marks. See "How do you echo a 4-digit Unicode character in Bash?" and try just echoing Unicode characters directly in your terminal. It may be as simple as ensuring your LANG environment variable is set to a UTF-8 locale (on my Mac, $LANG is en_US.UTF-8). You might also consider using a more full-featured terminal, like iTerm2.

If you really want to strip non-ASCII characters in Java instead, there are a number of equally reasonable ways to do so, but my preference is Guava's CharMatcher, e.g.:

String stripped = CharMatcher.ASCII.retainFrom(original);

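You could use a Pattern to strip undesirable characters, but (as demonstrated by the confusion here) it's more hassle than using Guava's out-of-the-box solution. Note that newer Guava releases expose this matcher through the CharMatcher.ascii() method rather than the CharMatcher.ASCII constant.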
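For comparison, if adding Guava is not an option, a precompiled Pattern might look something like the sketch below. The class and method names (AsciiStripper, stripNonAscii) are hypothetical; the character class keeps the full ASCII range 0x00 to 0x7F, so control characters such as tabs are retained. It uses only APIs available in Java 6.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a plain-JDK alternative to a Guava matcher.
final class AsciiStripper {
    // Compiled once and reused; matches any code point above 0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // Returns the input with all non-ASCII characters removed.
    static String stripNonAscii(String original) {
        return NON_ASCII.matcher(original).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripNonAscii("My Product Edition 2014\u00A9"));
    }
}
```

Compiling the pattern once avoids the recompilation that String.replaceAll performs on every call, which may matter if the method runs in a tight loop.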

Upvotes: 3

Ingo

Reputation: 36339

You'd do well to adopt the notion that there is no such thing as a "special character". However, there are a couple of reasons why some characters are not shown correctly.

Java keeps all strings in UTF-16 encoding internally. When you print a string, the characters are converted to the encoding of the corresponding output stream or writer. Unfortunately, the Java runtime tries to be smart and uses what is called the "default" encoding unless you explicitly request a specific one.
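The conversion described above can be illustrated with a small sketch (the class name EncodingDemo and the roundTrip helper are invented for this example). Encoding the string to US-ASCII replaces the unmappable copyright symbol with the charset's replacement byte, which is where the question mark comes from; a UTF-8 round trip preserves it. The code uses only calls available in Java 6.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of the original answer.
final class EncodingDemo {
    // Encodes text to bytes in the given charset, then decodes those bytes again.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "My Product Edition 2014\u00A9";
        // The encoding the runtime falls back to when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());
        // US-ASCII cannot represent the copyright symbol, so it is replaced.
        System.out.println(roundTrip(text, Charset.forName("US-ASCII")));
        // UTF-8 can represent it; what the terminal shows still depends on its setup.
        System.out.println(roundTrip(text, Charset.forName("UTF-8")));
    }
}
```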

This especially hurts Windows users, where the default encoding often turns out to be some archaic Microsoft "code page". I have yet to find out where I can tell Windows that I don't want their CP 850 (which is the default whenever you have a German keyboard).

In the long run, you'll fare best if you make the following a habit:

  1. Open all your output streams (or writers) with UTF-8 encoding. Don't use System.out/System.err directly.
  2. Make sure you use a terminal that can handle UTF-8. If you're on Windows, enter chcp 65001 to set the encoding of the cmd window to UTF-8, and use a font that can render the characters you need.
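The first habit might be sketched as follows, assuming that wrapping the standard output stream in a UTF-8 PrintStream is acceptable (the class name Utf8Output and the helper utf8Stream are hypothetical). The string-based charset name keeps the code compatible with Java 6.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper illustrating an explicitly UTF-8 encoded output stream.
final class Utf8Output {
    // Wraps the target in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Rendered correctly only if the terminal itself expects UTF-8.
        out.println("My Product Edition 2014\u00A9");
    }
}
```

This only controls the bytes Java writes; the terminal still has to interpret them as UTF-8, which is what the second habit addresses.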

Upvotes: 2

K139

Reputation: 3679

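If you want to remove special characters, you could do something like this (note that the pattern keeps only word characters and whitespace, so it also strips ASCII punctuation):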

String s = "My Product Edition 2014©";

s = s.replaceAll("[^\\w\\s]", "");

System.out.println(s);

Output:

My Product Edition 2014
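To show how this pattern treats ordinary punctuation, here is a small sketch with a made-up input string (the class name WordCharFilterDemo and the method keepWordAndSpace are hypothetical):

```java
// Hypothetical demo class showing what the [^\w\s] pattern removes.
final class WordCharFilterDemo {
    // Keeps only word characters (letters, digits, underscore) and whitespace.
    static String keepWordAndSpace(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        // The colon, dollar sign, period, and parentheses are removed with the
        // copyright symbol.
        System.out.println(keepWordAndSpace("Price: $9.99 (2014\u00A9)"));
        // prints "Price 999 2014"
    }
}
```

If punctuation should be preserved, a pattern restricted to non-ASCII characters is probably a better fit.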

Upvotes: 1
