namalfernandolk

Reputation: 9134

Replace non-UTF-compliant characters in a meaningful way rather than simply removing them

My application is malfunctioning because of special characters in strings in many areas.

Eg 1 : A ? character is displayed instead of ’.

Text :
The Hilton Paris La Defense hotel is located at the foot of the Grande Arche at the very heart of Europe’s largest business district and puts you in easy reach of some of Paris’ most famous attractions. Only a few minutes from the...

Eg 2 : Parser exception while parsing an XML document containing special characters (like ’, & etc.) using AXIOM.

XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(new StringBufferInputStream(responseXML));
OMElement documentElement = new StAXOMBuilder(parser).getDocumentElement();

I found many posts about removing such characters when they are found, e.g. "How to remove bad characters that are not suitable for utf8 encoding in MySQL?" and "remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java".

I'm using the following code to remove the non-UTF-compliant characters.

if (null == inString ) return null;

byte[] byteArr = inString.getBytes();

for ( int i=0; i < byteArr.length; i++ ) {
   byte ch= byteArr[i]; 
   if ( !(ch < 0x00FD && ch > 0x001F) || ch =='&' || ch=='#') {
      byteArr[i]=' ';
   }
}

return new String( byteArr );

But this leads to another problem: it removes informative characters like ’.

I want to replace them in a meaningful way rather than simply removing them. Eg : ’ can be replaced by ', & can be replaced by 'and' etc. Is there a standard way to do this rather than manually replacing them one by one?

Upvotes: 0

Views: 1692

Answers (1)

The javadoc for StringBufferInputStream says

Deprecated. This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.

Don't use it.
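
As a minimal sketch of what the javadoc recommends, the parser from the question can be fed from a StringReader, so the characters of the String reach the parser without a lossy char-to-byte conversion. The class name StringXmlExample and the helper textOf are hypothetical, made up for this illustration. The sketch only reads the text content with plain StAX; in the question's code the same parser object would be handed to StAXOMBuilder instead.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the original question or answer.
final class StringXmlExample {

    // Parses XML held in a String and returns all of its character data.
    static String textOf(String responseXML) throws XMLStreamException {
        // A Reader hands characters to the parser directly, so no
        // char-to-byte conversion (and no encoding guess) is involved.
        XMLStreamReader parser = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(responseXML));
        StringBuilder text = new StringBuilder();
        try {
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(parser.getText());
                }
            }
        } finally {
            parser.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // \u2019 is the right single quotation mark from the question's text.
        System.out.println(textOf("<desc>Europe\u2019s largest business district</desc>"));
    }
}
```

Note that a literal & in XML content still has to be escaped as &amp;amp; by whoever produces the XML; no choice of reader or stream fixes a document that is not well-formed.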

The file is read as bytes, no matter where it comes from. Never convert your data to a String if you need it as bytes in the first place.

If you're reading from a file, use a FileInputStream. (Avoid FileReader: before Java 11 it didn't allow you to specify the encoding and always used the platform default.)
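
A rough sketch of decoding bytes with an explicit encoding, assuming the data is UTF-8. The class name ExplicitEncodingExample and the helper readUtf8 are hypothetical. To stay self-contained, the demo uses a ByteArrayInputStream; with a real file you would pass a FileInputStream to the same method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of the original answer.
final class ExplicitEncodingExample {

    // Decodes the whole stream as UTF-8 (assumed encoding) rather than
    // relying on the platform default charset.
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for file contents; with a file, use new FileInputStream(path).
        byte[] utf8 = "Europe\u2019s".getBytes(StandardCharsets.UTF_8);
        String decoded = readUtf8(new ByteArrayInputStream(utf8));
        System.out.println(decoded.equals("Europe\u2019s"));
    }
}
```

For comparison, the question's inString.getBytes() and new String(byteArr) both use the platform default charset, which is one place where a character like ’ can turn into ?.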

Upvotes: 1
