Reputation: 11915

How to parse word-created special chars in java

I am trying to parse some Word documents in Java. Some of the values are things like a date range, and instead of showing up like StartDate - EndDate I am getting some funky characters, like so:

StartDate ΓÇô EndDate

This is where Word puts in a special hyphen character. Can I search for these characters and replace them with a regular - or something in the string, so that I can then tokenize on a "-"? And what is that character: ASCII, Unicode, or what?

Edited to add some code:

String projDateString = "08/2010 ΓÇô Present"
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
println ("S: " + s)

println("projDatestring: " + projDateString)

Outputs the following:

S: 08/2010 ΓÇô Present
projDatestring: 08/2010 ΓÇô Present

Also, using the same projDateString above, if I do:

projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");

and then print out projDateString, it still prints as

projDatestring: 08/2010 ΓÇô Present

Upvotes: 4

Answers (4)

Misa

Reputation: 51

s = s.replace((char) 145, '\'');  // 0x91 in Windows-1252
s = s.replace((char) 8216, '\''); // U+2018, left single quote
s = s.replace((char) 146, '\'');  // 0x92 in Windows-1252
s = s.replace((char) 8217, '\''); // U+2019, right single quote
s = s.replace((char) 147, '"');   // 0x93 in Windows-1252
s = s.replace((char) 148, '"');   // 0x94 in Windows-1252
s = s.replace((char) 8220, '"');  // U+201C, left double quote
s = s.replace((char) 8221, '"');  // U+201D, right double quote
s = s.replace((char) 8211, '-');  // U+2013, en dash
s = s.replace((char) 150, '-');   // 0x96 in Windows-1252

http://www.coderanch.com/how-to/java/WeirdWordCharacters

Upvotes: 5

Stephen P

Reputation: 14800

You are probably getting Windows-1252, which is a single-byte Microsoft code page. (Torgamus: Googling for Windows-1232 didn't give me anything.)

Windows-1252 (which Java also calls "Cp1252") agrees with Unicode's first 256 code points almost everywhere, but keeps some printable characters in the 0x80 to 0x9f range. The En Dash is character 150 (0x96), which falls within the Unicode C1 control character range, so it shouldn't show up there in a correctly decoded string.

You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.

There are quite a few other characters that MS has in the 0x80 to 0x9f range, which is reserved for control characters in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.

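Edit: By the way, Java uses Unicode for characters internally (a String is a sequence of UTF-16 code units). UTF-8 is an encoding. When Java converts Strings to bytes for files or network connections without being told which charset to use, it uses the platform default charset, which is UTF-8 on Java 18 and later and depends on the operating system and locale on older versions.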
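To make the difference between a code point and its encodings concrete, here is a minimal sketch that encodes the same en dash under two charsets. The class name EnDashBytes is made up for illustration, and the sketch assumes the JDK in use supports the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same character encoded under two charsets.
final class EnDashBytes {

    // Renders a byte array as space-separated uppercase hex, e.g. "E2 80 93".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // a single char in memory

        // One byte in Windows-1252, three bytes in UTF-8.
        System.out.println(hex(enDash.getBytes(Charset.forName("windows-1252")))); // 96
        System.out.println(hex(enDash.getBytes(StandardCharsets.UTF_8)));          // E2 80 93
    }
}
```

The String itself is the same in both cases; only the bytes produced at the boundary differ.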

Say you have

String stuff = MSWordUtil.getNextChunkOfText();

Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to

File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file

By default, as you read byte buffers from the file and make Strings out of them, Java decodes them with the platform default charset. There are ways, as Lord Torgamus says, to tell Java which encoding should be used. Without doing that, the bytes may be decoded as something like ISO-8859-1, which matches Windows-1252 except for those pesky characters that are in the C1 control range.

After getting a String like stuff above that way, you won't find \u2013 or \u2014 in it; you'll find the characters 0x96 and 0x97 instead.

At that point you should be able to do

stuff = stuff.replaceAll("\u0096", "\u2013"); // Strings are immutable, so keep the result

I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
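The loop-and-lookup approach described above might look roughly like the sketch below. This is not the answerer's actual code: the class name Cp1252Fixer is invented, and instead of a hand-written array the table is built by asking the JDK to decode each byte from 0x80 to 0x9f as windows-1252 (assumed to be available).

```java
import java.nio.charset.Charset;

// Hypothetical helper: remaps C1 control characters (U+0080..U+009F) to the
// characters Windows-1252 assigns to the same byte values.
final class Cp1252Fixer {

    private static final char[] TABLE = buildTable();

    // Builds the 32-entry table by decoding each byte 0x80..0x9F as Windows-1252.
    private static char[] buildTable() {
        Charset cp1252 = Charset.forName("windows-1252");
        char[] table = new char[32];
        for (int i = 0; i < table.length; i++) {
            byte[] one = { (byte) (0x80 + i) };
            char decoded = new String(one, cp1252).charAt(0);
            // Bytes with no Windows-1252 assignment decode to U+FFFD; keep the original.
            table[i] = (decoded == '\uFFFD') ? (char) (0x80 + i) : decoded;
        }
        return table;
    }

    // Walks the input one char at a time and replaces only chars in 0x80..0x9F.
    static String fix(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(c >= 0x80 && c <= 0x9F ? TABLE[c - 0x80] : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = fix("08/2010 \u0096 Present");
        System.out.println(fixed.equals("08/2010 \u2013 Present")); // true
    }
}
```

This only helps when the String really contains the C1 characters, that is, when the bytes were decoded as ISO-8859-1 or similar.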

Upvotes: 6

Giulio Piancastelli

Reputation: 15808

Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
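The specific garbage in the question, ΓÇô, fits this explanation: it is what the three UTF-8 bytes of an en dash (E2 80 93) look like when they are decoded as code page 437, the legacy DOS/Windows console code page. A small sketch of that mismatch follows; the class name MojibakeDemo is invented, and it assumes the JDK provides the IBM437 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reproduces a write-as-X, read-as-Y mismatch.
final class MojibakeDemo {

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String misdecode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("IBM437"); // assumption: available in this JDK
        String garbled = misdecode("StartDate \u2013 EndDate", StandardCharsets.UTF_8, cp437);

        // Print code points rather than the text, so console encoding cannot interfere.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

If that is what is happening, the String may already hold a correct en dash and only the console output is wrong, which would also explain why searching for \u0096 finds nothing.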

If I remember correctly from when I did some work on character encodings in Java, String instances always use Unicode internally (stored as UTF-16 code units); so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write

s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');

where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.
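Applied to the date range from the question, the same idea could be sketched like this. The class and method names are made up for illustration, and the sketch assumes the dates themselves contain no hyphens. Note that replace returns a new String, so the result has to be kept.

```java
import java.util.Arrays;

// Hypothetical helper: normalizes Unicode dashes, then tokenizes on "-".
final class DateRangeSplitter {

    // Replaces en dash (U+2013) and em dash (U+2014) with an ASCII hyphen-minus.
    static String normalizeDashes(String s) {
        return s.replace('\u2013', '-').replace('\u2014', '-');
    }

    // Splits "start - end" into at most two parts, dropping whitespace around the dash.
    static String[] splitRange(String range) {
        return normalizeDashes(range).split("\\s*-\\s*", 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRange("08/2010 \u2013 Present")));
        // prints [08/2010, Present]
    }
}
```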

Upvotes: 1

Pops

Reputation: 30828

Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,

Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.

So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.

Windows-1252 is recognized by the IANA Charset Registry

Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None

so it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
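A minimal sketch of that constructor in use, assuming you already have the raw text bytes in hand (a binary .doc file contains much more than text, so this is not a .doc parser). The class name DecodeCp1252 and the sample bytes are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes Windows-1252 bytes and re-encodes them as UTF-8.
final class DecodeCp1252 {

    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Decodes raw Windows-1252 bytes into a Java String.
    static String decode(byte[] raw) {
        return new String(raw, WINDOWS_1252);
    }

    // Converts Windows-1252 bytes to the UTF-8 bytes for the same text.
    static byte[] toUtf8(byte[] raw) {
        return decode(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "A", space, 0x96 (the en dash in Windows-1252), space, "B"
        byte[] raw = { 'A', ' ', (byte) 0x96, ' ', 'B' };
        String text = decode(raw);
        System.out.println(text.indexOf('\u2013')); // 2
        System.out.println(toUtf8(raw).length);     // 7
    }
}
```

Once the bytes are decoded with the right charset, the String contains a real en dash (\u2013) and the usual replace calls will find it.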

Upvotes: 2
