Matt

Reputation: 2815

Java: Multi platform string encoding issue

I have an odd situation that I haven't figured out how to handle. We have developers working on multiple platforms: the primary platform is Linux, but we also have people working on OS X and Windows.

We have a set of tests that all build and run fine on Linux, but when we try to run them on OS X they fail. The failing assert tests that two strings are equal, but there is one character that doesn't seem to be the same character in the Mac environment. I am fairly certain that this is simply because the file is encoded one way and the expected string value, which is hard-coded, is encoded another. I was able to fix some other encoding issues by setting the JVM file.encoding through MAVEN_OPTS, but I have been stumped by this problem up to this point.

The structure looks something like this:

some.xml --> XSLT --> object

assertEquals("expected value", object.valueToTest());

Any insights on how to rectify this mismatch? Or even why it would be occurring in the first place?

The header on the XML file says it is encoded in UTF-8, but it is possible that the file is actually encoded differently on the file system. Is there a way for me to check what the actual encoding is?

Upvotes: 2

Views: 1130

Answers (4)

Adrian Pronk

Reputation: 13906

If the XML file starts with <?xml ... encoding="UTF-8"?> then you can be fairly confident that it's encoded as UTF-8 on the file system. Otherwise, open it in an editor that lets you see the raw bytes, e.g. Emacs's M-x find-file-literally.
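
If you'd rather check from Java, a minimal sketch like the one below dumps the file's first bytes so you can look for a BOM or multi-byte sequences (the file name some.xml is taken from the question; substitute your own path):

import java.io.FileInputStream;
import java.io.IOException;

public class DumpBytes {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("some.xml")) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            for (int i = 0; i < n; i++) {
                // EF BB BF at the start is a UTF-8 BOM;
                // FE FF or FF FE would indicate UTF-16
                System.out.printf("%02X ", buf[i]);
            }
            System.out.println();
        }
    }
}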

Alternatively, your Java source code might have a funny byte in the string literal that is represented differently in different encodings. By default, the compiler reads source code using the platform encoding. To get around this portability issue, you can write any non-ASCII character using \uXXXX notation. This is fine for native English speakers but can be a bit tiresome for everyone else!
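
For example, supposing the troublesome character were é (U+00E9):

// Decodes identically on every platform, whatever encoding
// the compiler uses to read the source file
String expected = "caf\u00E9"; // same as "café"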

EDIT: Off topic, but this reminded me of a curious file I found at work in a test case. It was an XML file that was encoded as ASCII/UTF-8, but the encoding tag said "UTF-16". It would look normal in simple editors like Notepad that didn't take account of the XML encoding directive, but would look bizarre in smart editors that read the file as UTF-16.

Upvotes: 1

McDowell

Reputation: 108899

Mostly, what Pete Kirkham said.

I was able to fix some other encoding issues by setting the JVM file.encoding through MAVEN_OPTS

Don't do this; it is not supported and may have unintended side-effects.

The correct way to specify source file encoding is in the pom.xml files.

<project>
  ...
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  ...
</project>

This ensures that the compiler decodes the source files consistently on all platforms and is equivalent to using javac -encoding X ...

More on encoding in source files here.

Upvotes: 1

mpontillo

Reputation: 13947

If the other platform is reading the character using a different encoding, you might see a failure like this.

How is the character represented in the file? You might try escaping any non-ASCII characters within string constants using \uXXXX notation.

This page also provides another clue as to why this may not be working. The default encoding on the Mac is "MacRoman", which is not a subset of UTF-8. Therefore, as you suspected, the character is likely being interpreted differently.
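
As a sketch of that mismatch, using é (U+00E9) as a stand-in for the offending character: the same two bytes decode to different strings under the two charsets (Java's canonical name for MacRoman is "x-MacRoman"):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MacRomanDemo {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // the UTF-8 encoding of é

        System.out.println(new String(bytes, StandardCharsets.UTF_8));        // prints é
        System.out.println(new String(bytes, Charset.forName("x-MacRoman"))); // prints √©
    }
}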

Upvotes: 1

Pete Kirkham

Reputation: 49311

The usual reason it occurs is that someone is using one of the old String <-> bytes conversions which doesn't take a parameter to specify the encoding.
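
For instance, roughly (caf\u00E9 is just an illustrative value):

import java.nio.charset.StandardCharsets;

public class ConversionDemo {
    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"

        // Platform-dependent: uses the default charset, so bytes written
        // this way on Linux (UTF-8) read back wrong on a MacRoman JVM
        byte[] risky = text.getBytes();

        // Portable: the encoding is explicit on both sides of the conversion
        byte[] safe = text.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(safe, StandardCharsets.UTF_8);

        System.out.println(roundTripped); // café everywhere
    }
}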

It's not impossible that it's an encoding issue in the source file, though I've only moved between Windows and Linux, so I've never seen it. In any case, you should use a Unicode escape for any code point above U+007F.

Upvotes: 1
