oerich

Reputation: 688

Why does Maven give me different UTF-8 characters than Eclipse (test passes in Eclipse, fails in Maven)?

My current project is concerned with parsing natural language. One test reads text from a file, removes certain characters, and tokenizes the text into single words. The test then compares the number of unique words against an expected value. In Eclipse, this test is "green"; in Maven, I get a higher number of words than expected. Comparing the lists of words, I see the following additional words:

Looking at the text source, it contains the following characters which should be filtered away: “ ” ’

This works in Eclipse, but not in Maven. I am using UTF-8, and the files seem to be encoded correctly. In the Maven pom I specify the following:

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
</properties>

Edit: Here is the code that reads the file (which is, according to Eclipse, encoded as UTF-8).

        BufferedReader reader = new BufferedReader(
                new FileReader(this.file));
        String line = "";
        while ((line = reader.readLine()) != null) {
            // the csv contains a text and a classification
            String[] reqCatType = line.split(";");
            String reqText = reqCatType[0].trim();
            String reqCategory = reqCatType[1].trim();
            // the tokenizer also removes unwanted characters:
            String[] sentence = this.filter.filterStopWords(this.tokenizer
                    .tokenize(reqText));
            // we use this data to train a machine learning algorithm
            this.dataSet.learn(sentence, reqCategory);
        }
        reader.close();

Edit: The following information might be useful for analyzing the problem:

mvn -v
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"

Upvotes: 4

Views: 1154

Answers (1)

Tony K.

Reputation: 5605

So, your data file is in UTF-8. The Eclipse encoding setting on that file has no bearing at runtime; how the bytes are interpreted is decided by the running Java program.

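FileReader always uses the platform default encoding, which is generally a bad idea because the result then depends on where the program runs. Eclipse is likely setting the "platform default" for you, whereas Maven is not: your mvn -v output reports a platform encoding of MacRoman, so the UTF-8 bytes of “ ” ’ are decoded as MacRoman and no longer match the characters your filter removes.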
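If you want to confirm the difference, a small check like the following could be run once from Eclipse and once from Maven (for example from inside a test). The class and method names are made up for illustration; only `System.getProperty("file.encoding")` and `Charset.defaultCharset()` are standard JDK calls.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original project.
public class DefaultEncodingCheck {

    // Reports the encoding that FileReader would silently pick up.
    public static String describeDefaultEncoding() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaultEncoding());
    }
}
```

If the two environments print different values, the default encoding is the likely source of the extra words.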

Fix your code to specify the encoding.

See the JavaDoc for FileReader:

To specify these values yourself, construct an InputStreamReader on a FileInputStream.
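A minimal sketch of what that could look like, written to stay compatible with the Java 6 shown in your mvn -v output. The class name `Utf8LineReader` and the method `readLines` are placeholders for illustration; in your code you would only swap the `new FileReader(this.file)` expression for the `InputStreamReader` one and keep your loop as it is.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
public class Utf8LineReader {

    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Reads all lines of the file, decoding the bytes as UTF-8
    // regardless of the platform default encoding.
    public static List<String> readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), UTF_8));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            // Closing the outer reader also closes the underlying stream.
            reader.close();
        }
    }
}
```

With the encoding named explicitly, the same characters should reach the tokenizer under both Eclipse and Maven.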

Upvotes: 4
