loteck

Reputation: 413

HashMap destroys encoding?

I have to admit that I'm not really an expert on encodings. I have the following problem: my program has to read a text file that contains not only standard ASCII but also text in other scripts and languages, such as "..офіціалнов назвов Російска..". So let's assume that this is the content of the file: офіціалнов назвов Російска

Now I'd like to split the whole file content into single words and create another file that lists these words one per line.

My problem is: if I put these single words into a HashMap and read the values back from it, the encoding is lost. This is my code:

    final StringBuffer fileData = new StringBuffer(1000);
    final BufferedReader reader = new BufferedReader(
            new FileReader("fileIn.txt"));

    char[] buf = new char[1024];
    int numRead = 0;
    while ((numRead = reader.read(buf)) != -1)
    {
        final String readData = String.valueOf(buf, 0, numRead);
        fileData.append(readData);
        buf = new char[1024];
    }
    reader.close();
    String mergedContent = fileData.toString();


    mergedContent = mergedContent.replaceAll("\\<.*?>", " ");
    mergedContent = mergedContent.replaceAll("\\r\\n|\\r|\\n", " ");

    final BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(
                    new FileOutputStream("fileOut.txt")));

    final HashMap<String, String> wordsMap = new HashMap<String, String>();

    final String test[] = mergedContent.split(" ");


    for (final String string : test)
    {

        wordsMap.put(string, string);
    }

    for (final String string : wordsMap.values())
    {
        out.write(string + "\n");
    }


    out.close();

This snippet destroys the encoding. The funny thing is: if I don't put the values into the HashMap but write them directly to the output file, like this:

...
        for (final String string : test)
        {
            out.write(string + "\n");
            //wordsMap.put(string, string);
        }

        //for (final String string : wordsMap.values())
        //{
        //  out.write(string + "\n");
        //}


        out.close();

...then it works like I expect.

What am I doing wrong?

Upvotes: 3

Views: 3914

Answers (1)

Bozho

Reputation: 596996

Try using new InputStreamReader(new FileInputStream(file), "UTF-8") for reading, and do the same for the output with new OutputStreamWriter(new FileOutputStream(file), "UTF-8"). Also make sure your file is actually encoded in UTF-8.

The HashMap can't possibly do anything to the encoding.
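For illustration, here is a minimal sketch of what reading and writing with an explicit charset could look like. FileReader and an OutputStreamWriter created without a charset both fall back to the platform default, so naming UTF-8 on both sides removes that dependency. The class name Utf8WordList, the helper methods, and the temp files in main are made up for this example. It also uses StandardCharsets.UTF_8 (Java 7+) instead of the "UTF-8" string, and a LinkedHashSet in place of the HashMap<String, String>, since key and value were the same string anyway; both are my substitutions, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
public class Utf8WordList {

    // Reads the whole file, decoding bytes as UTF-8 instead of the platform default.
    public static String readUtf8(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                sb.append(buf, 0, numRead);
            }
        }
        return sb.toString();
    }

    // Writes one entry per line, encoding characters as UTF-8.
    public static void writeUtf8Lines(String path, Iterable<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write("\n");
            }
        }
    }

    // Splits on whitespace and keeps each distinct word once, in first-seen order.
    public static Set<String> uniqueWords(String content) {
        Set<String> words = new LinkedHashSet<>();
        for (String word : content.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Temp files stand in for fileIn.txt / fileOut.txt so the sketch runs on its own.
        File in = File.createTempFile("fileIn", ".txt");
        File out = File.createTempFile("fileOut", ".txt");
        writeUtf8Lines(in.getPath(), Collections.singletonList("офіціалнов назвов Російска"));

        Set<String> words = uniqueWords(readUtf8(in.getPath()));
        writeUtf8Lines(out.getPath(), words);
        System.out.println("Wrote " + words.size() + " words to " + out.getPath());
    }
}
```

Going through a Set or a Map here only changes which words are written and in what order. The characters themselves are decided by the charset given to the reader and the writer.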

Upvotes: 10
