Aswin

Reputation: 559

Java: toLowerCase messes up the Unicode symbols

My Code:

// Read the Turkish file contents into the variable currentLine
currentLine = currentLine+"\n\n"+currentLine.toLowerCase();
// Write the contents to a new file

Output:

Yukar mavi gök asağı yağız yer yaratıldık iki arası in oğlu yaratılmış İnsan oğulları üzer ecdadı Bumın haka İste haka tah oturmuş oturarak Türk millet ülke türe idar edivermiş tanz edivermis Dört taraf hep düşman imiş Asker sevk edip dört taraf kavmi hep itaa altına almış hep muti kılmış Başlı baş eğdirmiş dizli diz çöktürmüş

yukar mavi g�k asa��� ya���z yer yarat�ld�k iki aras� in o��lu yarat�lm��� �nsan o��ullar� �zer ecdad� bum�n haka �ste haka tah oturmu�� oturarak t�rk millet �lke t�re idar edivermi�� tanz edivermis d�rt taraf hep d���man imi�� asker sevk edip d�rt taraf kavmi hep itaa alt�na alm��� hep muti k�lm��� ba��l� ba�� e��dirmi�� dizli diz ��kt�rm���

I tried toLowerCase(Locale.getDefault()) and toLowerCase(Locale.ROOT), but I still get the same output. Why is the method returning invalid symbols?

Thanks.

Upvotes: 1

Views: 1635

Answers (2)

Thilo

Reputation: 262724

I think the problem comes from not declaring the character encoding when reading and writing the file. In this case Java assumes your platform default character set, which may not be appropriate.

If unsure, use UTF-8, which also covers Turkish (of course, it needs to match the encoding of the file you actually read from).

You may also have to specify the Turkish Locale when calling toLowerCase, since the exact rules can depend on the language the text is in (I'm not familiar with Turkish; it may already work with the defaults).

"But then how is it that half of the file has proper encoding?"

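The first half contains the same characters that you read in. No computation was done on them, so the bytes pass through unchanged, and that can work even with the wrong encoding. To convert to lowercase, however, Java has to interpret the characters, and for that it needs to know the proper encoding.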
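A minimal sketch of that effect, assuming the file is UTF-8 and the platform default behaves like ISO-8859-1 (both are assumptions; the actual default on the asker's machine is not known). The class and method names are made up for illustration. Decoding and re-encoding with the same wrong single-byte charset returns the original bytes, while lowercasing in between changes one of the bytes that made up a multi-byte character:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

public class WrongCharsetDemo {
    // Decodes UTF-8 bytes with the wrong charset (ISO-8859-1, assumed here),
    // optionally lowercases the misread text, and re-encodes it the same way.
    static byte[] roundTrip(byte[] utf8Bytes, boolean lowerCase) {
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        if (lowerCase) {
            misread = misread.toLowerCase(Locale.ROOT);
        }
        return misread.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "g\u00f6k" is the word "gok" with o-umlaut; in UTF-8 it is 67 C3 B6 6B.
        byte[] original = "g\u00f6k".getBytes(StandardCharsets.UTF_8);

        // Untouched text survives: every byte maps to one char and back again.
        System.out.println(Arrays.equals(original, roundTrip(original, false))); // true

        // Lowercasing turns the misread char U+00C3 into U+00E3, so byte C3
        // becomes E3 and the sequence is no longer valid UTF-8.
        System.out.println(Arrays.equals(original, roundTrip(original, true)));  // false
    }
}
```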

"Now the strange characters have vanished. New '?' characters appeared all over the output."

Half-way there. Now that you specified the input character set on your Reader, Java can understand your Turkish characters. But it still cannot output them, so it replaces them with "?". You also need to set the output character set on your Writer.
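As a rough sketch of setting the character set on both sides, assuming the input file is UTF-8 and Java 7 or later is available: the class name, method name, and the choice of the Turkish locale are illustrative, not taken from the asker's code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class LowerCaseCopy {
    // Copies each line of the input followed by its lowercased form.
    // Both the Reader and the Writer get an explicit charset (UTF-8 assumed).
    static void copyWithLowerCase(Path in, Path out, Locale locale) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
                writer.write(line.toLowerCase(locale));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input file, args[1] = output file
        copyWithLowerCase(Paths.get(args[0]), Paths.get(args[1]),
                Locale.forLanguageTag("tr-TR"));
    }
}
```

If the file is in another encoding (for example ISO-8859-9, which is common for older Turkish text), the charset passed to the Reader would have to be changed to match.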

Upvotes: 3

Santosh

Reputation: 17923

I think you will need to pass Locale info to the toLowerCase() method. Here is an example in Java official documentation with Turkish as an example. Without Locale info, the toLowerCase() method will use the default locale.

Here is how to create a Turkish Locale (forLanguageTag expects the hyphenated BCP 47 form):

Locale trlocale = Locale.forLanguageTag("tr-TR");
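To illustrate why the locale matters for Turkish, here is a small sketch comparing the Turkish locale with Locale.ROOT on the two capital I letters. The class and method names are hypothetical, and the tag is written in the hyphenated form tr-TR. The characters are given as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.Locale;

public class TurkishLowerCase {
    // Turkish locale built from a BCP 47 language tag (hyphen, not underscore).
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String plainUpperI = "I";        // U+0049
        String dottedUpperI = "\u0130";  // capital I with dot above

        // Turkish rules: I becomes dotless i (U+0131), dotted capital I becomes i.
        System.out.println(lowerTurkish(plainUpperI).equals("\u0131")); // true
        System.out.println(lowerTurkish(dottedUpperI).equals("i"));     // true

        // Language-neutral rules: I becomes i, and dotted capital I becomes
        // i followed by a combining dot above (U+0307).
        System.out.println(lowerRoot(plainUpperI).equals("i"));         // true
        System.out.println(lowerRoot(dottedUpperI).equals("i\u0307"));  // true
    }
}
```

Note that the locale only changes these casing rules. It does not repair text that was decoded with the wrong character set.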

Upvotes: 1
