Mateusz Jagiełło

Reputation: 7154

Hunspell, unmunch - dump whole dictionary, encoding error

I'd like to dump Hunspell's pl_PL dictionary.

I found a solution: unmunch /usr/share/hunspell/pl_PL.dic /usr/share/hunspell/pl_PL.aff

But there's a problem with the encoding.

Part of the output:

ambasadorowaniom
ambasadorowaniach
ambasadorowa�
ambasadoruj�cy
ambasadoruj�cym

I've also tried filtering the output with iconv, but that didn't solve the problem:

   affix: z�c� 4, strip: �� 2
   affix: z�ce 4, strip: �� 2
   affix: z�cej 5, strip: �� 2
stable 50 num is 470 flag G
parsing line: MAP 8
parsing line: MAP a�
parsing line: MAP c�

How can I solve this problem?

Upvotes: 4

Views: 1364

Answers (2)

Kamil Sołtysik

Reputation: 31

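iconv does solve the problem. The dictionary file seems to be encoded in ISO-8859-2 (Latin-2), so the output has to be converted to UTF-8. The 2>/dev/null discards unmunch's diagnostic messages (the "affix:" and "parsing line:" lines), which appear to go to stderr and so never pass through the pipe to iconv: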

unmunch pl_PL.dic pl_PL.aff 2>/dev/null | iconv -f iso-8859-2 -t utf8
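
If you don't want to hard-code the encoding, here is a minimal sketch that reads it from the .aff file instead. It assumes the .aff file declares its encoding on a SET line (the Hunspell affix format has such a line) and that your iconv accepts the name written there; neither is guaranteed for every dictionary. aff_encoding and dump_dictionary are hypothetical helper names, not part of Hunspell.

```shell
#!/bin/sh

# aff_encoding (hypothetical helper): print the encoding named on the
# first SET line of a Hunspell .aff file, for example "ISO8859-2".
aff_encoding() {
    # Drop carriage returns in case the file has DOS line endings,
    # then print the second field of the first line that starts with SET.
    tr -d '\r' < "$1" | awk '$1 == "SET" { print $2; exit }'
}

# dump_dictionary (hypothetical helper): expand every word form and
# convert the list from the dictionary's own encoding to UTF-8.
dump_dictionary() {
    dic=$1
    aff=$2
    enc=$(aff_encoding "$aff")
    # Assumption: fall back to ISO-8859-2 when no SET line is found.
    # stderr is discarded so that only the word list reaches iconv.
    unmunch "$dic" "$aff" 2>/dev/null | iconv -f "${enc:-ISO-8859-2}" -t UTF-8
}

# Example call (paths taken from the question):
# dump_dictionary /usr/share/hunspell/pl_PL.dic /usr/share/hunspell/pl_PL.aff > pl_PL.txt
```

If the SET line names an encoding your iconv does not know, pass the name by hand as in the one-liner above.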

Upvotes: 2

szszsz

Reputation: 31

Short version: It may be a problem with your console or terminal. Try another one, such as xterm.

Longer version: That is strange, the output should be UTF-8. Are you sure the cause is not a console or terminal that lacks UTF-8 support? Check the result in any UTF-8-capable graphical editor, and check your locale settings.

Disclaimer: I want to help, but with 1 reputation point I cannot comment, request clarification, or message the user, so I have to post this as an answer to keep it from being deleted.
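
One way to tell a terminal problem from an encoding problem is to test the bytes themselves rather than how they are displayed. This is only a sketch, assuming iconv is installed; is_valid_utf8 is a hypothetical helper name. It relies on iconv exiting with a non-zero status when its input is not valid in the source encoding.

```shell
#!/bin/sh

# is_valid_utf8 (hypothetical helper): succeed only if stdin decodes as UTF-8.
# Converting UTF-8 to UTF-8 changes nothing, but iconv fails on invalid bytes.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Example call (paths taken from the question):
# unmunch /usr/share/hunspell/pl_PL.dic /usr/share/hunspell/pl_PL.aff 2>/dev/null | is_valid_utf8 \
#     && echo "output is UTF-8, suspect the terminal" \
#     || echo "output is not UTF-8, convert it with iconv"
```

Running locale charmap in the same shell prints the character set your current locale expects, which is worth comparing with the result.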

Upvotes: 1
