DatamineR

Reputation: 245

Unicode encoding, .txt and Arabic (Right-to-Left) script

I wanted to create a histogram of word counts in a large sample by building a dictionary and then print the most common words with their counts, which basically means printing a few key/value pairs.

However, many of the words were not in the Latin alphabet, so I did:

       try:
           print key, word_dict[key]
       except UnicodeEncodeError:
           # key cannot be encoded for the current output; print UTF-8 bytes instead
           print key.encode('utf-8'), word_dict[key]

When the results are printed directly to the command-line interface, the non-Latin-alphabet words are just unreadable, but the key/value order is maintained.

However, when I print the results to a .txt file, the Arabic words are readable, but the key/value pairs for those words appear in reverse order: value/key. Chinese characters, however, are printed in the correct order: key/value.

So I wonder: is .txt so "smart" that it recognizes Arabic and prints in right-to-left order? And what can I do to keep the key/value order I want?

Upvotes: 1

Views: 525

Answers (1)

7stud

Reputation: 48599

When the results are printed directly to the command-line interface, the non-Latin-alphabet words are just unreadable

That could be because your terminal (or cmd window) is not set to UTF-8, which you can change in the window's settings/preferences.
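
As a rough way to check this from the program itself, here is a minimal sketch. It assumes Python 3 (the question's code is Python 2), and the helper name `printable_for` is hypothetical. It reports the encoding Python believes the terminal uses and falls back to backslash escapes for characters that encoding cannot represent:

```python
import sys


def printable_for(text, encoding):
    """Return text unchanged if `encoding` can represent it,
    otherwise a backslash-escaped version that is safe to print."""
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        escaped = text.encode(encoding, errors="backslashreplace")
        return escaped.decode(encoding)


# sys.stdout.encoding can be None in some setups; UTF-8 is assumed in that case.
terminal_encoding = sys.stdout.encoding or "utf-8"
print("terminal encoding:", terminal_encoding)

# Sample Arabic key with a made-up count, for illustration only.
print(printable_for(u"\u0643\u0644\u0645\u0629", terminal_encoding), 42)
```

If the reported encoding is not UTF-8, the text may print as escapes or as unreadable characters until the terminal settings are changed.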

However, when I print the results into a .txt file, Arabic words are readable,

The program that opens your text file has a setting that tells it to interpret the bytes saved on disk as UTF-8.
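
On the key/value order: the file itself stores the characters in the order they were written. It is the viewing program that lays them out, and most editors apply the Unicode Bidirectional Algorithm, which treats digits that directly follow Arabic letters as part of the right-to-left run. Chinese characters are left-to-right, so they are not affected. A minimal sketch of one possible workaround, assuming Python 3 and a hypothetical `word_dict` and output file name `counts.txt`: write the file as UTF-8 explicitly and put a LEFT-TO-RIGHT MARK (U+200E) between key and value.

```python
import io
from collections import Counter

LRM = u"\u200e"  # LEFT-TO-RIGHT MARK: invisible, but ends the right-to-left run


def format_pair(key, count):
    """Format one key/value line with an LRM after the key."""
    return u"{0}{1} {2}".format(key, LRM, count)


def write_counts(path, word_dict, top_n=10):
    """Write the top_n most common words and their counts as UTF-8 text."""
    most_common = Counter(word_dict).most_common(top_n)
    with io.open(path, "w", encoding="utf-8") as out:
        for key, count in most_common:
            out.write(format_pair(key, count) + u"\n")


# Hypothetical sample data: one Arabic, one Chinese, one Latin-alphabet key.
word_dict = {u"\u0643\u0644\u0645\u0629": 12, u"\u8bcd": 7, u"word": 3}
write_counts("counts.txt", word_dict)
```

The mark is an extra character in the file, so anything that later parses these lines would need to strip it. Viewers that do not implement bidirectional layout will simply ignore it.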

Upvotes: 1
