Guido

Reputation: 6732

CountVectorizer with foreign symbols gives swapped key-values in vocabulary dictionary

I'm using a CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
series = pd.Series(["abc", "aaa"])
CountVectorizer(analyzer='char').fit(series).vocabulary_

This results in a vocabulary with the letters as keys and the index in the vocabulary as values:

{'a': 0, 'b': 1, 'c': 2}

Now let's add some foreign (Arabic?) characters:

series = pd.Series(["d'ا'ر'م'ی'ن'abc", "aaa"])
CountVectorizer(analyzer='char').fit(series).vocabulary_

{'d': 4,
 "'": 0,
 'ا': 5,
 'ر': 6,
 'م': 7,
 'ی': 9,
 'ن': 8,
 'a': 1,
 'b': 2,
 'c': 3}

Notice how the keys and values appear swapped for the foreign characters: the index is shown first and the character second. What's happening? Is it because some languages are read from right to left? Or is this part of the behaviour of Python dictionaries?

Upvotes: 2

Views: 178

Answers (1)

Claudio P

Reputation: 2203

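The keys and values are not actually swapped. It is just a visual "bug" when printing the dictionary: Arabic letters are right-to-left characters, so the terminal or browser reorders the quotes, colon, and digit next to them for display. The stored data is unchanged.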

When you define a dictionary like this:

vocab = {'d': 4,
 "'": 0,
 'ا': 5,
 'ر': 6,
 'م': 7,
 'ی': 9,
 'ن': 8,
 'a': 1,
 'b': 2,
 'c': 3}

You can still look up the value of any entry with the corresponding key:

vocab['م']

Which gives you the expected result of:

7
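
If you want to confirm this without relying on how the text is rendered, one option is to print each key in escaped form and check the types directly. The sketch below is only an illustration: it rebuilds the vocabulary by hand instead of calling CountVectorizer, and the name `vocab` is my own choice.

```python
import unicodedata

# Hand-written copy of the vocabulary from the question (not produced by sklearn here).
vocab = {'d': 4, "'": 0, 'ا': 5, 'ر': 6, 'م': 7, 'ی': 9, 'ن': 8,
         'a': 1, 'b': 2, 'c': 3}

# Every key is a string and every value an integer, so nothing is swapped.
types_ok = all(isinstance(k, str) and isinstance(v, int) for k, v in vocab.items())
print(types_ok)

# ascii() escapes non-ASCII characters, so the display has nothing to reorder.
# unicodedata.bidirectional() reports each character's Unicode direction class.
for key, index in sorted(vocab.items(), key=lambda item: item[1]):
    print(index, ascii(key), unicodedata.bidirectional(key))
```

The Arabic-script keys report a right-to-left class, while the Latin letters report a left-to-right one, which is what triggers the reordering on screen.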

Upvotes: 1
