CountVectorizer with foreign symbols gives swapped key-values in vocabulary dictionary

Question

I'm using a CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
series = pd.Series(["abc", "aaa"])
CountVectorizer(analyzer='char').fit(series).vocabulary_

This results in a vocabulary with the letters as keys and the index in the vocabulary as values:

{'a': 0, 'b': 1, 'c': 2}

Now let's add some non-Latin (Arabic?) characters:

series = pd.Series(["d'ا'ر'م'ی'ن'abc", "aaa"])
CountVectorizer(analyzer='char').fit(series).vocabulary_

{'d': 4,
 "'": 0,
 'ا': 5,
 'ر': 6,
 'م': 7,
 'ی': 9,
 'ن': 8,
 'a': 1,
 'b': 2,
 'c': 3}

Notice that for the non-Latin characters the key and the value appear swapped: the index seems to come before the character. What's happening? Is it because some languages are read from right to left? Or is this part of the behaviour of Python dictionaries?
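One way to tell whether the entries are really swapped or only displayed that way is to look at the types and at an escaped form of each key. The sketch below is a minimal stand-in, not scikit-learn's code: it assumes the vocabulary simply maps each distinct character to its position in sorted order, which reproduces the mapping shown above, and it uses only the standard library. The names `documents`, `characters` and `rtl_keys` are illustrative.

```python
import unicodedata

# Stand-in for CountVectorizer(analyzer='char') (an assumption for illustration):
# map each distinct character to its position in sorted order.
documents = ["d'ا'ر'م'ی'ن'abc", "aaa"]
characters = sorted(set("".join(documents)))
vocabulary = {ch: index for index, ch in enumerate(characters)}

# Characters whose Unicode bidirectional class is right-to-left ("R" or "AL").
rtl_keys = [ch for ch in vocabulary if unicodedata.bidirectional(ch) in ("R", "AL")]

for ch, index in vocabulary.items():
    # ascii() prints an escape such as '\u0627', which is rendered left to right,
    # so the key stays visibly on the left and the index on the right.
    print(ascii(ch), type(ch).__name__, "->", index, type(index).__name__)
```

If every key prints as `str` and every value as `int`, the dictionary itself is not swapped, and the apparent swap most likely comes from how the terminal or browser lays out right-to-left characters next to digits.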
