Reputation: 167
I have a dataframe like this:
index terms
1345 ['jays', 'place', 'great', 'subway']
1543 ['described', 'communicative', 'friendly']
9874 ['great', 'sarahs', 'apartament', 'back']
2456 ['great', 'sarahs', 'apartament', 'back']
I am trying to create a dictionary from the corpus in comments['terms'], but I get an error:
from gensim import corpora, models
dictionary = corpora.Dictionary( comments['terms'] )
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
Upvotes: 0
Views: 5348
Reputation: 111
Each row's terms need to be in their own sublist, and all of those sublists need to be nested within one larger list:
from gensim import corpora

# One inner list of tokens per document (row)
theterms = [
    ['jays', 'place', 'great', 'subway'],
    ['described', 'communicative', 'friendly'],
    ['great', 'sarahs', 'apartament', 'back'],
    ['great', 'sarahs', 'apartament', 'back'],
]
dictionary = corpora.Dictionary(theterms)
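To see which rows break that shape, a quick check like the one below may help. It is only a sketch: find_string_docs is a made-up helper name, and the sample data assumes that one cell holds the text of a list rather than a real list (which can happen when the column is read back from a CSV file). A document that is a plain string is what the "not a single string" part of the error points to.

```python
def find_string_docs(docs):
    """Return the positions of documents that are a single string, not a list of tokens."""
    return [i for i, doc in enumerate(docs) if isinstance(doc, str)]


# Hypothetical sample: the second entry is the *text* of a list, not a list.
docs = [
    ['jays', 'place', 'great', 'subway'],
    "['described', 'communicative', 'friendly']",
]

# Positions of the entries that would need fixing before building the dictionary.
bad_rows = find_string_docs(docs)
```

If bad_rows comes back empty, every document is already a sequence rather than a string and the problem is probably somewhere else.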
Upvotes: 1
Reputation: 304
First convert comments['terms'] to a list with comments['terms'].tolist(), then build the dictionary from that list; it should work. You can also do other preprocessing, such as stemming or stopword removal, before creating the dictionary.
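If the cells turn out to be strings that only look like lists, converting the column to a list will not be enough on its own. Here is a minimal sketch of one way to handle that, assuming each such string is a valid Python list literal; to_token_lists is a hypothetical helper, and raw stands in for the result of comments['terms'].tolist().

```python
import ast


def to_token_lists(values):
    """Turn each value into a list of tokens, parsing list-like strings first."""
    docs = []
    for value in values:
        if isinstance(value, str):
            # Assumes the string is a list literal such as "['a', 'b']";
            # ast.literal_eval raises an error for anything else.
            value = ast.literal_eval(value)
        docs.append(list(value))
    return docs


# Stand-in for comments['terms'].tolist(): one string cell, one real list.
raw = [
    "['jays', 'place', 'great', 'subway']",
    ['described', 'communicative', 'friendly'],
]
docs = to_token_lists(raw)
```

The resulting docs is a list of token lists, which is the shape corpora.Dictionary expects.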
Upvotes: 0