How to input a series/list consisting of different tokens in a Gensim Dictionary?

Question

I have a pandas dataframe that has one column with conversational data. I preprocessed it in the following way:

from gensim.utils import simple_preprocess

def preprocessing(text):
    return [word for word in simple_preprocess(str(text), min_len=2, deacc=True) if word not in stop_words]

dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis = 1)

To make it one-dimensional, I tried both:

processed_docs = dataset['preprocessed']

and:

processed_docs = dataset['preprocessed'].tolist()

The result looks as follows:

>>> processed_docs[:2]
0    ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1    ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...

For both cases, I used:

dictionary = gensim.corpora.Dictionary(processed_docs)

However, in both cases I got the error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

How can I modify my data, so that I don't get this TypeError?

Given that similar questions have been asked before, I've considered:

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Based on the first answer there, I tried:

dictionary = gensim.corpora.Dictionary([processed_docs.split()])

And got these errors (for the Series and the list, respectively):

AttributeError: 'Series' object has no attribute 'split'
AttributeError: 'list' object has no attribute 'split'

The second answer says that the input needs to be tokens, which is already the case for my data.

Furthermore, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach as I described above, which does not work either.

Sara · Accepted Answer

I think you need:

dictionary = gensim.corpora.Dictionary([processed_docs[:]])

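The outer brackets pass the whole collection to Dictionary as a single document, and the slice [:] selects every element of it. You can write [2:] to start at index 2 and go to the end, [:7] to go from index 0 up to (but not including) index 7, or [2:7] to combine both. [:len(processed_docs)] is equivalent to [:].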
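
If you would rather have one dictionary entry per word, note that the quotes in the output above suggest each cell is the string form of a list (for example after saving to and reloading from CSV). That is only an assumption; the question does not confirm it. A minimal sketch under that assumption, where to_token_lists is a hypothetical helper and the sample data is made up:

```python
import ast

def to_token_lists(docs):
    """Return a list of token lists, parsing cells that are stringified lists."""
    token_lists = []
    for doc in docs:
        if isinstance(doc, str):
            # A cell such as "['klinkt', 'alsof']" is a string, not a list;
            # literal_eval turns it back into a real Python list.
            doc = ast.literal_eval(doc)
        token_lists.append(list(doc))
    return token_lists

# Made-up sample mimicking the shape of the output shown in the question.
sample = ["['klinkt', 'alsof', 'zwaar']", "['waar', 'liefst', 'meedenk']"]
docs = to_token_lists(sample)

# With gensim installed, docs can then be passed without extra brackets:
# import gensim
# dictionary = gensim.corpora.Dictionary(docs)
```

The helper accepts a Series as well as a list, since it only iterates over the values. Cells that are already lists pass through unchanged, and a string that is not a valid Python literal makes ast.literal_eval raise an error instead of being silently accepted.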

I hope this helps :)
