Emil

Reputation: 1722

How to input a series/list consisting of different tokens in a Gensim Dictionary?

I have a pandas dataframe with one column of conversational data. I preprocessed it in the following way:

from gensim.utils import simple_preprocess

def preprocessing(text):
    return [word for word in simple_preprocess(str(text), min_len=2, deacc=True) if word not in stop_words]

dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis = 1)

To make it one-dimensional I used (both):

processed_docs = data['preprocessed']

as well as:

processed_docs = data['preprocessed'].tolist()

Which now looks as follows:

>>> processed_docs[:2]
0    ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1    ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...

For both cases, I used:

dictionary = gensim.corpora.Dictionary(processed_docs)     

However, in both cases I got the error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

How can I modify my data, so that I don't get this TypeError?



Given that similar questions have been asked before, I've considered:

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Based on the first answer, I tried the solution of:

dictionary = gensim.corpora.Dictionary([processed_docs.split()])

And got the error(s):

AttributeError: 'Series'('List') object has no attribute 'split'

And in the second answer someone says that the input needs to be tokens, which already holds for me.

Furthermore, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach as I described above, which does not work either.
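
One way to narrow this down is to check what type each document actually is. The quoted items in the output above hint that each row may be a string that merely looks like a list. Below is a minimal sketch using only plain Python; `find_string_docs` and `sample_docs` are hypothetical names made up for illustration, not part of Gensim or pandas.

```python
def find_string_docs(docs):
    """Return positions of documents that are plain strings instead of token lists."""
    return [i for i, doc in enumerate(docs) if isinstance(doc, str)]


# Hypothetical sample: the first entry is a real token list,
# the second is the string form of a list.
sample_docs = [
    ['klinkt', 'alsof', 'zwaar'],
    "['waar', 'liefst', 'meedenk']",
]

print(find_string_docs(sample_docs))  # [1]
```

If this returns any positions for the real data, those documents would need to be converted to lists before being passed to the dictionary.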

Upvotes: 0

Views: 1011

Answers (2)

Djensonsan

Reputation: 21

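This question was posted a long time ago, but for anyone still wondering: the quoted items in the output above suggest that each row is a string that looks like a list rather than an actual list, hence the TypeError. This typically happens when a list column is saved to CSV and read back, because pandas then loads the values as strings. One way of interpreting such a string as a list is using: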

from ast import literal_eval

And then:

dictionary = gensim.corpora.Dictionary()
for doc in processed_docs:
  dictionary.add_documents([literal_eval(doc)])
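
As a rough sketch of the same idea, the conversion can also be done up front so that every document is a real list before anything is handed to Gensim. The helper name `to_token_list` and the sample data are assumptions for illustration; only `ast.literal_eval` comes from the answer above.

```python
from ast import literal_eval


def to_token_list(doc):
    """Return doc as a list of tokens, parsing it first if it is a string."""
    if isinstance(doc, str):
        # literal_eval raises ValueError or SyntaxError on malformed input
        doc = literal_eval(doc)
    return list(doc)


# Hypothetical sample: one stringified list and one real list.
raw_docs = ["['klinkt', 'alsof', 'zwaar']", ['waar', 'liefst']]
token_docs = [to_token_list(doc) for doc in raw_docs]

print(token_docs)
# [['klinkt', 'alsof', 'zwaar'], ['waar', 'liefst']]
```

The resulting list of lists could then be passed in a single call, for example gensim.corpora.Dictionary(token_docs), instead of adding documents one at a time.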

Upvotes: 2

Sara

Reputation: 1202

I think you need:

dictionary = gensim.corpora.Dictionary([processed_docs[:]])

The [:] slice selects every element of processed_docs. You can also write [2:] to start at index 2 and go to the end, [:7] to start at 0 and stop before index 7, or [2:7] for the range in between. [:len(processed_docs)] gives the same result as [:].
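
For reference, here is a small sketch of what these slices do on a plain Python list. The sample data is hypothetical. Note that wrapping the slice in another pair of brackets produces an outer list with a single element, so everything inside it would be seen as one document.

```python
# Hypothetical sample: three already-tokenized documents.
processed_docs = [['klinkt', 'alsof'], ['waar', 'liefst'], ['zwaar']]

# [:] makes a shallow copy containing the same elements.
full_copy = processed_docs[:]
print(full_copy == processed_docs)  # True

# The end index of a slice is exclusive.
print(processed_docs[1:])  # [['waar', 'liefst'], ['zwaar']]
print(processed_docs[:2])  # [['klinkt', 'alsof'], ['waar', 'liefst']]

# Wrapping the slice in a list gives an outer list of length 1.
wrapped = [processed_docs[:]]
print(len(wrapped))  # 1
```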

I hope this helps :)

Upvotes: 1
