Reputation: 1722
I have a pandas DataFrame with one column of conversational data. I preprocessed it in the following way:
from gensim.utils import simple_preprocess

def preprocessing(text):
    # stop_words is defined earlier (not shown here)
    return [word for word in simple_preprocess(str(text), min_len = 2, deacc = True) if word not in stop_words]
dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis = 1)
To make it one-dimensional I tried both:
processed_docs = dataset['preprocessed']
as well as:
processed_docs = dataset['preprocessed'].tolist()
Which now looks as follows:
>>> processed_docs[:2]
0 ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1 ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...
For both cases, I used:
dictionary = gensim.corpora.Dictionary(processed_docs)
However, in both cases I got the error:
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
How can I modify my data, so that I don't get this TypeError?
Given that similar questions have been asked before, I've considered:
Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string
Based on the first answer, I tried the solution of:
dictionary = gensim.corpora.Dictionary([processed_docs.split()])
And got these errors (for the Series and the list version, respectively):
AttributeError: 'Series' object has no attribute 'split'
AttributeError: 'list' object has no attribute 'split'
And in the second answer someone says that the input needs to be tokens, which already holds for me.
Furthermore, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach as I described above, which does not work either.
Upvotes: 0
Views: 1011
Reputation: 21
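This question was posted a long time ago, but for anyone still wondering: the cells in that column are strings that only look like lists (this typically happens when a DataFrame is saved to CSV and read back in), hence the TypeError. One way of interpreting such a string as a list is using: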
from ast import literal_eval
And then:
dictionary = gensim.corpora.Dictionary()
for doc in processed_docs:
    dictionary.add_documents([literal_eval(doc)])
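A minimal sketch of the same idea using only the standard library, in case it helps to convert everything up front. The helper name to_token_lists and the sample strings are made up for illustration; the sample only mimics the shape of the cells shown in the question.

```python
from ast import literal_eval


def to_token_lists(docs):
    """Return a list of token lists, parsing any stringified lists."""
    token_lists = []
    for doc in docs:
        if isinstance(doc, str):
            # A cell such as "['waar', 'liefst']" is a string, not a list,
            # so it has to be parsed before it can be used as tokens.
            doc = literal_eval(doc)
        token_lists.append(list(doc))
    return token_lists


# Hypothetical sample cells, shaped like the output in the question.
raw_docs = ["['klinkt', 'alsof', 'zwaar']", "['waar', 'liefst']"]
token_docs = to_token_lists(raw_docs)
print(token_docs)
```

The resulting token_docs is a list of lists of strings, which should then be usable in a single call such as gensim.corpora.Dictionary(token_docs) instead of the loop. Cells that are already lists pass through unchanged.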
Upvotes: 2
Reputation: 1202
I think you need:
dictionary = gensim.corpora.Dictionary([processed_docs[:]])
The [:] slice selects every element. You can write [2:] to start at index 2 and go to the end, [:7] to start at index 0 and stop before index 7, or [2:7] to do both. You can also try [:len(processed_docs)], which selects the same elements as [:].
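For reference, here is a small sketch of how those slices behave on a plain Python list. The list contents are placeholders, not the data from the question.

```python
# Placeholder list standing in for processed_docs.
docs = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

everything = docs[:]          # shallow copy of the whole list
from_two = docs[2:]           # index 2 through the end
first_seven = docs[:7]        # indices 0 through 6 (7 is excluded)
middle = docs[2:7]            # indices 2 through 6
same = docs[:len(docs)]       # same elements as docs[:]

print(len(everything), len(from_two), len(first_seven), len(middle))
```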
I hope this helps :)
Upvotes: 1