Reputation: 438
def opinion_df_gen(preprocessor):
    op_files = {'POSITIVE': Path('clustering/resources/positive_words.txt').resolve(),
                'NEGATIVE': Path('clustering/resources/negative_words.txt').resolve()}
    df_ = pd.DataFrame()
    for i, (sentiment, filepath) in enumerate(op_files.items()):
        print(filepath)
        word_set = preprocessor.lemmatize_words(file_path=filepath)
        print(len(word_set))
        df_['tokens'] = list(word_set)
preprocessor is a custom class and preprocessor.lemmatize_words returns a set of words/tokens. The problem is that df_['tokens'] = list(word_set) throws an error.
Traceback (most recent call last):
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 184, in <module>
    main()
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 155, in main
    to_label_df, token_col, target_col = opinion_df_gen(preprocessor)
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 35, in opinion_df_gen
    df_['tokens'] = list(word_set)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\internals\construction.py", line 748, in sanitize_index
    "Length of values "
ValueError: Length of values (4085) does not match length of index (1782)
But if you look at the code, the dataframe is empty, so I don't understand why it says the index has length 1782. As far as I understand, this code should work. The pandas version is 1.1.5 and the Python version is 3.8.0.
Upvotes: 0
Views: 80
Reputation: 73
The problem is that you assign to the same column on every pass of the for loop. There are two iterations: the first for the positive words and the second for the negative words. After the first iteration, df_['tokens'] holds the positive word set, so the dataframe is no longer empty: its index now has length 1782. In the second iteration, the negative word set has 4085 entries, which does not match that index length of 1782. Hence the error.
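
Here is a minimal sketch of what is going on, plus one possible way around it. The tiny word sets stand in for the output of preprocessor.lemmatize_words, and the 'sentiment' column name is my assumption, since the rest of your function is not shown. The idea is to build one dataframe per sentiment and concatenate them, instead of overwriting the same column.

```python
import pandas as pd

# Hypothetical stand-ins for the two lemmatized word sets (not the real data).
word_sets = {
    'POSITIVE': {'good', 'great'},
    'NEGATIVE': {'bad', 'poor', 'awful'},
}

# Reproduce the error: the first assignment gives the empty frame an index
# of length 2, so assigning 3 values in the second pass no longer fits.
df_ = pd.DataFrame()
error_message = None
try:
    for sentiment, words in word_sets.items():
        df_['tokens'] = list(words)
except ValueError as exc:
    error_message = str(exc)

# One possible fix: one frame per sentiment, then concatenate.
# The 'sentiment' column name is an assumption made for this sketch.
frames = [
    pd.DataFrame({'tokens': sorted(words), 'sentiment': sentiment})
    for sentiment, words in word_sets.items()
]
labelled_df = pd.concat(frames, ignore_index=True)
```

With this layout each row keeps its label, so the two lists can have different lengths without any conflict.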
Upvotes: 1