Jihjohn

Reputation: 438

Pandas empty dataframe behaving strangely

from pathlib import Path
import pandas as pd

def opinion_df_gen(preprocessor):
    op_files = {'POSITIVE': Path('clustering/resources/positive_words.txt').resolve(),
                'NEGATIVE': Path('clustering/resources/negative_words.txt').resolve()}
    df_ = pd.DataFrame()
    for i, (sentiment, filepath) in enumerate(op_files.items()):
        print(filepath)
        word_set = preprocessor.lemmatize_words(file_path=filepath)
        print(len(word_set))
        df_['tokens'] = list(word_set)

preprocessor is an instance of a custom class, and preprocessor.lemmatize_words returns a set of words/tokens. The problem is that df_['tokens'] = list(word_set) throws an error:

Traceback (most recent call last):
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 184, in <module>
    main()
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 155, in main
    to_label_df, token_col, target_col = opinion_df_gen(preprocessor)
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 35, in opinion_df_gen
    df_['tokens'] = list(word_set)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\internals\construction.py", line 748, in sanitize_index
    "Length of values "
ValueError: Length of values (4085) does not match length of index (1782)

But as you can see in the code, the dataframe is empty. I don't understand why it says the index has length 1782. As far as I understand, this code should work. The pandas version is 1.1.5 and the Python version is 3.8.0.

Upvotes: 0

Views: 80

Answers (1)

SuperUser

Reputation: 73

The problem is that you assign to the same column on every pass through the for loop. There are two iterations: the first for the positive words and the second for the negative words. After the first iteration, df_ is no longer empty: df_['tokens'] holds the positive word set, and because the dataframe was empty when you assigned it, its index took on the length of that list. Going by the error message, that length is 1782. In the second iteration, list(word_set) for the negative words has 4085 entries, which does not match the existing index of length 1782. Hence the ValueError.
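Here is a minimal sketch of that behaviour and one possible way around it. The two small word sets are made-up stand-ins for whatever preprocessor.lemmatize_words returns, and the 'sentiment' column name is an assumption, since the question does not show how the label column is built. The idea is to build one dataframe per sentiment and concatenate them, rather than overwriting the same column.

```python
import pandas as pd

# Hypothetical stand-ins for the sets returned by preprocessor.lemmatize_words
word_sets = {
    'POSITIVE': {'good', 'great', 'nice'},
    'NEGATIVE': {'bad', 'awful', 'poor', 'ugly', 'nasty'},
}

# Reproduce the problem: the first assignment fixes the index length
df_ = pd.DataFrame()
df_['tokens'] = sorted(word_sets['POSITIVE'])
index_len_after_first = len(df_.index)  # df_ is no longer empty here

try:
    # Second assignment has a different length, so pandas refuses it
    df_['tokens'] = sorted(word_sets['NEGATIVE'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# One possible fix: one frame per sentiment, then concatenate.
# 'sentiment' is an assumed column name, broadcast to every row of its frame.
frames = [
    pd.DataFrame({'tokens': sorted(words), 'sentiment': sentiment})
    for sentiment, words in word_sets.items()
]
combined = pd.concat(frames, ignore_index=True)
```

With this approach each word keeps its label, and the two lists can have different lengths because they become rows of separate frames before being stacked.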

Upvotes: 1
