Dian Basit

Reputation: 43

Unhashable type: 'Series' in Pandas using DataFrame.query

I'm trying to do a Twitter sentiment analysis comparing Johnny Depp and Amber Heard. I've extracted tweets from 2021, and the pandas DataFrames for both individuals are stored in the df_dict dictionary shown below. The error I am receiving is TypeError: unhashable type: 'Series'.

As far as I've learned, this error happens when you have a dictionary that does not have a list or anything. I first tested it with a single key, but I got the same error. I've hit a roadblock and don't know how to solve this issue.

This is my preprocess function:

import re

def preprocess(df_dict, remove_rows, keep_rows):
  for key, df in df_dict.items():
    print(key)
    initial_count = len(df_dict[key])
    df_dict[key] = (
      df
      # Make everything lower case
      .assign(Text=lambda x: x['Text'].str.lower())
      # Keep the rows that mention name 
      .query(f'Text.str.contains("{keep_rows[key]}")')
      # Remove the rows that mentioned the other three people.
      .query(f'~Text.str.contains("{remove_rows[key]}")')
      # Remove all the URLs
      .assign(Text=lambda x:x['Text'].apply(lambda s: re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', s)))
    )
    final_count = len(df_dict[key])
    print("%d tweets kept out of %d" % (final_count, initial_count))

  return df_dict

This is the code I'm using to call preprocess:

df_dict = {
    'johnny depp': johnny_data,
    "amber heard": amber_data
}

remove_rows = {
    'johnny depp': 'amber|heard|camila|vasquez|shannon|curry',
    "amber heard": 'johnny|depp|camila|vasquez|shannon|curry'
}

keep_rows = {
    'johnny depp': 'johnny|depp',
    "amber heard": 'amber|heard'
}

df_test_data = preprocess(df_dict, remove_rows, keep_rows)

I hope I've explained my issue clearly. This is my first post here, so I also hope I've followed the usual posting guidelines.

I am attaching the error message I received (screenshots: Code error, Error part 1, Error part 2).

The code is available here: Colab link

Upvotes: 2

Views: 5197

Answers (1)

Parfait

Reputation: 107587

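DataFrame.query is really meant for simple logical expressions, so you cannot rely on calling Series methods such as .str.contains on columns inside the query string. As a workaround, use assign to create boolean flag columns and then query against those. Consider also Series.str.replace for the regex cleanup instead of apply with re.sub.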

df_dict[key] = (
    df
    # Make everything lower case, then flag the rows to keep and to drop
    .assign(
        Text = lambda x: x['Text'].str.lower(),
        keep_flag = lambda x: x['Text'].str.contains(keep_rows[key]),
        drop_flag = lambda x: x['Text'].str.contains(remove_rows[key])
    )
    # Keep the rows that mention name 
    .query("keep_flag == True")
    # Remove the rows that mentioned the other three people.
    .query("drop_flag == False")
    # Remove all the URLs
    .assign(
        Text = lambda x: x['Text'].str.replace(
            r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', 
            '', 
            regex=True
        )
    )
    .drop(["keep_flag", "drop_flag"], axis="columns")
)
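
To try the idea without the full Twitter data, here is a minimal, self-contained sketch on a toy DataFrame. The sample tweets and the helper name filter_tweets are made up for illustration and are not part of the original code. It uses plain boolean masks rather than query, which avoids the question of what query can evaluate; na=False is an added assumption so that missing text counts as "no match".

```python
import pandas as pd

# Same URL pattern as in the question
URL_PATTERN = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'

def filter_tweets(df, keep_pattern, remove_pattern):
    """Hypothetical helper: lower-case Text, filter rows by regex, strip URLs."""
    text = df['Text'].str.lower()
    # Boolean masks computed directly on the Series, no query string needed
    keep = text.str.contains(keep_pattern, regex=True, na=False)
    drop = text.str.contains(remove_pattern, regex=True, na=False)
    return (
        df
        .assign(Text=text)
        .loc[keep & ~drop]
        .assign(Text=lambda x: x['Text'].str.replace(URL_PATTERN, '', regex=True))
    )

# Made-up sample data, for illustration only
sample = pd.DataFrame({'Text': [
    'Johnny Depp wins https://example.com/news',
    'Amber Heard and Johnny Depp in court',
    'Unrelated tweet',
]})

result = filter_tweets(
    sample,
    keep_pattern='johnny|depp',
    remove_pattern='amber|heard|camila|vasquez|shannon|curry',
)
print(result)
```

Only the first sample row should survive: the second mentions both people and the third mentions neither. Depending on your pandas version, passing engine="python" to query may also let the original .str.contains expressions run, but that is worth testing on your setup rather than assuming.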

Upvotes: 1
