Reputation: 43
I'm trying to do a Twitter sentiment analysis comparing Johnny Depp and Amber Heard. I extracted tweets from 2021, and the pandas DataFrames for both individuals are stored in the df_dict dictionary shown below. The error I am receiving is unhashable type: 'Series'.
From what I've learnt, this error happens when you have a dictionary that does not have a list or anything similar. I first tested it with a single key, but I got the same error. I've hit a roadblock and don't know how to solve this issue.
This is my preprocess method:
def preprocess(df_dict, remove_rows, keep_rows):
    for key, df in df_dict.items():
        print(key)
        initial_count = len(df_dict[key])
        df_dict[key] = (
            df
            # Make everything lower case
            .assign(Text=lambda x: x['Text'].str.lower())
            # Keep the rows that mention name
            .query(f'Text.str.contains("{keep_rows[key]}")')
            # Remove the rows that mentioned the other three people.
            .query(f'~Text.str.contains("{remove_rows[key]}")')
            # Remove all the URLs
            .assign(Text=lambda x: x['Text'].apply(lambda s: re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', s)))
        )
        final_count = len(df_dict[key])
        print("%d tweets kept out of %d" % (final_count, initial_count))
    return df_dict
This is the code I'm using to call the preprocess method:
df_dict = {
    'johnny depp': johnny_data,
    'amber heard': amber_data
}
remove_rows = {
    'johnny depp': 'amber|heard|camila|vasquez|shannon|curry',
    'amber heard': 'johnny|depp|camila|vasquez|shannon|curry'
}
keep_rows = {
    'johnny depp': 'johnny|depp',
    'amber heard': 'amber|heard'
}
df_test_data = preprocess(df_dict, remove_rows, keep_rows)
I hope I've explained my issue clearly. This is my first post here, so I also hope I've followed the usual posting conventions.
I am attaching the error message I received: Code error, Error part 1, Error part 2
The link to the code is down below: Colab link
Upvotes: 2
Views: 5197
Reputation: 107587
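Since DataFrame.query is really meant for simple logical operations, calling Series methods such as .str.contains inside the query string is unreliable and, depending on the expression engine in use, can raise this error. As a workaround, use assign to create boolean flag columns and then query against those flags. Consider also Series.str.replace for the regex cleanup.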
df_dict[key] = (
    df
    # Make everything lower case, then flag rows to keep or drop
    .assign(
        Text=lambda x: x['Text'].str.lower(),
        keep_flag=lambda x: x['Text'].str.contains(keep_rows[key]),
        drop_flag=lambda x: x['Text'].str.contains(remove_rows[key])
    )
    # Keep the rows that mention name
    .query("keep_flag == True")
    # Remove the rows that mentioned the other three people.
    .query("drop_flag == False")
    # Remove all the URLs
    .assign(
        Text=lambda x: x['Text'].str.replace(
            r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',
            '',
            regex=True)
    )
    # Drop the helper flag columns
    .drop(["keep_flag", "drop_flag"], axis="columns")
)
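To try the flag-then-query approach in isolation, here is a minimal self-contained sketch. The sample DataFrame, the filter_tweets helper name, and the simplified URL_PATTERN are all assumptions made up for illustration; they are not from the original notebook, so swap in your own data and regex.

```python
import pandas as pd

# Hypothetical sample data standing in for the real tweet DataFrames.
sample = pd.DataFrame({
    "Text": [
        "Johnny Depp wins https://example.com/news",
        "Amber Heard and Johnny Depp in court",
        "Unrelated tweet about the weather",
    ]
})

# Simplified URL pattern, assumed for this sketch only.
URL_PATTERN = r"\w+://\S+"


def filter_tweets(df, keep_pattern, remove_pattern):
    """Lowercase Text, keep rows matching keep_pattern,
    drop rows matching remove_pattern, then strip URLs."""
    return (
        df
        # Lowercase first so the flags below see the lowercased text
        .assign(Text=lambda x: x["Text"].str.lower())
        # Compute the boolean flags outside of query()
        .assign(
            keep_flag=lambda x: x["Text"].str.contains(keep_pattern, regex=True),
            drop_flag=lambda x: x["Text"].str.contains(remove_pattern, regex=True),
        )
        # query() now only compares plain boolean columns
        .query("keep_flag == True and drop_flag == False")
        # Remove URLs and any whitespace left behind at the ends
        .assign(
            Text=lambda x: x["Text"].str.replace(URL_PATTERN, "", regex=True).str.strip()
        )
        # Drop the helper flag columns
        .drop(columns=["keep_flag", "drop_flag"])
    )


result = filter_tweets(sample, "johnny|depp", "amber|heard")
print(result["Text"].tolist())
```

Only the first sample row survives: the second mentions the other person and the third mentions neither. Inside your preprocess loop you would pass keep_rows[key] and remove_rows[key] as the two patterns.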
Upvotes: 1