Reputation: 456
I'm playing around with a pandas DataFrame containing two columns: 'tweet_text' and 'cyberbullying_type'. It was created from this dataset as follows:
import pandas as pd
df = pd.read_csv('data/cyberbullying_tweets.csv')
I'm currently trying to count the total number of hashtags used in each 'cyberbullying_type' using two different methods, each of which (I think) counts duplicates. However, the two methods give me different answers:
import re
# Define the pattern for valid hashtags
hashtag_pattern = r'#[A-Za-z0-9]+'
# Function to count the total number of hashtags in a dataframe
def count_total_hashtags(dataframe):
    return dataframe['tweet_text'].str.findall(hashtag_pattern).apply(len).sum()
for category in df['cyberbullying_type'].unique():
    count = count_total_hashtags(df[df['cyberbullying_type'] == category])
    print(f"Number of hashtags in all tweets for the '{category}' category: {count}")
Output: 'not_cyberbullying': 3265, 'gender': 2691, 'religion': 1798, 'other_cyberbullying': 1625, 'age': 728, 'ethnicity': 1112
The next method is more manual:
def count_hashtags_by_category(dataframe):
    hashtag_counts = {}
    for category in dataframe['cyberbullying_type'].unique():
        # Filter tweets by category
        category_tweets = dataframe[dataframe['cyberbullying_type'] == category]
        # Count hashtags in each tweet
        hashtag_counts[category] = category_tweets['tweet_text'].apply(
            lambda text: sum(1 for word in text.split() if word.startswith('#') and word[1:].isalnum())
        ).sum()
    return hashtag_counts
# Count hashtags for each category
hashtags_per_category = count_hashtags_by_category(df)
print(hashtags_per_category)
The output: {'not_cyberbullying': 3018, 'gender': 2416, 'religion': 1511, 'other_cyberbullying': 1465, 'age': 679, 'ethnicity': 956}
Why do the answers differ?
Upvotes: 1
Views: 41
Reputation: 77
Why not use the count method on strings?
s = "#hello #world"
s.count("#") # 2
Upvotes: 0
Reputation: 262214
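Your two methods are not strictly identical. For instance, #YolsuzlukVeRüşvetYılı2014 won't be fully matched by the regex (the match stops at the first non-ASCII character), but will be matched in full by the split+alnum approach, since str.isalnum also accepts non-ASCII letters. Conversely, the regex finds hashtags anywhere in the text, whereas split+alnum only counts whitespace-separated tokens made of # followed purely by alphanumeric characters, so a tag with trailing punctuation such as #tag! is counted by the regex but not by the split. Also note that hashtags containing _ are valid, yet both approaches mishandle them: the regex truncates them at the underscore and split+alnum skips them entirely.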
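To make the difference concrete, here is a minimal sketch that runs both counting rules from the question on a few made-up strings. The sample tweets and the helper names count_regex and count_split are illustrative assumptions, not rows or functions from the dataset or the question.

```python
import re

hashtag_pattern = r'#[A-Za-z0-9]+'

def count_regex(text):
    # Counts every '#' followed by at least one ASCII letter or digit, anywhere in the text
    return len(re.findall(hashtag_pattern, text))

def count_split(text):
    # Counts whitespace-separated tokens that are '#' followed only by alphanumeric characters
    return sum(1 for word in text.split() if word.startswith('#') and word[1:].isalnum())

# Hypothetical sample tweets, chosen to hit the edge cases
samples = [
    "great day #sunny",            # plain hashtag
    "so true #agree!",             # trailing punctuation
    "#one#two",                    # two hashtags without a space
    "email me at a#b",             # '#' in the middle of a token
    "#snake_case",                 # underscore
    "#YolsuzlukVeRüşvetYılı2014",  # non-ASCII letters
]

for text in samples:
    print(f"{text!r}: regex={count_regex(text)}, split={count_split(text)}")
```

In this sketch the regex counts at least as many hashtags as the split on every sample, which would be consistent with the regex totals in the question being higher in every category.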
I would suggest a simpler approach. Combine str.count and groupby.sum; this will be shorter and much more efficient than manually looping over the categories:
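import pandas as pd
hashtag_pattern = r'#\w+' # short regex for hashtags (\w already includes _)
df = pd.read_csv('twitter_parsed_dataset.csv')
# Count matches per row, then sum the counts within each category
df['Text'].str.count(hashtag_pattern).groupby(df['Annotation']).sum()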
Example output:
Annotation
none      6402.0
racism     287.0
sexism    2103.0
Name: Text, dtype: float64
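The same idea can be tried without downloading anything. The sketch below applies str.count and groupby.sum to a tiny hand-made DataFrame that reuses the column names from the question; the rows themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV, using the question's column names
df = pd.DataFrame({
    'tweet_text': ["#a #b hello", "no tags here", "#c", "#d_e and #f"],
    'cyberbullying_type': ['age', 'age', 'gender', 'gender'],
})

hashtag_pattern = r'#\w+'  # \w covers letters, digits and underscore

# Count matches per tweet, then sum the per-tweet counts within each category
counts = (
    df['tweet_text']
    .str.count(hashtag_pattern)
    .groupby(df['cyberbullying_type'])
    .sum()
)
print(counts)
```

If the text column contains missing values, str.count returns NaN for those rows and the sums come back as floats, which is presumably why the example output above shows values like 6402.0.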
If you want a more complex regex to extract hashtags (e.g. to ignore #1 as a hashtag), you can refer to this question.
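As one possible illustration (my own sketch, not taken from the linked question), a negative lookahead can skip tags that consist only of digits. The pattern name strict_hashtag_pattern and the sample text are assumptions made for this example.

```python
import re

# Assumption: a "valid" hashtag here is '#' followed by word characters,
# excluding tags made up only of digits (such as '#1').
strict_hashtag_pattern = r'#(?!\d+\b)\w+'

text = "ranked #1 in #2014 with #python3 and #2fast"
matches = re.findall(strict_hashtag_pattern, text)
print(matches)  # ['#python3', '#2fast']
```

The same pattern string can be passed to str.count in place of the shorter regex, since pandas uses Python's re syntax for its string methods.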
Upvotes: 0