Counting the hashtags in a collection of tweets: two methods with inconsistent results

I'm playing around with a pandas DataFrame containing two columns: 'tweet_text' and 'cyberbullying_type'. It was created from this dataset as follows:

import pandas as pd

df = pd.read_csv('data/cyberbullying_tweets.csv')

I'm currently trying to count the total number of hashtags used in each 'cyberbullying_type' using two different methods, each of which, I think, counts duplicates. However, the two methods give me different answers:

First Method:

import re

# Define the pattern for valid hashtags
hashtag_pattern = r'#[A-Za-z0-9]+'

# Function to count the total number of hashtags in a dataframe
def count_total_hashtags(dataframe):
    return dataframe['tweet_text'].str.findall(hashtag_pattern).apply(len).sum()

for category in df['cyberbullying_type'].unique():
    count = count_total_hashtags(df[df['cyberbullying_type'] == category])
    print(f"Number of hashtags in all tweets for the '{category}' category: {count}")

Output: 'not_cyberbullying': 3265, 'gender': 2691, 'religion': 1798, 'other_cyberbullying': 1625, 'age': 728, 'ethnicity': 1112

Second Method:

The next method is more manual:

def count_hashtags_by_category(dataframe):
    hashtag_counts = {}
    for category in dataframe['cyberbullying_type'].unique():
        # Filter tweets by category
        category_tweets = dataframe[dataframe['cyberbullying_type'] == category]
        
        # Count hashtags in each tweet
        hashtag_counts[category] = category_tweets['tweet_text'].apply(
            lambda text: sum(1 for word in text.split() if word.startswith('#') and word[1:].isalnum())
        ).sum()
    
    return hashtag_counts

# Count hashtags for each category
hashtags_per_category = count_hashtags_by_category(df)
print(hashtags_per_category)

The output: {'not_cyberbullying': 3018, 'gender': 2416, 'religion': 1511, 'other_cyberbullying': 1465, 'age': 679, 'ethnicity': 956}

Why do the answers differ?

Upvotes: 1

Answers (2)

mfeyx

Reputation: 77

Why not use the count method on strings?

s = "#hello #world"
s.count("#")  # 2

Upvotes: 0

mozway

Reputation: 262214

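Your two methods are not strictly identical. The regex searches anywhere in the text and accepts partial matches, while the split+isalnum approach only counts a whitespace-delimited token that is a # followed exclusively by alphanumeric characters. For instance, #YolsuzlukVeRüşvetYılı2014 is only partially matched by the regex (as #YolsuzlukVeR, since [A-Za-z0-9] is ASCII-only), whereas the split+isalnum approach accepts the full tag because str.isalnum also accepts non-ASCII letters. Conversely, a hashtag followed by punctuation (#tag,), glued to other text (#one#two), or containing _ is still counted by the regex through a partial match but rejected by the split approach, which would explain why the regex totals are higher. Also note that neither approach captures hashtags containing _ in full, although they are valid.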
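
To see where the two approaches diverge, here is a minimal sketch that applies both counting rules from the question to a few made-up strings. The sample texts and the helper names count_regex and count_split are illustrative assumptions, not part of the original dataset or code.

```python
import re

# Pattern from the question's first method (ASCII letters and digits only)
hashtag_pattern = r'#[A-Za-z0-9]+'

def count_regex(text):
    # First method: count every regex match anywhere in the text
    return len(re.findall(hashtag_pattern, text))

def count_split(text):
    # Second method: count whitespace-delimited tokens that are
    # '#' followed only by alphanumeric characters
    return sum(1 for word in text.split()
               if word.startswith('#') and word[1:].isalnum())

# Made-up examples (assumptions for illustration only)
samples = [
    "#hello #world",               # plain hashtags: both methods agree
    "great day #sunny, right?",    # trailing punctuation
    "#one#two",                    # hashtags glued together
    "#foo_bar",                    # underscore inside the tag
    "#YolsuzlukVeRüşvetYılı2014",  # non-ASCII letters
]

for text in samples:
    print(f"{text!r}: regex={count_regex(text)}, split={count_split(text)}")
```

On these samples the two counts agree for the first and last strings, while the regex counts the punctuation, glued, and underscore cases that the split approach skips.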

I would suggest a simpler approach. Combine str.count and groupby.sum; this will be shorter and much more efficient than manually looping over the categories:

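import pandas as pd

hashtag_pattern = r'#\w+'  # short regex for hashtags (\w already includes _)

# Example on a different dataset, with 'Text' and 'Annotation' columns
df = pd.read_csv('twitter_parsed_dataset.csv')
df['Text'].str.count(hashtag_pattern).groupby(df['Annotation']).sum()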

Example output:

Annotation
none      6402.0
racism     287.0
sexism    2103.0
Name: Text, dtype: float64

If you want a more complex regex to extract hashtags (e.g. to avoid treating #1 as a hashtag), you can refer to this question.
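
As a rough illustration of what a stricter pattern might look like, the sketch below requires at least one letter after the #, so purely numeric tags such as #1 are skipped. The pattern and the find_hashtags helper are assumptions made for this example, not taken from the linked question.

```python
import re

# Hypothetical stricter pattern: '#' followed by word characters,
# at least one of which must be a letter (the lookahead enforces this).
strict_hashtag = re.compile(r'#(?=\w*[^\W\d_])\w+')

def find_hashtags(text):
    # Return all hashtags in the text that contain at least one letter
    return strict_hashtag.findall(text)

print(find_hashtags("We're #1! #winning #2014 #top_10"))
# ['#winning', '#top_10']
```

Since the pattern is an ordinary regular expression, its string form should also work with str.count in the same way as hashtag_pattern above.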

Upvotes: 0
