Reputation: 45
Here is my dataset 'new.csv'. I also posted a quick overview of it here:
https://drive.google.com/file/d/17xbwgp9siPuWsPBN5rUL9VSYwl7eU0ca/view?usp=sharing
Dominant_Topic Hashtags
0 4.0 [#boycottmulan, #blacklivesmatter]
1 8.0 []
2 8.0 [#blacklivesmatter, #protests]
3 4.0 [#blacklivesmatter, #swoleinsolidarity]
4 4.0 [#starlink]
... ... ...
15995 4.0 [#verizon5gaccess, #oscars, #verizon5gaccess, #sweepstakes, #glennclose]
15996 5.0 [#blacklivesmatter, #lifewithkellykhumalo, #bushiri, #vodacomnxtlvl, #amakhosi4life]
15997 8.0 [#blm, #blacklivesmatter, #trumpkillsamericans]
15998 3.0 [#tmobiletrucks, #str, #allin, #tmogoeslocal]
15999 0.0 [#tdcsvm, #officialtdcent, #dmv, #blacklivesmatter, #fox5dc]
My target is to drop all hashtags that appear fewer than, say, 80 times in the dataset 'new', and use the remaining hashtags to build a matrix like the following:
That is, starting from column 4, each cell is coded 1 or 0 to indicate whether a given hashtag (one of the remaining hashtags that appeared at least 80 times in the dataset) exists in the "Hashtags" column.
I computed the hashtag counts as follows:
# keep only rows with a non-empty hashtag list
hashtags_list_new = new.loc[new.Hashtags.apply(lambda hashtags_list: hashtags_list != []), ['Hashtags']]
# which hashtags were popular?
# create a dataframe where each use of a hashtag gets its own row
flattened_hashtags_new = pd.DataFrame(
    [hashtag for hashtags_list in hashtags_list_new.Hashtags for hashtag in hashtags_list],
    columns=['Hashtags'])
# count the appearances of each hashtag
popular_hashtags_new = flattened_hashtags_new.groupby('Hashtags').size().reset_index(name='counts') \
    .sort_values('counts', ascending=False) \
    .reset_index(drop=True)
popular_hashtags_new
The result is :
Hashtags counts
0 #blacklivesmatter 11379
1 #blm 1022
2 #jacobblake 565
3 #verizon5gaccess 510
4 #sweepstakes 496
... ... ...
1484 #idolportraits 11
1485 #augmentedreality 11
1486 #smartphones 11
1487 #ar 11
1488 #cx 11
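(For reference, the same counts can also be computed more directly with pandas' explode, available since pandas 0.25. A minimal sketch on a toy dataframe, not the real 'new.csv':)

```python
import pandas as pd

# toy stand-in for the real 'new' dataframe
new = pd.DataFrame({
    'Dominant_Topic': [4.0, 8.0, 8.0],
    'Hashtags': [['#blm', '#blacklivesmatter'], [], ['#blacklivesmatter', '#protests']],
})

# explode turns each list element into its own row;
# rows with empty lists become NaN, so drop them before counting
popular = (
    new['Hashtags']
    .explode()
    .dropna()
    .value_counts()
    .rename_axis('Hashtags')
    .reset_index(name='counts')
)
```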
But I have no idea how to achieve my target from here. Can anyone help me solve it? Thank you for your attention.
Upvotes: 0
Views: 108
Reputation: 11321
Could you check if this fits your needs (I'm assuming your base dataframe is named new):
from collections import Counter
counts = new.Hashtags.map(Counter).sum().most_common(80)
top_hashtags = [ht for ht, _ in counts]
hashtags = new.Hashtags.map(set)
base_cols = ['Dominant_Topic', 'Hashtags']
matrix = pd.concat(
    [new[base_cols]] +
    [hashtags.map(lambda hts: ht in hts).astype(int) for ht in top_hashtags],
    axis='columns'
)
matrix.columns = base_cols + top_hashtags
I'm using a Counter and its method most_common to select the top 80 hashtags (counts), and then take the keys to produce a sorted list of those top 80 hashtags (top_hashtags). Then I convert the lists in new.Hashtags into sets (hashtags), because membership tests on sets are much more efficient. Afterwards I create, for each top hashtag, a 0/1 series indicating whether that hashtag is present in the respective list in new.Hashtags, concatenate them all, with the new dataframe at the beginning, and name the result matrix.
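One caveat: most_common(80) selects the 80 most frequent hashtags, whereas the question asks for hashtags that appear at least 80 times, which is not the same set. If the latter is what you want, you can swap the selection for a threshold filter. A minimal sketch on a toy dataframe (with a hypothetical MIN_COUNT = 2 standing in for 80):

```python
from collections import Counter
import pandas as pd

# toy stand-in for the real 'new' dataframe
new = pd.DataFrame({
    'Dominant_Topic': [4.0, 8.0],
    'Hashtags': [['#blm', '#blacklivesmatter'], ['#blacklivesmatter']],
})

# count every hashtag occurrence across all rows
counter = Counter(ht for row in new.Hashtags for ht in row)

MIN_COUNT = 2  # would be 80 on the real dataset
# keep only hashtags meeting the threshold, in descending frequency order
top_hashtags = [ht for ht, c in counter.most_common() if c >= MIN_COUNT]

# sets make the per-hashtag membership tests fast
hashtag_sets = new.Hashtags.map(set)
matrix = new[['Dominant_Topic', 'Hashtags']].copy()
for ht in top_hashtags:
    matrix[ht] = hashtag_sets.map(lambda s, ht=ht: int(ht in s))
```

The `ht=ht` default argument pins the current hashtag inside the loop, so each column tests the right hashtag.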
Upvotes: 1