Victoria L

Reputation: 45

Transform dataframes into matrix in Python for hashtag sets

Here is my dataset 'new.csv' (link below), along with a quick overview of its contents:

https://drive.google.com/file/d/17xbwgp9siPuWsPBN5rUL9VSYwl7eU0ca/view?usp=sharing

    Dominant_Topic  Hashtags
0       4.0         [#boycottmulan, #blacklivesmatter]
1       8.0         []
2       8.0         [#blacklivesmatter, #protests]
3       4.0         [#blacklivesmatter, #swoleinsolidarity]
4       4.0         [#starlink]
... ... ...
15995   4.0         [#verizon5gaccess, #oscars, #verizon5gaccess, #sweepstakes, #glennclose]
15996   5.0         [#blacklivesmatter, #lifewithkellykhumalo, #bushiri, #vodacomnxtlvl, #amakhosi4life]
15997   8.0         [#blm, #blacklivesmatter, #trumpkillsamericans]
15998   3.0         [#tmobiletrucks, #str, #allin, #tmogoeslocal]
15999   0.0         [#tdcsvm, #officialtdcent, #dmv, #blacklivesmatter, #fox5dc]

My goal is to drop all hashtags that appear fewer than, say, 80 times in the dataset 'new', and use the remaining hashtags to form a matrix (from column 4 onward) like the following:

[image: example of the target matrix]

As you can see, starting from column 4, each column is coded 1 or 0 to indicate whether a given hashtag (one of the remaining hashtags that appear more than 80 times in the dataset) exists in the "Hashtags" column.

I computed the hashtag counts as follows:

import pandas as pd

# keep only the rows that have at least one hashtag
hashtags_list_new = new.loc[new.Hashtags.apply(lambda hashtags_list: hashtags_list != []), ['Hashtags']]

# which hashtags were popular?
# create a dataframe where each use of a hashtag gets its own row
flattened_hashtags_new = pd.DataFrame(
    [hashtag for hashtags_list in hashtags_list_new.Hashtags for hashtag in hashtags_list],
    columns=['Hashtags'],
)

# count of appearances of each hashtag, most frequent first
popular_hashtags_new = (
    flattened_hashtags_new.groupby('Hashtags').size().reset_index(name='counts')
    .sort_values('counts', ascending=False)
    .reset_index(drop=True)
)
popular_hashtags_new

The result is :

        Hashtags            counts
0       #blacklivesmatter   11379
1       #blm                1022
2       #jacobblake         565
3       #verizon5gaccess    510
4       #sweepstakes        496
... ... ...
1484    #idolportraits      11
1485    #augmentedreality   11
1486    #smartphones        11
1487    #ar                 11
1488    #cx                 11

But I have no idea how to reach my goal. Can anyone help me solve this? Thank you for your attention.

Upvotes: 0

Views: 108

Answers (1)

Timus

Reputation: 11321

Could you check if this fits your needs (I'm assuming your base dataframe is named new):

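from collections import Counter

import pandas as pd

# count every hashtag occurrence and keep the 80 most frequent hashtags
counts = new.Hashtags.map(Counter).sum().most_common(80)
top_hashtags = [ht for ht, _ in counts]
# sets make the membership tests below cheaper than lists
hashtags = new.Hashtags.map(set)
base_cols = ['Dominant_Topic', 'Hashtags']
# one 0/1 column per top hashtag, placed after the base columns
matrix = pd.concat(
    [new[base_cols]] +
    [
        hashtags.map(lambda hts: ht in hts).astype(int)
        for ht in top_hashtags
    ],
    axis='columns'
)
matrix.columns = base_cols + top_hashtags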

I'm using a Counter and its most_common method to select the 80 most frequent hashtags (counts), and then take the keys to produce a list of those hashtags sorted by frequency (top_hashtags). Then I convert the lists in new.Hashtags into sets (hashtags), because membership tests on sets are much more efficient. Afterwards, for each top hashtag I create a 0/1 series that indicates whether the hashtag is present in the respective list in new.Hashtags, concatenate them all with the base columns of new at the beginning, and name the result matrix.
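
Note that most_common(80) keeps the 80 most frequent hashtags rather than applying a count threshold. If what you need is every hashtag that occurs at least 80 times, a minimal sketch could look like the one below. The function name hashtag_matrix, its min_count parameter and the toy dataframe are made up for illustration, and I'm assuming "at least 80" is the intended cut-off (the question mentions both "less than 80" and "more than 80"):

```python
from collections import Counter

import pandas as pd


def hashtag_matrix(df, min_count=80, col="Hashtags"):
    """Append one 0/1 column per hashtag occurring at least `min_count` times.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Count every occurrence of every hashtag across all rows.
    counts = Counter(ht for hts in df[col] for ht in hts)
    # Keep the hashtags that reach the threshold, most frequent first.
    kept = [ht for ht, n in counts.most_common() if n >= min_count]
    # Sets make the per-row membership test cheap.
    tag_sets = df[col].map(set)
    indicators = pd.DataFrame(
        {ht: tag_sets.map(lambda hts: int(ht in hts)) for ht in kept},
        index=df.index,
    )
    return pd.concat([df, indicators], axis="columns")


# Toy data (invented), with a low threshold so the effect is visible.
toy = pd.DataFrame({
    "Dominant_Topic": [4.0, 8.0, 8.0, 4.0],
    "Hashtags": [
        ["#boycottmulan", "#blacklivesmatter"],
        [],
        ["#blacklivesmatter", "#protests"],
        ["#blacklivesmatter", "#protests"],
    ],
})
result = hashtag_matrix(toy, min_count=2)
print(result)
```

For the real data, something like matrix = hashtag_matrix(new, min_count=80) should give the thresholded matrix. As in the code above, a hashtag repeated within one row is counted each time it appears.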

Upvotes: 1
