qrs

Reputation: 131

pandas - remove duplicates, count items, from columns with multiple values

I'm trying to remove duplicates within each row, then create a column with the count of unique items in each row for that user.

Current DataFrame

    handle            tweet
0   CaptainNormal     [@WayneDupreeShow, #climatechange, @Wsow]
1   Cebel6            [@NWAJimmy, @NWAJimmy, @gaystoner821]
2   davidjwalling     [#infosec, #Intel, #ACM, #IEEE]
3   nolaguy_phd       [@gaystoner821]

Desired DataFrame

    handle            tweet                                        count
0   CaptainNormal     [@WayneDupreeShow, #climatechange, @Wsow]    3
1   Cebel6            [@NWAJimmy, @gaystoner821]                   2
2   davidjwalling     [#infosec, #Intel, #ACM, #IEEE]              4
3   nolaguy_phd       [@gaystoner821]                              1

I've tried something like

df.tweet.apply(tuple).value_counts()

but it returns 1 for everything.

Upvotes: 2

Views: 627

Answers (1)

jezrael

Reputation: 862771

If the values are strings rather than lists, convert them first:

print (type(df.loc[0, 'tweet']))
<class 'str'>

import ast
df['tweet'] = df['tweet'].apply(ast.literal_eval)

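Alternatively, if the items inside the strings are not quoted, strip the brackets and split on commas (note the raw string for the regex):

df['tweet'] = df['tweet'].str.strip('[]').str.split(r',\s+')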

Then convert each list to a set to drop duplicates, convert it back to a list, and take the length:

print (type(df.loc[0, 'tweet']))
<class 'list'>

df['tweet'] = df['tweet'].apply(lambda x: list(set(x)))
df['count'] = df['tweet'].str.len()
print (df)
          handle                                      tweet  count
0  CaptainNormal  [#climatechange, @Wsow, @WayneDupreeShow]      3
1         Cebel6                 [@NWAJimmy, @gaystoner821]      2
2  davidjwalling            [#ACM, #IEEE, #infosec, #Intel]      4
3    nolaguy_phd                            [@gaystoner821]      1
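
Note that set() does not keep the original order of the items, which is why the lists above come out shuffled compared with the desired output. If order matters, one option is dict.fromkeys, which keeps the first occurrence of each item. A minimal sketch, rebuilding the sample data from the question and assuming Python 3.7+ (where dicts preserve insertion order):

```python
import pandas as pd

# Sample data rebuilt from the question; each cell is a real list, not a string
df = pd.DataFrame({
    "handle": ["CaptainNormal", "Cebel6", "davidjwalling", "nolaguy_phd"],
    "tweet": [
        ["@WayneDupreeShow", "#climatechange", "@Wsow"],
        ["@NWAJimmy", "@NWAJimmy", "@gaystoner821"],
        ["#infosec", "#Intel", "#ACM", "#IEEE"],
        ["@gaystoner821"],
    ],
})

# dict.fromkeys keeps only the first occurrence of each item, in original order
df["tweet"] = df["tweet"].apply(lambda x: list(dict.fromkeys(x)))

# .str.len() also works on list-valued cells and returns the list length
df["count"] = df["tweet"].str.len()
print(df)
```

This should give the same counts as the set-based version, with each list left in its original order.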

Upvotes: 1
