Reputation: 131
I'm trying to remove duplicates within each row, then create a column with the count of unique items in each row for that user.
Current DataFrame:
   handle         tweet
0  CaptainNormal  [@WayneDupreeShow, #climatechange, @Wsow]
1  Cebel6         [@NWAJimmy, @NWAJimmy, @gaystoner821]
2  davidjwalling  [#infosec, #Intel, #ACM, #IEEE]
3  nolaguy_phd    [@gaystoner821]
Desired DataFrame:
   handle         tweet                                      count
0  CaptainNormal  [@WayneDupreeShow, #climatechange, @Wsow]  3
1  Cebel6         [@NWAJimmy, @gaystoner821]                 2
2  davidjwalling  [#infosec, #Intel, #ACM, #IEEE]            4
3  nolaguy_phd    [@gaystoner821]                            1
I've tried something like
df.tweet.apply(tuple).value_counts()
but it returns 1 for everything.
Upvotes: 2
Views: 627
Reputation: 862771
If the values are strings, first convert them to lists:
print (type(df.loc[0, 'tweet']))
<class 'str'>
import ast
df['tweet'] = df['tweet'].apply(ast.literal_eval)
Alternative:
df['tweet'] = df['tweet'].str.strip('[]').str.split(r',\s+')
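As a rough illustration of when each conversion applies, here is a small sketch on made-up strings (the sample values and the variable names `quoted`, `unquoted`, `parsed_quoted` and `parsed_unquoted` are assumptions, not the asker's data). `ast.literal_eval` needs the elements to be valid Python literals, i.e. quoted, while the strip/split route also works on unquoted text:

```python
import ast
import pandas as pd

# Hypothetical sample: list-like text whose elements are quoted
quoted = pd.Series(["['@NWAJimmy', '@NWAJimmy', '@gaystoner821']"])
# Hypothetical sample: list-like text without quotes around elements
unquoted = pd.Series(["[@NWAJimmy, @NWAJimmy, @gaystoner821]"])

# literal_eval parses the quoted text as a Python list literal
parsed_quoted = quoted.apply(ast.literal_eval)

# strip the surrounding brackets, then split on a comma followed by whitespace
parsed_unquoted = unquoted.str.strip('[]').str.split(r',\s+')

print(type(parsed_quoted[0]))
print(type(parsed_unquoted[0]))
```

Both results hold real lists, so the later steps work the same way on either one.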
And then convert to sets (which drops duplicates but does not keep the original order) and get the length:
print (type(df.loc[0, 'tweet']))
<class 'list'>
df['tweet'] = df['tweet'].apply(lambda x: list(set(x)))
df['count'] = df['tweet'].str.len()
print (df)
   handle         tweet                                      count
0  CaptainNormal  [#climatechange, @Wsow, @WayneDupreeShow]  3
1  Cebel6         [@NWAJimmy, @gaystoner821]                 2
2  davidjwalling  [#ACM, #IEEE, #infosec, #Intel]            4
3  nolaguy_phd    [@gaystoner821]                            1
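If the order of items within each row matters, one possible variation (a sketch, not part of the solution above) is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each item in its original position on Python 3.7+. The small frame below is a made-up subset of the question's data:

```python
import pandas as pd

# Hypothetical two-row subset of the question's DataFrame
df = pd.DataFrame({
    'handle': ['Cebel6', 'nolaguy_phd'],
    'tweet': [['@NWAJimmy', '@NWAJimmy', '@gaystoner821'], ['@gaystoner821']],
})

# dict.fromkeys keeps only the first occurrence of each item, in order
df['tweet'] = df['tweet'].apply(lambda x: list(dict.fromkeys(x)))

# .str.len() also works on list values and returns the number of items
df['count'] = df['tweet'].str.len()

print(df)
```

This costs about the same as the `set` version and only differs in that the surviving items stay in the order they first appeared.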
Upvotes: 1