Reputation: 453
I have a dataframe that looks like the following:
ip_address malware_type
ip_1 malware_1
ip_2 malware_2
ip_1 malware_1
ip_1 malware_1
ip_1 malware_2
ip_2 malware_2
ip_2 malware_3
.
.
.
I want to drop duplicate rows based on the 'ip_address' column; however, when dropping, I want to keep only the 'malware_type' value that is most frequent for each IP. So the resulting dataframe should look like:
ip_address malware_type
ip_1 malware_1
ip_2 malware_2
.
.
.
I would really appreciate any help to achieve the above. Thanks.
Upvotes: 3
Views: 2299
Reputation: 20669
You can use GroupBy.agg with pd.Series.mode:
df.groupby('ip_address').malware_type.agg(pd.Series.mode)
ip_address
ip_1 malware_1
ip_2 malware_2
Name: malware_type, dtype: object
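One caveat: pd.Series.mode returns every value tied for most frequent, so a group with a tie yields an array rather than a scalar. A minimal sketch on a hypothetical frame mirroring the question's data, taking the first mode to force a single value per group:

```python
import pandas as pd

# Hypothetical data matching the layout in the question.
df = pd.DataFrame({
    'ip_address': ['ip_1', 'ip_2', 'ip_1', 'ip_1', 'ip_1', 'ip_2', 'ip_2'],
    'malware_type': ['malware_1', 'malware_2', 'malware_1', 'malware_1',
                     'malware_2', 'malware_2', 'malware_3'],
})

# x.mode() returns tied values in sorted order, so .iloc[0]
# picks one deterministically even when there is a tie.
out = df.groupby('ip_address').malware_type.agg(lambda x: x.mode().iloc[0])
print(out)
```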
You can also use scipy.stats.mode here. Its .mode attribute is an array in older SciPy, so take the first element; note that recent SciPy versions no longer accept non-numeric data in mode.
from scipy.stats import mode
df.groupby('ip_address').malware_type.agg(lambda x: mode(x).mode[0])
ip_address
ip_1 malware_1
ip_2 malware_2
Name: malware_type, dtype: object
Another option is to use collections.Counter's most_common method.
from collections import Counter

def md(s):
    # Return the single most frequent value in the group.
    c = Counter(s)
    return c.most_common(1)[0][0]

df.groupby('ip_address').malware_type.agg(md)
ip_address
ip_1 malware_1
ip_2 malware_2
Name: malware_type, dtype: object
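Assuming the same hypothetical frame as in the question, the Counter approach runs end to end like this. One design note: most_common breaks ties by first appearance within the group, whereas Series.mode sorts tied values.

```python
import pandas as pd
from collections import Counter

# Hypothetical data matching the layout in the question.
df = pd.DataFrame({
    'ip_address': ['ip_1', 'ip_2', 'ip_1', 'ip_1', 'ip_1', 'ip_2', 'ip_2'],
    'malware_type': ['malware_1', 'malware_2', 'malware_1', 'malware_1',
                     'malware_2', 'malware_2', 'malware_3'],
})

def md(s):
    # Most frequent value; ties resolve to the value seen first.
    return Counter(s).most_common(1)[0][0]

out = df.groupby('ip_address').malware_type.agg(md)
print(out)
```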
Upvotes: 0
Reputation: 323326
Let us try Series.mode:
s=df.groupby('ip_address').malware_type.agg(lambda x : x.mode()[0]) # .reset_index()
ip_address
ip_1 malware_1
ip_2 malware_2
Name: malware_type, dtype: object
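The commented-out .reset_index() turns the resulting Series back into a two-column frame like the one in the question. A minimal sketch, assuming the same hypothetical data:

```python
import pandas as pd

# Hypothetical data matching the layout in the question.
df = pd.DataFrame({
    'ip_address': ['ip_1', 'ip_2', 'ip_1', 'ip_1', 'ip_1', 'ip_2', 'ip_2'],
    'malware_type': ['malware_1', 'malware_2', 'malware_1', 'malware_1',
                     'malware_2', 'malware_2', 'malware_3'],
})

s = df.groupby('ip_address').malware_type.agg(lambda x: x.mode()[0])
result = s.reset_index()  # back to an ip_address / malware_type DataFrame
print(result)
```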
Upvotes: 5