Aniss Chohra
Aniss Chohra

Reputation: 453

Pandas drop duplicates on one column and keep only rows with the most frequent value in another column

I have a dataframe that looks as the following:

ip_address    malware_type
ip_1          malware_1
ip_2          malware_2
ip_1          malware_1
ip_1          malware_1
ip_1          malware_2
ip_2          malware_2
ip_2          malware_3
.
.
.

I want to drop duplicate rows based on the 'ip_address' column, however when the dropping occurs, I want to keep only the 'malware_type' value that is the most frequent for each IP. So the resulting dataframe should look like:

ip_address    malware_type
ip_1          malware_1
ip_2          malware_2
.
.
.

I would really appreciate any help to achieve the above. Thanks.

Upvotes: 3

Views: 2299

Answers (2)

Ch3steR
Ch3steR

Reputation: 20669

You can use GroupBy.agg with pd.Series.mode

df.groupby('ip_address').malware_type.agg(pd.Series.mode)

ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object

You can use scipy.stats.mode here.

from scipy.stats import mode
df.groupby('ip_address').malware_type.agg(lambda x: mode(x).mode)

ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object

Another is to use collection.Counter's most_common method.

def md(s):
    c = Counter(s)
    return c.most_common(1)[0][0]

df.groupby('ip_address').malware_type.agg(md)

ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object

Upvotes: 0

BENY
BENY

Reputation: 323326

Let us try mode

s=df.groupby('ip_address').malware_type.agg(lambda x : x.mode()[0]) # .reset_index()
Out[56]: 
ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object

Upvotes: 5

Related Questions