FunnyChef
FunnyChef

Reputation: 1946

Python Pandas - dropping multiple values based on a list

I'm trying to drop values from a dataframe that fuzzy match items in a list.

I have a dataframe (test_df) that looks like:

   id          email         created_at      
0  1   son@mail_a.com   2017-01-21 18:19:00  
1  2   boy@mail_b.com   2017-01-22 01:19:00  
2  3  girl@mail_c.com   2017-01-22 01:19:00 

I have a list of a few hundred email domains that I am reading in from a txt file that looks like:

mail_a.com
mail_d.com
mail_e.com

I'm trying to drop from the dataframe any row that contains a matching email domain using:

email_domains = open('file.txt', 'r')
to_drop = email_domains.read().splitlines()    
dropped_df = test_df[~test_df['email'].isin(to_drop)]
    print(test_df)

So, the result should look like:

   id          email         created_at       
0  2   boy@mail_b.com   2017-01-22 01:19:00  
1  3  girl@mail_c.com   2017-01-22 01:19:00 

But the first row with "son@mail_a.com" is not dropped. Any suggestions?

Upvotes: 1

Views: 369

Answers (4)

aquil.abdullah
aquil.abdullah

Reputation: 3157

Yet another answer...This is a one liner:

exclude = ['mail_a.com','mail_d.com','mail_e.com']
df[df.apply(lambda x: all([x['email'].rfind(ex) < 0 for ex in exclude]), axis=1)]
# OUTPUT
# Out[50]:
#              created_at            email  id
# 1   2017-01-22 01:19:00   boy@mail_b.com   2
# 2   2017-01-22 01:19:00  girl@mail_c.com   3

Here I use the rfind returns -1 if the pattern isn't found.

Upvotes: 0

plasmon360
plasmon360

Reputation: 4199

you can use apply and split the string and use it for your isin

print test_df[~test_df['email'].apply(lambda x: x.split('@')[1]).isin(to_drop)]

results in

            created_at            email
1  2017-01-22 01:19:00   boy@mail_b.com
2  2017-01-22 01:19:00  girl@mail_c.com

Upvotes: 0

user2285236
user2285236

Reputation:

isin looks for exact matches. Your condition is more suitable for endswith or contains:

df[~df['email'].str.endswith(tuple(to_drop))]
Out: 
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

df[~df['email'].str.contains('|'.join(to_drop))]
Out: 
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

Upvotes: 3

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210932

It's pretty easy to parse domain name from the email, so we can first parse domains using .str.split('@') and then check it using isin() method:

In [12]: df[~df.email.str.split('@').str[1].isin(domains.domain)]
Out[12]:
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

where:

In [13]: domains
Out[13]:
       domain
0  mail_a.com
1  mail_d.com
2  mail_e.com

Upvotes: 2

Related Questions