Reputation: 1946
I'm trying to drop values from a dataframe that fuzzy match items in a list.
I have a dataframe (test_df) that looks like:
id email created_at
0 1 son@mail_a.com 2017-01-21 18:19:00
1 2 boy@mail_b.com 2017-01-22 01:19:00
2 3 girl@mail_c.com 2017-01-22 01:19:00
I have a list of a few hundred email domains that I am reading in from a txt file that looks like:
mail_a.com
mail_d.com
mail_e.com
I'm trying to drop from the dataframe any row that contains a matching email domain using:
email_domains = open('file.txt', 'r')
to_drop = email_domains.read().splitlines()
dropped_df = test_df[~test_df['email'].isin(to_drop)]
print(test_df)
So, the result should look like:
id email created_at
0 2 boy@mail_b.com 2017-01-22 01:19:00
1 3 girl@mail_c.com 2017-01-22 01:19:00
But the first row with "son@mail_a.com" is not dropped. Any suggestions?
Upvotes: 1
Views: 369
Reputation: 3157
Yet another answer...This is a one liner:
exclude = ['mail_a.com','mail_d.com','mail_e.com']
df[df.apply(lambda x: all([x['email'].rfind(ex) < 0 for ex in exclude]), axis=1)]
# OUTPUT
# Out[50]:
# created_at email id
# 1 2017-01-22 01:19:00 boy@mail_b.com 2
# 2 2017-01-22 01:19:00 girl@mail_c.com 3
Here I use the rfind returns -1 if the pattern isn't found.
Upvotes: 0
Reputation: 4199
you can use apply
and split
the string and use it for your isin
print test_df[~test_df['email'].apply(lambda x: x.split('@')[1]).isin(to_drop)]
results in
created_at email
1 2017-01-22 01:19:00 boy@mail_b.com
2 2017-01-22 01:19:00 girl@mail_c.com
Upvotes: 0
Reputation:
isin
looks for exact matches. Your condition is more suitable for endswith
or contains
:
df[~df['email'].str.endswith(tuple(to_drop))]
Out:
id email created_at
1 2 boy@mail_b.com 2017-01-22 01:19:00
2 3 girl@mail_c.com 2017-01-22 01:19:00
df[~df['email'].str.contains('|'.join(to_drop))]
Out:
id email created_at
1 2 boy@mail_b.com 2017-01-22 01:19:00
2 3 girl@mail_c.com 2017-01-22 01:19:00
Upvotes: 3
Reputation: 210932
It's pretty easy to parse domain name from the email, so we can first parse domains using .str.split('@')
and then check it using isin()
method:
In [12]: df[~df.email.str.split('@').str[1].isin(domains.domain)]
Out[12]:
id email created_at
1 2 boy@mail_b.com 2017-01-22 01:19:00
2 3 girl@mail_c.com 2017-01-22 01:19:00
where:
In [13]: domains
Out[13]:
domain
0 mail_a.com
1 mail_d.com
2 mail_e.com
Upvotes: 2