Reputation: 859
I have a dataframe full of email addresses. Since Gmail requires usernames of at least 6 characters, I want to filter my dataframe by removing any Gmail address whose username is shorter than six characters. For example, the dataframe df
>>> print(df)
email
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
5 [email protected]
would become:
email
2 [email protected]
3 [email protected]
4 [email protected]
Using
df = df[
    (len(df['email'].str.split('@').str[0]) >= 6) &
    (df['email'].str.split('@').str[1] == 'gmail.com')
]
will filter everything that isn't @gmail.com, so I can't use that. What I want is essentially the following (which obviously doesn't work and gives a TypeError: 'method' object is not subscriptable):
if df['email'].str.split['@'].str[1] == 'gmail.com':
    len(df['email'].str.split['@'].str[0]) >= 6
How do I accomplish this in a vectorized operation?
Upvotes: 0
Views: 1127
Reputation: 899
One way is to collect the matching indices in a list and then select just those rows:
ls = []
for i in range(0, len(df)):
    if df['email'][i].split('@')[1] == 'gmail.com':
        if len(df['email'][i].split('@')[0]) >= 6:
            ls.append(i)
df[df.index.isin(ls)]
Output:
email
1 [email protected]
2 [email protected]
3 [email protected]
Upvotes: 0
Reputation: 13387
You can do:
df=df.loc[~df.email.str.contains(r"^.{0,5}@gmail\.com$")]
Outputs:
email
1 [email protected]
2 [email protected]
3 [email protected]
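As a rough illustration of how that pattern behaves, here is a minimal sketch. The addresses are made up for the example (the ones in the post are not shown), and the names `emails`, `too_short` and `kept` are hypothetical. The pattern matches a whole string made of at most five characters followed by `@gmail.com`, and `~` inverts the mask so those rows are dropped:

```python
import pandas as pd

# Hypothetical sample addresses, invented for this sketch.
emails = pd.Series(["abc@gmail.com", "abcdef@gmail.com", "ab@yahoo.com"])

# True where the whole string is 0-5 characters followed by "@gmail.com".
too_short = emails.str.contains(r"^.{0,5}@gmail\.com$")

# Invert the mask to keep everything that is NOT a short gmail address.
kept = emails[~too_short]
print(kept)
```

Non-gmail addresses never match the pattern, so they are kept regardless of username length.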
Upvotes: 1
Reputation: 75080
You can use:
a = df['email'].str.contains('gmail')  # check if email has gmail
b = df['email'].str.split('@').str[0].str.len().ge(6)  # check if length before "@" >= 6
out = df[a & b | ~a]  # keep long-enough gmail usernames, plus everything that is not gmail
print(out)
email
2 [email protected]
3 [email protected]
4 [email protected]
Upvotes: 2
Reputation: 7211
See this:
>>> df[(df["email"].str.split("@").str[0].str.len() >= 6) | (df["email"].str.split("@").str[1] != 'gmail.com')]
email
1 [email protected]
2 [email protected]
3 [email protected]
Regarding your statement that this "will filter everything that isn't @gmail.com": that is not correct. You just need to get your boolean logic right (as above). Also, to measure the string length of each value in a dataframe column, you should use .str.len() rather than calling len on the whole column, because the latter returns the number of rows, not the per-row string lengths.
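To make the difference concrete, here is a small sketch on made-up data. The addresses and the names `parts`, `user_len`, `is_gmail` and `n_rows` are assumptions for illustration only, not taken from the question:

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch.
df = pd.DataFrame(
    {"email": ["abc@gmail.com", "abcdef@gmail.com",
               "ab@yahoo.com", "longname@gmail.com"]},
    index=[1, 2, 3, 4],
)

parts = df["email"].str.split("@")

# len() on a Series gives one number: how many rows it has.
n_rows = len(parts.str[0])

# .str.len() gives one number per row: the length of each username.
user_len = parts.str[0].str.len()

# Per-row check of the domain part.
is_gmail = parts.str[1] == "gmail.com"

# Keep a row if it is not gmail, or if its username has at least 6 characters.
out = df[~is_gmail | (user_len >= 6)]
print(out)
```

Comparing `n_rows >= 6` would yield a single True or False for the whole frame, which is why the original attempt could not work as a row filter.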
Upvotes: 1