python pandas

DrakeMurdoch

Reputation: 859

Filter by condition if string contains certain substring

I have a dataframe full of emails. Since Gmail usernames have a six-character minimum, I want to filter my dataframe by dropping any Gmail address whose username has fewer than six characters. For example, the dataframe df

>> print(df)

        email          
1   a@gmail.com
2   real.email@gmail.com
3   no.email@email.com
4   real@yahoo.com
5   poo@gmail.com

would become:

        email                     
2   real.email@gmail.com
3   no.email@email.com
4   real@yahoo.com

Using

df = df[
        (len(df['email'].str.split('@').str[0]) >= 6)
        (df['email'].str.split('@').str[1] == 'gmail.com')
       ]

will filter everything that isn't @gmail.com, so I can't use that. What I want is essentially (which obviously doesn't work and gives a TypeError: 'method' object is not subscriptable)

if df['email'].str.split['@'].str[1] == 'gmail.com':
    len(df['email'].str.split['@'].str[0]) >= 6

How do I accomplish this in a vectorized operation?

Upvotes: 0

Answers (4)

Gary

Reputation: 899

One way is to store the index in a list and then display just those indices:

ls = []
for i in range(len(df)):
    user, domain = df['email'][i].split('@')
    # keep every non-gmail row, and gmail rows whose username has 6+ characters
    if domain != 'gmail.com' or len(user) >= 6:
        ls.append(i)

df[df.index.isin(ls)]

Output:

                  email
1  real.email@gmail.com
2    no.email@email.com
3        real@yahoo.com

Upvotes: 0

Georgina Skibinski

Reputation: 13407

You can do:

df=df.loc[~df.email.str.contains(r"^.{0,5}@gmail\.com$")]

Outputs:

                  email
1  real.email@gmail.com
2    no.email@email.com
3        real@yahoo.com
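
To make the regex approach easier to follow, here is a minimal sketch that rebuilds the sample data from the question (with a default 0-based index, which is an assumption) and names the intermediate mask. The variable names `short_gmail` and `result` are illustrative only.

```python
import pandas as pd

# Sample data taken from the question (default 0-based index assumed).
df = pd.DataFrame({"email": [
    "a@gmail.com",
    "real.email@gmail.com",
    "no.email@email.com",
    "real@yahoo.com",
    "poo@gmail.com",
]})

# ^.{0,5}@gmail\.com$ matches a whole string in which at most five
# characters come before "@gmail.com".
short_gmail = df["email"].str.contains(r"^.{0,5}@gmail\.com$")

# Invert the mask to keep everything that is NOT a short gmail address.
result = df.loc[~short_gmail]
print(result)
```

Because the pattern is anchored with `^` and `$`, addresses on other domains never match and are always kept.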

Upvotes: 1

anky

Reputation: 75140

You can use:

a = df['email'].str.contains('gmail') #check if email has gmail
b = df['email'].str.split('@').str[0].str.len().ge(6) #check if length before "@" >= 6
out = df[a&b|~a]

print(out)

                  email
2  real.email@gmail.com
3    no.email@email.com
4        real@yahoo.com
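
The boundary case is worth checking separately. The sketch below uses hypothetical addresses (not from the question) and assumes that a username of exactly six characters should be kept, so it compares with `.ge(6)` and matches the domain exactly rather than by substring.

```python
import pandas as pd

# Hypothetical sample that includes a six-character gmail username.
df = pd.DataFrame({"email": [
    "abcdef@gmail.com",  # exactly six characters: keep
    "abcde@gmail.com",   # five characters: drop
    "abc@yahoo.com",     # not gmail: keep regardless of length
]})

parts = df["email"].str.split("@")
is_gmail = parts.str[1].eq("gmail.com")     # exact domain match
long_enough = parts.str[0].str.len().ge(6)  # username has 6+ characters

# Keep gmail rows that are long enough, plus every non-gmail row.
out = df[(is_gmail & long_enough) | ~is_gmail]
print(out)
```

The explicit parentheses are optional, since `&` binds more tightly than `|`, but they make the intent clearer.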

Upvotes: 2

adrtam

Reputation: 7241

See this:

>>> df[(df["email"].str.split("@").str[0].str.len() >= 6) | (df["email"].str.split("@").str[1] != 'gmail.com')]
                  email
1  real.email@gmail.com
2    no.email@email.com
3        real@yahoo.com

Your claim that this "will filter everything that isn't @gmail.com" is not correct: you just need to get the boolean logic right, as above. Also, to measure the length of each string in a dataframe column, use .str.len(). Calling len() on the whole Series returns the number of rows, not the length of each string.
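
The difference between the two length calls can be seen in a short sketch using the sample data from the question (the names `usernames`, `n_rows`, and `lengths` are illustrative):

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"email": [
    "a@gmail.com",
    "real.email@gmail.com",
    "no.email@email.com",
    "real@yahoo.com",
    "poo@gmail.com",
]})

usernames = df["email"].str.split("@").str[0]

# len() on a Series returns a single integer: the number of rows.
n_rows = len(usernames)

# .str.len() returns a Series with one length per string.
lengths = usernames.str.len()

print(n_rows)
print(lengths.tolist())
```

Only the second form yields an element-wise result that can be compared with `>= 6` to build a boolean mask.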

Upvotes: 1
