DrakeMurdoch

Reputation: 859

Filter by condition if string contains certain substring

I have a dataframe full of emails. Knowing that Gmail usernames have a six-character minimum, I want to filter my dataframe by removing any Gmail address whose username has fewer than six characters. Therefore, the dataframe df

>> print(df)

        email          
1   [email protected]             
2   [email protected]      
3   [email protected]        
4   [email protected]              
5   [email protected]              

would become:

        email                     
2   [email protected]      
3   [email protected]        
4   [email protected]              

Using

df = df[
        (len(df['email'].str.split('@').str[0]) >= 6) &
        (df['email'].str.split('@').str[1] == 'gmail.com')
       ]

will filter out everything that isn't @gmail.com, so I can't use that. What I want is essentially the following (which obviously doesn't work and raises TypeError: 'method' object is not subscriptable):

if df['email'].str.split['@'].str[1] == 'gmail.com':
    len(df['email'].str.split['@'].str[0]) >= 6

How do I accomplish this in a vectorized operation?

Upvotes: 0

Views: 1127

Answers (4)

Gary

Reputation: 899

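One way is to collect the index labels of the rows to keep in a list and then select just those rows:

keep = []
for idx, email in df['email'].items():
    user, _, domain = email.partition('@')
    # keep non-Gmail addresses, and Gmail addresses with at least 6 characters before the "@"
    if domain != 'gmail.com' or len(user) >= 6:
        keep.append(idx)

df.loc[keep]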

Output:

                  email
1  [email protected]
2    [email protected]
3        [email protected]

Upvotes: 0

Georgina Skibinski

Reputation: 13387

You can do:

df=df.loc[~df.email.str.contains(r"^.{0,5}@gmail\.com$")]

Outputs:

                  email
1  [email protected]
2    [email protected]
3        [email protected]
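
For illustration, here is a minimal sketch of how that pattern behaves. The sample addresses are made up (an assumption, since the original frame isn't reproduced here). The pattern `^.{0,5}@gmail\.com$` matches zero to five characters followed by `@gmail.com`, and `~` inverts the mask so that those rows are dropped:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame(
    {"email": ["abc@gmail.com", "abcdef@gmail.com", "ab@yahoo.com", "longname@gmail.com"]},
    index=[1, 2, 3, 4],
)

# True where the username has at most 5 characters and the domain is gmail.com
short_gmail = df["email"].str.contains(r"^.{0,5}@gmail\.com$")

# Drop those rows, keep everything else (including short non-Gmail usernames)
filtered = df.loc[~short_gmail]
print(filtered)
```

Only the short Gmail address is removed; the short `yahoo.com` address survives because the pattern requires the Gmail domain.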

Upvotes: 1

anky

Reputation: 75080

You can use:

a = df['email'].str.endswith('@gmail.com')  # True for Gmail addresses
b = df['email'].str.split('@').str[0].str.len().ge(6)  # part before "@" has at least 6 characters
out = df[(a & b) | ~a]  # keep long-enough Gmail addresses and every non-Gmail address

print(out)

                  email
2  [email protected]
3    [email protected]
4        [email protected]
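
A side note on the combined mask, sketched with hypothetical boolean Series rather than real addresses: `&` binds tighter than `|`, so "a AND b, OR NOT a" needs no extra parentheses, and it reduces to "b OR NOT a". The check below covers all four combinations:

```python
import pandas as pd

# Hypothetical boolean masks covering all four combinations
a = pd.Series([True, True, False, False])  # "is a Gmail address"
b = pd.Series([True, False, True, False])  # "username is long enough"

mask = a & b | ~a    # parsed as (a & b) | (~a)
simplified = b | ~a  # same truth table

print(mask.equals(simplified))
```

Only the row where `a` is true and `b` is false (a Gmail address with a short username) is dropped.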

Upvotes: 2

adrtam

Reputation: 7211

See this:

>>> df[(df["email"].str.split("@").str[0].str.len() >= 6) | (df["email"].str.split("@").str[1] != 'gmail.com')]
                  email
1  [email protected]
2    [email protected]
3        [email protected]

Regarding your statement that this approach "will filter everything that isn't @gmail.com": it doesn't have to. You just need to get the boolean logic right, as above: keep a row if the username has at least six characters or the domain is not gmail.com. Also, to measure the length of each string in a dataframe column, use .str.len(). Calling len() on the column returns the number of rows, not the length of each string.
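
A small sketch of both points, using made-up addresses (the data and variable names are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"email": ["abc@gmail.com", "abcdef@gmail.com", "ab@yahoo.com"]})

parts = df["email"].str.split("@")
user, domain = parts.str[0], parts.str[1]

n_rows = len(user)             # a single int: the number of rows
user_lengths = user.str.len()  # a Series: one length per address

# Keep a row if the username is long enough OR the address is not Gmail
keep = (user_lengths >= 6) | (domain != "gmail.com")
result = df[keep]
print(result)
```

Here `len(user)` is 3 regardless of the addresses, while `user.str.len()` yields 3, 6 and 2, which is what the comparison needs.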

Upvotes: 1
