vish4071
vish4071

Reputation: 5277

Pandas filter dataframe columns through substring match

I have a dataframe with multiple columns, eg:

     Name  Age   Fname
0    Alex   10   Alice
1     Bob   12     Bob
2  Clarke   13  clarke

My filter condition is to check if Name is (case-insensitive) substring of corresponding Fname.

If it was equality, something as simple as:

df[df["Name"].str.lower() == df["Fname"].str.lower()]

works. However, I want substring match, so instead of ==, I thought in would work. But that gives error as it interprets one of the arguments as pd.Series. My 1st question is Why this difference in interpretation?

Another way I tried was using .str.contains:

df[df["Fname"].str.contains(df["Name"], case=False)]

which also interprets df["Name"] as pd.Series, and of course, works for some const string in the argument.

eg. this works:
df[df["Fname"].str.contains("a", case=False)]

I want to resolve this situation, so any help in that regard is appreciated.

Upvotes: 0

Views: 1529

Answers (4)

sammywemmy
sammywemmy

Reputation: 28729

Does this other option via a list comprehension work for you:

df.loc[[left.lower() in right.lower() 
        for left, right 
        in zip(df.Name, df.Fname)]
       ]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153510

The .str accessor is extremely loopy and slow. It is best most of the times using list comprehension.

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
import pandas.testing as pt

def list_comprehension_lower(df):
    return df[[len(set(i)) == 1 for i in (zip([x.lower() for x in df['Name']],[y.lower() for y in df['Fname']]))]]

def apply_axis_1_lower(df):
    return df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

def dot_string_lower(df):
    return df[df["Name"].str.lower() == df["Fname"].str.lower()]

fig, ax = plt.subplots()
res = pd.DataFrame(
    index=[1, 5, 10, 30, 50, 100, 300, 500, 700, 1000, 10000],
    columns='list_comprehension_lower apply_axis_1_lower dot_string_lower'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit.timeit(stmt, setp, number=100)

res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True, ax=ax);

Output:

enter image description here

Now, back you your original question, You can use list_comprehension with zip and in:

df.loc[2, 'Fname'] += ' Adams'

df[[x in y for x, y in zip([x.lower() for x in df['Name']],[y.lower() for y in df['Fname']])]]

Output:

     Name  Age         Fname
1     Bob   12           Bob
2  Clarke   13  clarke Adams

Upvotes: 2

user17242583
user17242583

Reputation:

You could use .apply() with axis=1 to call a function for each row:

subset = df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

Output:

>>> subset
     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

Upvotes: 1

Corralien
Corralien

Reputation: 120549

You can iterate over index axis:

>>> df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

str.contains takes a constant in first argument pat not a Series.

Upvotes: 1

Related Questions