Reputation: 2495

How to check if a string is in a longer string in pandas DataFrame?

I know it's quite straightforward to use df['col'].str.contains() to check if a column contains a certain substring.

What if I want to do it the other way around: check if the column's value is contained in a longer string? I did a search but couldn't find an answer. I thought this should be easy, like in pure Python where we can simply write 'a' in 'abc'.

I tried to use df.isin, but it seems it's not designed for this purpose.

Say I have a df that looks like this:

       col1      col2
0     'apple'    'one'
1     'orange'   'two'
2     'banana'   'three'

I want to query this df for rows where the value in col1 is contained in the string 'appleorangefruits'. It should return the first two rows.

Upvotes: 3

Answers (5)

Karn Kumar

Reputation: 8826

Try:

>>> df[df.col1.apply(lambda x: x in 'appleorangefruits')]
     col1 col2
0   apple  one
1  orange  two

Upvotes: 1

harpan

Reputation: 8631

You need:

longstring = 'appleorangefruits'
df.loc[df['col1'].apply(lambda x: x in longstring)]

Output:

    col1    col2
0   apple   one
1   orange  two
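
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data from the question; the names mask and result are only illustrative.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "col1": ["apple", "orange", "banana"],
    "col2": ["one", "two", "three"],
})

longstring = "appleorangefruits"

# Boolean mask: True where the col1 value is a substring of longstring
mask = df["col1"].apply(lambda x: x in longstring)

# Keep only the matching rows
result = df.loc[mask]
print(result)
```

The mask is an ordinary boolean Series, so it can be combined with other conditions using & and | before indexing.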

Upvotes: 3

Little Bobby Tables

Reputation: 4744

As apply is notoriously slow I thought I'd have a play with some other ideas.

If your "long_string" is relatively short and your DataFrame is massive, you could do something weird like this.

from random import choice

import pandas as pd

# Create a large DataFrame
df = pd.DataFrame(
    data={'test' : [choice('abcdef') for i in range(10_000_000)]}
)

long_string = 'abcdnmlopqrtuvqwertyuiop'

def get_all_substrings(input_string):
    length = len(input_string)
    return [input_string[i:j + 1] for i in range(length) for j in range(i,length)]

sub_strings = get_all_substrings(long_string)

df.test.isin(sub_strings)

This ran in about 300ms vs 2.89s for the apply(lambda a: a in 'longer string') approach from the other answers, so roughly ten times quicker.

Note: I used the get_all_substrings function from How To Get All The Contiguous Substrings Of A String In Python?
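
As a rough sketch of the same idea applied to the question's data: the variant below returns a set instead of a list, which is my own tweak to drop duplicate substrings and is not part of the original helper. Keep in mind that a string of length n has up to n * (n + 1) / 2 contiguous substrings, so this only makes sense when the long string is fairly short.

```python
import pandas as pd


def get_all_substrings(input_string):
    """Return every contiguous, non-empty substring of input_string as a set."""
    length = len(input_string)
    return {
        input_string[i:j + 1]
        for i in range(length)
        for j in range(i, length)
    }


# Sample data taken from the question
df = pd.DataFrame({
    "col1": ["apple", "orange", "banana"],
    "col2": ["one", "two", "three"],
})

long_string = "appleorangefruits"
sub_strings = get_all_substrings(long_string)

# isin does an exact-match lookup of each value against the precomputed substrings
mask = df["col1"].isin(sub_strings)
result = df[mask]
print(result)
```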

Upvotes: 4

Yifei H

Reputation: 76

You can call an apply on the column, i.e.:

df['your col'].apply(lambda a: a in 'longer string')
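
One thing to watch for: if the column has missing values, a in 'longer string' raises a TypeError, because the left operand of in must be a string when the right operand is one. A hedged sketch of one way around that, assuming you want missing or non-string values treated as "not contained" (the sample Series here is made up for illustration):

```python
import pandas as pd

# Hypothetical column with a missing value
s = pd.Series(["apple", None, "banana"])
longer = "appleorangefruits"

# Check the type first so non-string values yield False instead of raising
mask = s.apply(lambda a: isinstance(a, str) and a in longer)
print(mask)
```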

Upvotes: 4

IWHKYB

Reputation: 491

If the string you are checking against is a constant, I believe you can achieve it by using DataFrame.apply:

df.apply(lambda row: row['mycol'] in 'mystring', axis=1)
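
A small sketch comparing this row-wise form with calling apply on the column alone, assuming made-up data and the placeholder names mycol and mystring from above. Both produce the same mask; the column version just avoids building a row object for every row.

```python
import pandas as pd

# Hypothetical data; 'other' is only there to make the frame multi-column
df = pd.DataFrame({
    "mycol": ["apple", "orange", "banana"],
    "other": [1, 2, 3],
})
mystring = "appleorangefruits"

# Row-wise: the lambda receives each row as a Series
row_mask = df.apply(lambda row: row["mycol"] in mystring, axis=1)

# Column-wise: the lambda receives each value of mycol directly
col_mask = df["mycol"].apply(lambda value: value in mystring)

print(df[row_mask])
```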

Upvotes: 2
