Conor O'Leary
Conor O'Leary

Reputation: 21

Check if a column of a pandas dataframe contains a substring for each row of a different column?

I have been stuck on what I considered originally to be a simple task for quite some while. Here I will use sample data as the actual problem data is much messier (and confidential). Essentially I have two columns both containing strings. I want to check for each row of column 'substring', if it is a substring of any of the rows of column 'string':

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you','where are you', 'goodbye']
test = pd.DataFrame({'substring':s1, 'string':s2})
>>> test

    string           substring
0   exit             good
1   hello you        how
2   where are you    hello
3   goodbye          start

Essentially I would like some indicator for each row if column A if it is a substring of anywhere in column B:

>>>test
    string           substring   C
0   exit             good        True
1   hello you        how         False
2   where are you    hello       True
3   goodbye          start       False

I have seemed to tried many things and I have just become lost.

I have tried iterating over the rows:

sub_test = pd.DataFrame(columns=test.columns)

    for index, row in test.iterrows():
        a = row['substring']
        delta = test[test['string'].str.contains(a)]
        if len(delta.index > 1):
            sub_test = pd.concat([sub_test, delta]) 

Which gets me some of the way and returns:

>>>sub_test

    string      substring
3   goodbye     start
1   hello you   how

I would think there is a way of doing this using lambda but I have not been successful:

test['C'] = test.apply(lambda row: row['substring'] in policies['substring'], axis = 1)

Any help would be appreciated. Thanks

Upvotes: 2

Views: 4285

Answers (1)

ALollz
ALollz

Reputation: 59579

Form one big pattern that we use to extract all substrings. Then we use an isin check to see if the substring matched anywhere.

p = '('+'|'.join(test.substring)+')'
test['C'] = test['substring'].isin(test['string'].str.extractall(p)[0].unique())

  substring          string      C
0      good            exit   True
1       how  hello good you  False
2     hello   where are you   True
3     start         goodbye  False

This works by str.extractall returning a DataFrame with matches.

test['string'].str.extractall(p)

             0
  match       
1 0      hello
3 0       good

The index is related to test's index, not important here, with another level indicating the match number (since we use .extractall). The value is the substring that was matched. Since our capturing group contained specific words (not a general pattern), we can use an equality check (isin) to get the mask for the 'substring' values.

Upvotes: 1

Related Questions