Reputation: 21
I have been stuck on what I considered originally to be a simple task for quite some while. Here I will use sample data as the actual problem data is much messier (and confidential). Essentially I have two columns both containing strings. I want to check for each row of column 'substring', if it is a substring of any of the rows of column 'string':
s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you','where are you', 'goodbye']
test = pd.DataFrame({'substring':s1, 'string':s2})
>>> test
string substring
0 exit good
1 hello you how
2 where are you hello
3 goodbye start
Essentially I would like some indicator for each row if column A if it is a substring of anywhere in column B:
>>>test
string substring C
0 exit good True
1 hello you how False
2 where are you hello True
3 goodbye start False
I have seemed to tried many things and I have just become lost.
I have tried iterating over the rows:
sub_test = pd.DataFrame(columns=test.columns)
for index, row in test.iterrows():
a = row['substring']
delta = test[test['string'].str.contains(a)]
if len(delta.index > 1):
sub_test = pd.concat([sub_test, delta])
Which gets me some of the way and returns:
>>>sub_test
string substring
3 goodbye start
1 hello you how
I would think there is a way of doing this using lambda but I have not been successful:
test['C'] = test.apply(lambda row: row['substring'] in policies['substring'], axis = 1)
Any help would be appreciated. Thanks
Upvotes: 2
Views: 4285
Reputation: 59579
Form one big pattern that we use to extract all substrings. Then we use an isin
check to see if the substring
matched anywhere.
p = '('+'|'.join(test.substring)+')'
test['C'] = test['substring'].isin(test['string'].str.extractall(p)[0].unique())
substring string C
0 good exit True
1 how hello good you False
2 hello where are you True
3 start goodbye False
This works by str.extractall
returning a DataFrame with matches.
test['string'].str.extractall(p)
0
match
1 0 hello
3 0 good
The index is related to test
's index, not important here, with another level indicating the match number (since we use .extractall
). The value is the substring that was matched. Since our capturing group contained specific words (not a general pattern), we can use an equality check (isin
) to get the mask for the 'substring'
values.
Upvotes: 1