Reputation: 304
I have the following dataframe:
import pandas as pd
df = pd.DataFrame()
df['full_string'] = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
['ApplesandBananas', 'apples and Bananas', 'bananas']]
df['substring'] = ['apples and bananas', 'apples and bananas']
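The desired outcome is to keep the items in df['full_string'] which contain text that is found in df['substring'], while taking into account differences in upper/lower case, differences in spacing, and the presence of other text around the match.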
Desired outcome:
df['outcome'] = [['apples and bananas', 'applesandbananasamongstothers'],
['ApplesandBananas', 'apples and Bananas', 'bananas']]
What I have tried is to get the first keyword of df['substring'] to use that as a matcher with df['full_string'], however, this did not allow me to retain the 'bananas' element in the second row of the dataframe.
(This is not working well on the dummy data):
first_keyword = []
for i in df['substring']:
    first_keyword.append(i.split(' ', 1)[0])
df['first_keyword'] = first_keyword
df['C'] = [x[0].lower() in (x[1].lower()) for x in zip(df['first_keyword'], df['full_string'])]
Upvotes: 1
Views: 781
Reputation: 5776
To simplify the example, I chose to work with a list containing your dummy data. You'll need to adapt it to your problem. Moreover, I interpret your sentence "The desired outcome is to keep the items in df['full_string'] which contain text that is found in df['substring']" as meaning that "text" is any single word of the substring (text = word).
full_str = ['apples and bananas', 'applesandbananasamongstothers', 'something else',
'ApplesandBananas', 'apples and Bananas', 'bananas']
sub_str = ['apples and bananas', 'red and blue']
# Extract words from sub strings
words_in_sub = [elt.split() for elt in sub_str]
# Flatten and remove duplicates
words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))
# Init output
output = list()
# Loop on the strings in full string
for full_s in full_str:
    # Loop on the words to look for
    for word in words_in_sub:
        if word.lower() in full_s.lower():
            output.append(full_s)
            break
Output:
In: output
Out:
['apples and bananas',
'applesandbananasamongstothers',
'ApplesandBananas',
'apples and Bananas',
'bananas']
The lower/upper case is taken care of in the if condition. The spacing and the presence of other text in full_s are both taken care of by the in operator, which returns True if the word is present anywhere in the string. The only case where it returns False although the word might be considered present is when the word is split in two by a space, for instance 'bana naan dapp les'. This example would not be kept in the output list.
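To make the behaviour of the membership test concrete, here is a small sketch. The helper names contains_word and contains_word_ignoring_spaces are hypothetical and not part of the code above; the second one is only one possible way to handle the space-split case, assuming you are happy to drop all whitespace before matching.

```python
def contains_word(word, text):
    # Case-insensitive substring test, same check as in the loop above
    return word.lower() in text.lower()


def contains_word_ignoring_spaces(word, text):
    # Hypothetical variant: remove all whitespace from text before testing
    squeezed = "".join(text.split())
    return word.lower() in squeezed.lower()


# Case differences and surrounding text do not matter
print(contains_word("apples", "ApplesandBananas"))

# A word split by a space is not found by the plain test...
print(contains_word("apples", "bana naan dapp les"))

# ...but is found once the whitespace is removed
print(contains_word_ignoring_spaces("apples", "bana naan dapp les"))
```

Whether removing whitespace is desirable depends on your data, since it can also create matches across word boundaries that were never intended.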
EDIT: With multiple rows. You could also just flatten the list and use the first code.
full_str = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
['ApplesandBananas', 'apples and Bananas', 'bananas']]
sub_str = [['apples and bananas'], ['apples and bananas']]
# Assuming same number of rows between full_str and sub_str
# And you want to keep element of full_str[k] according to sub strings in sub_str[k]
number_of_rows = len(full_str)
# Collect one filtered list per row
outputs = list()
for k in range(number_of_rows):
    # Extract words from sub strings
    words_in_sub = [elt.split() for elt in sub_str[k]]
    # Flatten and remove duplicates
    words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))
    # Init output for this row
    output = list()
    # Loop on the strings in full string
    for full_s in full_str[k]:
        # Loop on the words to look for
        for word in words_in_sub:
            if word.lower() in full_s.lower():
                output.append(full_s)
                break
    # Store the result for row k
    outputs.append(output)
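If you want to apply the same idea directly to the dataframe from the question, a minimal sketch could look like the following. It assumes the same word-level interpretation as above (text = word), and the helper name keep_matching is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "full_string": [
        ["apples and bananas", "applesandbananasamongstothers", "something else"],
        ["ApplesandBananas", "apples and Bananas", "bananas"],
    ],
    "substring": ["apples and bananas", "apples and bananas"],
})


def keep_matching(full_strings, substring):
    # Lower-cased unique words of the substring for this row
    words = {w.lower() for w in substring.split()}
    # Keep every string that contains at least one of those words
    return [s for s in full_strings if any(w in s.lower() for w in words)]


# Build one filtered list per row, pairing each row's list with its substring
df["outcome"] = [
    keep_matching(full, sub)
    for full, sub in zip(df["full_string"], df["substring"])
]

print(df["outcome"].tolist())
```

Note that short words such as 'and' are matched too, so any string containing "and" anywhere is kept. You may want to filter such words out of the set first, depending on your data.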
Upvotes: 1