remove item from list if it does not match substring, regardless of the formatting

Question

I have the following dataframe:

import pandas as pd

df = pd.DataFrame()
df['full_string'] = [['apples and bananas', 'applesandbananasamongstothers', 'something else'], 
          ['ApplesandBananas', 'apples and Bananas', 'bananas']]
df['substring'] = ['apples and bananas', 'apples and bananas']

The desired outcome is to keep the items in df['full_string'] which contain text that is found in df['substring'], while taking into account that:

casing should not matter (uppercase or lowercase)
the spacing between the words should not matter
the items may contain other text that is unrelated to the text in df['substring']

Desired outcome:

df['outcome'] = [['apples and bananas', 'applesandbananasamongstothers'], 
      ['ApplesandBananas', 'apples and Bananas', 'bananas']]

I tried taking the first keyword of df['substring'] and matching it against df['full_string']. However, this did not let me retain the 'bananas' element in the second row of the dataframe.

(This is not working well on the dummy data):

first_keyword = []
for i in df['substring']:
    first_keyword.append(i.split(' ', 1)[0])

df['first_keyword'] = first_keyword

df['C'] = [x[0].lower() in (x[1].lower()) for x in zip(df['first_keyword'], df['full_string'])]

Mathieu · Accepted Answer

To simplify the example, I chose to work with plain lists containing your dummy data; you will need to adapt it to your dataframe. Moreover, I interpret your sentence "The desired outcome is to keep the items in df['full_string'] which contain text that is found in df['substring']" as text = word, i.e. an item is kept if it contains at least one word of the substring.

full_str = ['apples and bananas', 'applesandbananasamongstothers', 'something else', 
           'ApplesandBananas', 'apples and Bananas', 'bananas']
sub_str = ['apples and bananas', 'red and blue']

# Extract words from sub strings
words_in_sub = [elt.split() for elt in sub_str]
# Flatten and remove duplicates
words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))

# Init output
output = list()
# Loop on the strings in full string
for full_s in full_str:
    # Loop on the words to look for
    for word in words_in_sub:
        if word.lower() in full_s.lower():
            output.append(full_s)
            break

Output:

In: output
Out: 
['apples and bananas',
 'applesandbananasamongstothers',
 'ApplesandBananas',
 'apples and Bananas',
 'bananas']

The case is handled in the if condition by lowercasing both sides. Spacing between words and unrelated extra text in full_s are both handled by the in operator, which returns True if the word appears anywhere in the string. Because the match is done word by word, the only case where it returns False although the word might be considered present is when the word itself is split in two by a space, for instance 'bana naan dapp les'. That example would not be kept in the output list.
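If strings with a word split by a space should also be kept, one possible variation is to strip whitespace from the candidate string before the membership test. The following is only a sketch under that assumption; keep_matching and its ignore_spaces flag are hypothetical names and not part of the answer above.

```python
import re

def keep_matching(strings, phrases, ignore_spaces=False):
    """Keep the strings that contain at least one word taken from phrases."""
    # Unique lowercase words across all phrases
    words = {word.lower() for phrase in phrases for word in phrase.split()}
    kept = []
    for s in strings:
        haystack = s.lower()
        if ignore_spaces:
            # Drop all whitespace so a word split by a space can still match
            haystack = re.sub(r"\s+", "", haystack)
        if any(word in haystack for word in words):
            kept.append(s)
    return kept

candidates = ['applesandbananasamongstothers', 'bana naan dapp les', 'something else']
default_result = keep_matching(candidates, ['apples and bananas'])
spaceless_result = keep_matching(candidates, ['apples and bananas'], ignore_spaces=True)
print(default_result)
print(spaceless_result)
```

Note that removing whitespace can also create matches across word boundaries (for example 'pan dry' becomes 'pandry', which contains 'and'), so whether this is desirable depends on the data.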

EDIT: With multiple rows, apply the same logic row by row. You could also just flatten the lists and reuse the first snippet, if you do not need to keep the rows separate.

full_str = [['apples and bananas', 'applesandbananasamongstothers', 'something else'], 
            ['ApplesandBananas', 'apples and Bananas', 'bananas']]
sub_str = [['apples and bananas'], ['apples and bananas']]

# Assuming full_str and sub_str have the same number of rows,
# and that the elements of full_str[k] are filtered by the substrings in sub_str[k]
number_of_rows = len(full_str)
# One filtered list per row
outputs = list()
for k in range(number_of_rows):
    # Extract words from sub strings
    words_in_sub = [elt.split() for elt in sub_str[k]]
    # Flatten and remove duplicates
    words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))

    # Init output for this row
    output = list()
    # Loop on the strings in full string
    for full_s in full_str[k]:
        # Loop on the words to look for
        for word in words_in_sub:
            if word.lower() in full_s.lower():
                output.append(full_s)
                break
    # Store the result for row k
    outputs.append(output)
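To carry this over to the dataframe from the question, the per-row logic can be wrapped in a small function and applied to each (full_string, substring) pair. This is a minimal sketch assuming each substring cell holds a single string, as in the question; filter_row is a hypothetical helper name.

```python
def filter_row(strings, phrase):
    """Keep the strings containing at least one word of phrase (case-insensitive)."""
    words = {word.lower() for word in phrase.split()}
    return [s for s in strings if any(word in s.lower() for word in words)]

# Same dummy data as in the question, as plain lists (one entry per row)
full_string = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
               ['ApplesandBananas', 'apples and Bananas', 'bananas']]
substring = ['apples and bananas', 'apples and bananas']

outcome = [filter_row(strings, phrase) for strings, phrase in zip(full_string, substring)]
print(outcome)

# With the question's dataframe, the same comprehension could be assigned directly:
# df['outcome'] = [filter_row(f, s) for f, s in zip(df['full_string'], df['substring'])]
```

On the dummy data this reproduces the desired outcome listed in the question. Be aware that short words such as 'and' match inside longer words ('sand', 'band'), so you may want to drop them from the word set on real data.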
