kitchenprinzessin

Reputation: 1043

Python: Splitting a sentence based on several tokens

I want to split a sentence based on several keywords:

import re
p = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
re.split(p, '10-methyl-Hexadecanoic acid of total fatty acids')

This outputs:

['10-methyl-Hexadecanoic acid', 'of', 'total fatty acids']

Expected output: ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']

I am not sure why the regular expression does not split on the token 'total'.

Upvotes: 2

Views: 72

Answers (2)

vahvero

Reputation: 663

By string slicing:

def search(string, search_terms):
    # Init
    ret = []
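    # Find the first occurrence of each term.
    # Repeated terms are not found; employ count() for that.
    # Note: find() matches substrings, so 'of' would also match inside a longer word.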
    for term in search_terms:
        found = string.find(term)
        # Not found
        if found < 0:
            continue
        # Add index of found and length of term
        ret.append((found, len(term),))

    # No term was found: return the string unsplit
    if not ret:
        return [string]

    # Sort by index
    ret.sort(key=lambda x: x[0])

    # Init results list
    end = []
    # Do first found as it is special
    generator = iter(ret)
    ind, length = next(generator)
    # End index of match
    end_index = ind + length
    # Add both to results list
    end.append(string[:ind])
    end.append(string[ind:end_index])

    # Do for all other results
    for ind, length in generator:
        end.append(string[end_index:ind])
        end_index = ind + length
        end.append(string[ind:end_index])
    # Add rest of the string to results
    end.append(string[end_index:])
    return end

# Initialize
search_terms = ("standard", "of", "total", "sum")
string = '10-methyl-Hexadecanoic acid of total fatty acids'

print(search(string, search_terms))
# ['10-methyl-Hexadecanoic acid ', 'of', ' ', 'total', ' fatty acids']

The surrounding whitespace can easily be stripped if necessary.
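One possible way to do that cleanup, as a minimal sketch: strip each piece and drop the ones that are empty afterwards. The list `parts` below is just the output shown above, and `cleaned` is an illustrative name.

```python
# Output of search() from the example above
parts = ['10-methyl-Hexadecanoic acid ', 'of', ' ', 'total', ' fatty acids']

# Strip each piece and discard pieces that contain only whitespace
cleaned = [p.strip() for p in parts if p.strip()]

print(cleaned)
# ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
```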

Upvotes: 0

Wiktor Stribiżew

Reputation: 626870

You can use lookarounds, which check the surrounding whitespace without consuming it:

import re
p = r'(?<!\S)(standard|of|total|sum)(?!\S)'
s = '10-methyl-Hexadecanoic acid of total fatty acids'
print([x.strip() for x in re.split(p,s) if x.strip()])
# => ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']


Details

  • (?<!\S)(standard|of|total|sum)(?!\S) matches any of the listed words and captures it into Group 1, but only when the word is surrounded by whitespace or sits at the start/end of the string. Because the word is captured, re.split keeps it in the result.
  • The list comprehension drops blank items (if x.strip()), and x.strip() trims whitespace from each remaining item.
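As for why the pattern in the question misses 'total': its (?:\s|$) part consumes the space after 'of', so when matching resumes at 'total' there is no whitespace left for (?:^|\s) to match. The sketch below compares what each pattern actually matches; the variable names are illustrative only.

```python
import re

s = '10-methyl-Hexadecanoic acid of total fatty acids'

# Pattern from the question: the surrounding whitespace is part of the match
consuming = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
# Lookaround version: whitespace is only checked, never consumed
lookaround = r'(?<!\S)(standard|of|total|sum)(?!\S)'

# group(0) is the full text each match consumed
consuming_matches = [m.group(0) for m in re.finditer(consuming, s)]
lookaround_matches = [m.group(0) for m in re.finditer(lookaround, s)]

print(consuming_matches)   # [' of ']  (the space before 'total' is already used up)
print(lookaround_matches)  # ['of', 'total']
```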

Upvotes: 3
