Reputation: 1043
I want to split a sentence based on several keywords:
import re
p = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
re.split(p, '10-methyl-Hexadecanoic acid of total fatty acids')
This outputs:
['10-methyl-Hexadecanoic acid', 'of', 'total fatty acids']
Expected output: ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
I am not sure why the regular expression does not split on the token 'total'.
Upvotes: 2
Views: 72
Reputation: 663
By string slicing:
def search(string, search_terms):
    # Collect an (index, length) pair for every term that occurs
    ret = []
    # Find all terms
    # Only the first occurrence of each term is found; employ count() for duplicates
    for term in search_terms:
        found = string.find(term)
        # Term not found
        if found < 0:
            continue
        # Add index of found and length of term
        ret.append((found, len(term)))
    # No term found at all
    if not ret:
        return [string]
    # Sort by index
    ret.sort(key=lambda x: x[0])
    # Init results list
    end = []
    # Do first found as it is special
    generator = iter(ret)
    ind, length = next(generator)
    # End index of match
    end_index = ind + length
    # Add the text before the match and the match itself to the results list
    end.append(string[:ind])
    end.append(string[ind:end_index])
    # Do for all other results
    for ind, length in generator:
        end.append(string[end_index:ind])
        end_index = ind + length
        end.append(string[ind:end_index])
    # Add rest of the string to results
    end.append(string[end_index:])
    return end
# Initialize
search_terms = ("standard", "of", "total", "sum")
string = '10-methyl-Hexadecanoic acid of total fatty acids'
print(search(string, search_terms))
# ['10-methyl-Hexadecanoic acid ', 'of', ' ', 'total', ' fatty acids']
The surrounding whitespace can easily be stripped from each item if necessary.
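A minimal sketch of that clean-up step, assuming the list printed above is the input. The names `parts` and `cleaned` are illustrative only and are not part of the function above.

```python
# Output of search() for the example sentence, copied here so the
# snippet runs on its own.
parts = ['10-methyl-Hexadecanoic acid ', 'of', ' ', 'total', ' fatty acids']

# Strip each piece and drop the pieces that contained only whitespace.
cleaned = [part.strip() for part in parts if part.strip()]

print(cleaned)
# ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
```

Note that `str.find` matches substrings, so a term such as "of" would also be found inside a longer word; this sketch does not address that.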
Upvotes: 0
Reputation: 626870
You may use
import re
p = r'(?<!\S)(standard|of|total|sum)(?!\S)'
s = '10-methyl-Hexadecanoic acid of total fatty acids'
print([x.strip() for x in re.split(p,s) if x.strip()])
# => ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
See the Python demo
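As a rough illustration of why the pattern from the question skips 'total': its non-capturing groups consume the whitespace on both sides of a keyword, so the space after 'of' is already used up when the engine reaches 'total'. Lookarounds only check the neighbouring characters without consuming them. The variable names below are illustrative.

```python
import re

s = '10-methyl-Hexadecanoic acid of total fatty acids'

# Pattern from the question: the surrounding whitespace is part of the match.
consuming = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
# Lookaround version: the surrounding whitespace is checked but not consumed.
lookaround = r'(?<!\S)(standard|of|total|sum)(?!\S)'

consuming_hits = [m.group(1) for m in re.finditer(consuming, s)]
lookaround_hits = [m.group(1) for m in re.finditer(lookaround, s)]

print(consuming_hits)   # ['of']
print(lookaround_hits)  # ['of', 'total']
```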
Details
The pattern (?<!\S)(standard|of|total|sum)(?!\S) matches and captures into Group 1 any of the listed words when it is enclosed by whitespace or sits at the start or end of the string. The list comprehension discards blank items (if x.strip()), and x.strip() trims whitespace from each non-blank item.
Upvotes: 3