Python: Splitting a sentence based on several tokens

Question

I want to split a sentence based on several keywords:

import re
p = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
re.split(p, '10-methyl-Hexadecanoic acid of total fatty acids')

This outputs:

['10-methyl-Hexadecanoic acid', 'of', 'total fatty acids']

Expected output: ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']

I am not sure why the regular expression does not split on the token 'total'.

Wiktor Stribiżew · Accepted Answer

You may use

import re
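p = r'(?<!\S)(standard|of|total|sum)(?!\S)'
s = '10-methyl-Hexadecanoic acid of total fatty acids'
print([x.strip() for x in re.split(p, s) if x.strip()])
# => ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']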

Details

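(?<!\S)(standard|of|total|sum)(?!\S) will match and capture into Group 1 any of the listed words when it is enclosed with whitespace or sits at the string start/end. The lookarounds (?<!\S) and (?!\S) only check the neighboring characters and do not consume them.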
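To see why the pattern from the question misses 'total': (?:^|\s) and (?:\s|$) consume the whitespace they match, so the space after 'of' is already used up and cannot serve as the leading whitespace for 'total'. A small sketch comparing the two patterns with re.findall (the variable names here are just for illustration):

```python
import re

s = '10-methyl-Hexadecanoic acid of total fatty acids'

# Consuming version from the question: the trailing \s after "of"
# eats the space that "total" would need in front of it.
consuming = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'

# Lookaround version: whitespace is checked but not consumed,
# so adjacent keywords can both match.
lookaround = r'(?<!\S)(standard|of|total|sum)(?!\S)'

consuming_hits = re.findall(consuming, s)
lookaround_hits = re.findall(lookaround, s)

print(consuming_hits)   # ['of']
print(lookaround_hits)  # ['of', 'total']
```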


The list comprehension gets rid of blank items (if x.strip()), and x.strip() trims whitespace from each non-blank item.
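If the keyword list changes often, the same idea could be wrapped in a small function. This is only a sketch: split_on_keywords is a hypothetical helper name, not something from the re module, and re.escape is added on the assumption that keywords might contain regex metacharacters.

```python
import re

def split_on_keywords(text, keywords):
    # Hypothetical helper for illustration only.
    # re.escape protects keywords that contain characters like "." or "+".
    alternation = '|'.join(re.escape(k) for k in keywords)
    # Whitespace boundaries are asserted with lookarounds, not consumed.
    pattern = r'(?<!\S)(' + alternation + r')(?!\S)'
    # Drop blank pieces and trim the whitespace left around each chunk.
    return [x.strip() for x in re.split(pattern, text) if x.strip()]

parts = split_on_keywords(
    '10-methyl-Hexadecanoic acid of total fatty acids',
    ['standard', 'of', 'total', 'sum'],
)
print(parts)
# ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
```

Because the keywords sit inside a capturing group, re.split keeps them in the result; switching to a non-capturing group (?:...) would drop them.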

Answers (2)