Reputation: 31
I am reading a line from a file and want to split it into words using re.split, where words are delimited either by non-alphanumeric ASCII characters or by a break tag (<br>). I am having trouble working out the correct pattern. The code below yields:
split = re.split(r'(<br>)|(\W+)', 'I code<br>A project.')
split = ['', None, 'I', '', None, 'code', '', None, '<', '', None, 'br',
'',None, '>', '', None, 'A', '', None, 'project.']
I believed the pattern above would recognize a break tag or a non-alphanumeric character, but clearly it is incorrect. I am having trouble understanding regex, so any help fixing this would be appreciated. I would like the result to look like this once it is split properly:
split = ['I', 'code', 'A', 'project']
Upvotes: 3
Views: 54
Reputation: 108507
You don't need the group syntax ():
>>> re.split(r'<br>|\W+', 'I code<br>A project.')
['I', 'code', 'A', 'project', '']
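The trailing '' appears because the string ends with a delimiter (the final period), so re.split reports the empty text after it. The capture groups in the original pattern matter because re.split also returns whatever each group matched, with None for a group that did not take part in the match. Below is a minimal sketch of one way to get exactly the list asked for, assuming empty strings should simply be discarded. The helper name split_words is made up for illustration.

```python
import re

def split_words(line):
    # Hypothetical helper, not part of the re module.
    # Split on a literal <br> or on runs of non-word characters.
    # There are no capture groups, so the delimiters are not returned.
    parts = re.split(r'<br>|\W+', line)
    # Drop the empty strings produced when the line starts or ends
    # with a delimiter.
    return [p for p in parts if p]

print(split_words('I code<br>A project.'))
# ['I', 'code', 'A', 'project']
```

If grouping is ever needed for the alternation itself, a non-capturing group such as (?:<br>|\W+) keeps the delimiters out of the result. Note also that \W treats the underscore as a word character, and on str patterns in Python 3 it is Unicode-aware unless flags=re.ASCII is passed.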
Upvotes: 1