patri

Reputation: 343

Split string on punctuation or number in Python

I'm trying to split a string every time I encounter a punctuation mark or a number, for example:

toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^&*()]', ' \1',toSplit).split()

The desired output would be:

['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']

However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.

Any clarification would be greatly appreciated.

Upvotes: 3

Views: 1396

Answers (3)

Sunitha

Reputation: 12025

Use re.split with a capturing group to split wherever a run of letters is found:

>>> import re
>>> toSplit = 'I2eat!Apples22becauseilike?Them'
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>>
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
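
As a possible variation on the snippet above, the empty strings at both ends can be dropped with a list comprehension instead of the join/split round trip. This is only a sketch; the names `parts` and `tokens` are illustrative. Unlike the join/split version, it leaves any whitespace in the input untouched.

```python
import re

toSplit = 'I2eat!Apples22becauseilike?Them'

# The capturing group makes re.split keep the letter runs in the result.
parts = re.split(r'([A-Za-z]+)', toSplit)

# Drop the empty strings produced when the input starts or ends with a match.
tokens = [p for p in parts if p]
print(tokens)
```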

Upvotes: 1

Wiktor Stribiżew

Reputation: 627607

You can tokenize strings like yours into digits, letters, and other characters (anything that is not whitespace, a letter, or a digit) using

re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)

Here,

  • \d+ - 1+ digits
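  • (?:[^\w\s]|_)+ - 1+ chars that are either neither word nor whitespace chars, or _ (that is, punctuation and symbols, underscore included)
  • [^\W\d_]+ - 1+ Unicode letters (word chars that are neither digits nor _).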
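
To see which alternative produced each token, the three branches described above can be wrapped in named groups. This is a rough sketch, not part of the original pattern: the group names `digits`, `punct`, `letters` and the helper `tokenize` are made up for illustration.

```python
import re

# Same three alternatives as above, each wrapped in a (hypothetical) named group.
TOKEN_RE = re.compile(
    r'(?P<digits>\d+)'
    r'|(?P<punct>(?:[^\w\s]|_)+)'
    r'|(?P<letters>[^\W\d_]+)'
)

def tokenize(text):
    # m.lastgroup is the name of the group that matched this token.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

tokens = tokenize('I2eat!Apples22becauseilike?Them')
for kind, value in tokens:
    print(kind, value)
```

Whitespace matches none of the alternatives, so it is skipped rather than returned as a token.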


A matching approach is more flexible than splitting, as it also allows tokenizing more complex structures. Say you also want to keep decimal numbers (floats, doubles, ...) as single tokens. You just need to use \d+(?:\.\d+)? instead of \d+:

re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit) 
             ^^^^^^^^^^^^^
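
A quick side-by-side comparison may make the difference clearer. The sample string `sample` below is an assumption chosen for illustration, since the string from the question contains no decimal numbers.

```python
import re

sample = 'pi=3.14,e=2.718!'  # hypothetical input containing decimals

# Plain \d+ breaks each decimal into integer part, dot, and fraction.
plain = re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', sample)

# The optional (?:\.\d+)? group keeps each decimal together as one token.
decimal = re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', sample)

print(plain)
print(decimal)
```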


Upvotes: 3

Chris

Reputation: 29752

Use re.split with a capturing group, which keeps the delimiters in the result:

import re

toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!@#$%^&*()])', toSplit)
result

Output:

['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']

If you want runs of consecutive digits or punctuation marks kept together as a single token, add +:

result = re.split('([0-9,.?:;~!@#$%^&*()]+)', toSplit)
result

Output:

['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
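
For comparison, the re.sub approach from the question can probably be made to work with the same capturing group. In a regular string literal '\1' is the control character \x01 rather than a backreference, and the question's pattern has no group for a backreference to refer to. A minimal sketch, assuming a raw-string replacement and a space on both sides of the match (the name `pattern` is illustrative):

```python
import re

toSplit = 'I2eat!Apples22becauseilike?Them'

# Capturing group plus + so that runs like '22' stay together.
pattern = r'([0-9,.?:;~!@#$%^&*()]+)'

# r' \1 ' is a real backreference: pad each match with spaces, then split on whitespace.
result = re.sub(pattern, r' \1 ', toSplit).split()
print(result)
```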

Upvotes: 4
