Reputation: 309
I have a string of text that looks like this:
'\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s19,301\s\s\s\s\s\s\s\s\s14,856\s\s\s\s\s\s\s\s18,554'
Where \s is a space.
I'm trying to split it on the white space, but I need to retain all of the white space as an item in the new list. Like this:
[' ', '19,301',' ', '14,856', ' ', '18,554']
I have been using the following code:
re.split(r'( +)(?=[0-9])', item)
and it returns:
['', ' ', '19,301', ' ', '14,856', ' ', '18,554']
Notice that it always adds an empty element to the beginning of my list. It's easy enough to drop it, but I'm really looking to understand what is going on here, so I can get the code to treat things consistently. Thanks.
Upvotes: 5
Views: 2902
Reputation: 463
When using re.split, if the separator matches at the start of the string, the "result will start with an empty string" (the documentation notes this for patterns with capturing groups, as in your case). The reason for this is so that the join method can behave as the inverse of the split method.
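To make the rule concrete, here is a small sketch using the pattern from the question on two short, made-up inputs (the sample strings are illustrative, not the asker's data). The empty string only appears when the separator matches at index 0, because it stands for the text "before" the first separator:

```python
import re

pattern = r'( +)(?=[0-9])'  # pattern from the question

# The separator matches at index 0, so the text before it is the empty string.
with_leading = re.split(pattern, '  19,301   14,856')
print(with_leading)     # ['', '  ', '19,301', '   ', '14,856']

# No separator at the start, so the first item is ordinary text.
without_leading = re.split(pattern, '19,301   14,856')
print(without_leading)  # ['19,301', '   ', '14,856']
```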
It might not make a lot of sense in your case, where the separator matches vary in length, but consider the case where the separator is a | character and you want to join the pieces back together. With the extra empty string, it works:
>>> import re
>>> item = '|19,301|14,856|18,554'
>>> items = re.split(r'\|', item)
>>> print(items)
['', '19,301', '14,856', '18,554']
>>> '|'.join(items)
'|19,301|14,856|18,554'
But without it, the initial pipe would be missing:
>>> items = ['19,301', '14,856', '18,554']
>>> '|'.join(items)
'19,301|14,856|18,554'
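The same round-trip idea carries over to the pattern in the question. Since the separators are captured and therefore kept in the result, joining on the empty string should rebuild the input. The sketch below uses a shortened, made-up input and shows one possible way to drop the leading empty string afterwards; filtering out empty items is my assumption about what "dropping it" should mean, not something the question specifies:

```python
import re

item = '   19,301  14,856 18,554'  # shortened sample input (illustrative)
parts = re.split(r'( +)(?=[0-9])', item)
print(parts)   # ['', '   ', '19,301', '  ', '14,856', ' ', '18,554']

# The captured separators are part of the list, so '' is the right joiner.
assert ''.join(parts) == item

# Drop empty strings wherever they occur, whether or not the input
# starts with spaces.
tokens = [p for p in parts if p]
print(tokens)  # ['   ', '19,301', '  ', '14,856', ' ', '18,554']
```

Filtering this way treats inputs with and without leading spaces the same, which avoids special-casing the first element.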
Upvotes: 4
Reputation: 87084
You can do it with re.findall():
>>> s = '\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s19,301\s\s\s\s\s\s\s\s\s14,856\s\s\s\s\s\s\s\s18,554'.replace('\\s',' ')
>>> re.findall(r' +|[^ ]+', s)
[' ', '19,301', ' ', '14,856', ' ', '18,554']
You said "space" in the question, so the pattern above matches only the space character. To match runs of any whitespace character, you can use:
>>> re.findall(r'\s+|\S+', s)
[' ', '19,301', ' ', '14,856', ' ', '18,554']
The pattern matches either one or more whitespace characters or one or more non-whitespace characters, for example:
>>> s=' \t\t ab\ncd\tef g '
>>> re.findall(r'\s+|\S+', s)
[' \t\t ', 'ab', '\n', 'cd', '\t', 'ef', ' ', 'g', ' ']
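If this is needed in more than one place, the findall approach can be wrapped in a small helper. The function name `split_keep_whitespace` and the sample strings are hypothetical, chosen only for illustration. The loop checks two properties that follow from the pattern: no empty items are produced, and joining the pieces restores the original string:

```python
import re

def split_keep_whitespace(text):
    """Return the runs of whitespace and non-whitespace in text, in order."""
    return re.findall(r'\s+|\S+', text)

samples = ['  19,301   14,856', '19,301 14,856  ', '']
for s in samples:
    tokens = split_keep_whitespace(s)
    assert '' not in tokens       # never yields an empty item
    assert ''.join(tokens) == s   # nothing is lost or reordered

print(split_keep_whitespace('  19,301   14,856'))
# ['  ', '19,301', '   ', '14,856']
```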
Upvotes: 3