Reputation: 1018
I have some unstructured data like this:
test1 21;
test2 22;
test3 [ 23 ];
I want to remove the unnecessary whitespace and convert each row into a two-item list. The expected output should look like this:
['test1', '21']
['test2', '22']
['test3', ['23']]
Currently I am using re.sub to remove the unnecessary whitespace:
re.sub(r"\s+", " ", z.rstrip('\n').lstrip(' ').rstrip(';')).split(' ')
This collapses each run of whitespace into a single space, which is fine. The problem is the third example: there is whitespace after the opening bracket and before the closing bracket, and I want to remove it, but the regex above does not do that.
This is the output I am currently getting:
['test1', '21']
['test2', '22']
['test3', '[', '23', ']']
You may check the example here on pythontutor.
Upvotes: 3
Views: 128
Reputation: 626893
You can use
import re
x = "test1 21"
y = " test2 22"
z = " test3 [ 23 ]"
for a in [x, y, z]:
    print(re.sub(r"(?<![^[\s])\s+|\s+(?=])", "", a.rstrip('\n').lstrip(' ').rstrip(';')).split(' '))
See the Python demo. Output:
['test1', '21']
['test2', '22']
['test3', '[23]']
Details:
(?<![^[\s])\s+ - one or more whitespace characters that are preceded by a [ char, a whitespace character, or the start of the string
| - or
\s+(?=]) - one or more whitespace characters that are followed by a ] char.
Upvotes: 1
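Note that this approach yields the string '[23]' for the third row rather than the nested list ['23'] shown in the question's expected output. Below is a minimal sketch of one way to take that last step. It assumes that items inside the brackets are separated by whitespace, and the helper name parse_row is made up for illustration:

```python
import re

# Same pattern as in the answer: drop whitespace after "[" and before "]".
TRIM_BRACKETS = re.compile(r"(?<![^[\s])\s+|\s+(?=])")

def parse_row(line):
    """Hypothetical helper: turn 'name value;' into [name, value]."""
    cleaned = TRIM_BRACKETS.sub("", line.strip().rstrip(";"))
    name, value = cleaned.split(" ", 1)
    if value.startswith("[") and value.endswith("]"):
        # Assumption: items inside the brackets are whitespace-separated.
        value = value[1:-1].split()
    return [name, value]

for row in ["test1 21;", " test2 22;", " test3 [ 23 ];"]:
    print(parse_row(row))
```

With the three sample rows this prints ['test1', '21'], ['test2', '22'] and ['test3', ['23']].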
Reputation: 785216
You may use this regex with 2 capture groups:
(\w+)\s+(\[[^]]+\]|\w+);
RegEx Details:
(\w+): Match 1+ word characters in the first capture group
\s+: Match 1+ whitespace characters
(\[[^]]+\]|\w+): Match a [...] string or a word in the second capture group
;: Match a ;
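A quick way to see what the two groups capture before any post-processing (the variable names here are only for illustration). The bracketed value still carries its inner whitespace at this stage:

```python
import re

pattern = re.compile(r'(\w+)\s+(\[[^]]+\]|\w+);')

data = """
test1 21;
test2 22;
test3 [ 23 ];
"""

# With two capture groups, findall returns one (name, value) tuple per match.
pairs = pattern.findall(data)
for name, value in pairs:
    print(repr(name), repr(value))
```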
Code:
>>> import re
>>> data = '''
... test1 21;
... test2 22;
... test3 [ 23 ];
... '''
>>> res = []
>>>
>>> for i in re.findall(r'(\w+)\s+(\[[^]]+\]|\w+);', data):
...     res.append([ i[0], eval(re.sub(r'^(\[)\s*|\s*(\])$', r'\1"\2', i[1])) if i[1].startswith('[') else i[1] ])
...
>>> print (res)
[['test1', '21'], ['test2', '22'], ['test3', ['23']]]
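If the input is not fully trusted, eval can be avoided. The following is a sketch of an alternative, not a drop-in equivalent: it assumes the bracketed items are separated by whitespace and/or commas and pulls them out with a second findall.

```python
import re

data = """
test1 21;
test2 22;
test3 [ 23 ];
"""

res = []
for name, value in re.findall(r'(\w+)\s+(\[[^]]+\]|\w+);', data):
    if value.startswith('['):
        # Assumption: items are separated by whitespace and/or commas.
        # Grab every run of characters that is not a separator or a bracket.
        value = re.findall(r'[^\s,\[\]]+', value)
    res.append([name, value])

print(res)
```

For the sample data this produces the same nested result as the eval version.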
Upvotes: 2