Reputation: 65
I have some text like this:
CustomerID:1111,
text1
CustomerID:2222,
text2
CustomerID:3333,
text3
CustomerID:4444,
text4
CustomerID:5555,
text5
Each text has multiple lines.
I want to store the customer ID and the text for each ID in tuples, e.g. (1111, text1), (2222, text2), etc.
First, I use the expression below:
re.findall('CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)
However, I only get (1111, text1), (3333, text3), (5555, text5), ...
Upvotes: 4
Views: 112
Reputation: 1506
re.findall with that pattern is not the best tool for this: each match consumes the "CustomerID:" that begins the next record, so the search resumes after that header and every other record is skipped.
A tool practically made for this is re.split. The parentheses capture the ID number, so the IDs are kept in the result while the "CustomerID:" label is dropped. A second line stitches the tokens into tuples the way you wanted:
toks = re.split(r'CustomerID:(\d{4}),\n', rawtxt)
list(zip(toks[1::2], toks[2::2]))
EDIT: corrected index in zip(). Sample output after correction:
[('1111', 'text1\n'),
('2222', 'text2\n'),
('3333', 'text3\n'),
('4444', 'text4\n'),
('5555', 'text5')]
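For reference, here is a self-contained sketch of the same re.split idea. The sample string and the helper name split_records are made up for illustration, and the pattern is loosened to \d+ and \s* on the assumption that IDs are not always four digits:

```python
import re

# Hypothetical sample shaped like the question's data; the text lines are invented.
rawtxt = (
    "CustomerID:1111,\n"
    "text1 line a\n"
    "text1 line b\n"
    "CustomerID:2222,\n"
    "text2\n"
    "CustomerID:3333,\n"
    "text3\n"
)

def split_records(text):
    """Return (id, body) tuples using re.split with one capturing group."""
    toks = re.split(r'CustomerID:(\d+),\s*', text)
    # toks[0] is whatever precedes the first header (an empty string here);
    # after that, captured IDs and text bodies alternate.
    return [(cid, body.strip()) for cid, body in zip(toks[1::2], toks[2::2])]

records = split_records(rawtxt)
print(records)
```

Because the group is capturing, re.split keeps each ID in the result list, which is what makes the odd/even slicing work.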
Upvotes: 0
Reputation: 5292
Another simple option may be:
>>> re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]
Edit:
If needed (for less tidy data), use filter:
filter(lambda x: len(x)>1,re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt))
Upvotes: 1
Reputation: 103744
Given:
>>> txt='''\
... CustomerID:1111,
...
... text1
...
... CustomerID:2222,
...
... text2
...
... CustomerID:3333,
...
... text3
...
... CustomerID:4444,
...
... text4
...
... CustomerID:5555,
...
... text5'''
You can do:
>>> [re.findall(r'^(\d+),\s+(.+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1')], [('2222', 'text2')], [('3333', 'text3')], [('4444', 'text4')], [('5555', 'text5')]]
If it is multiline text, you can do:
>>> [re.findall(r'^(\d+),\s+([\s\S]+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1\n\n')], [('2222', 'text2\n\n')], [('3333', 'text3\n\n')], [('4444', 'text4\n\n')], [('5555', 'text5')]]
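Note that both comprehensions return a list of one-element lists. If a flat list of tuples is wanted, one possible variation (a sketch only, with a shortened sample string and a made-up variable name pairs) is to use re.match per block and strip the trailing whitespace:

```python
import re

# Shortened, hypothetical sample in the same shape as txt above.
txt = "CustomerID:1111,\n\ntext1\n\nCustomerID:2222,\n\ntext2\n\nCustomerID:3333,\n\ntext3"

pairs = []
for block in txt.split('CustomerID:'):
    if not block:
        continue  # skip the empty piece before the first header
    # re.match anchors at the start of the block, so no '^' is needed.
    m = re.match(r'(\d+),\s+([\s\S]+)', block)
    if m:
        pairs.append((m.group(1), m.group(2).strip()))

print(pairs)
```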
Upvotes: 1
Reputation: 67968
re.findall(r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)
findall returns only the groups. Use a lookahead to stop the non-greedy quantifier. It is also advisable to write your regexes as raw strings (the r prefix). Without the lookahead, the "CustomerID:" that starts the next record is consumed by the current match, so the next record can no longer match. A lookahead avoids this because it checks what follows without consuming any of the string.
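To make the difference concrete, here is a small comparison on a made-up four-record sample (the sample data and the variable names consuming and lookahead are illustrative only):

```python
import re

# Hypothetical sample with single-line texts, for brevity.
rawtxt = (
    "CustomerID:1111,\ntext1\n"
    "CustomerID:2222,\ntext2\n"
    "CustomerID:3333,\ntext3\n"
    "CustomerID:4444,\ntext4\n"
)

# The trailing "CustomerID:" is part of the match, so it is consumed
# and the record it introduces is skipped.
consuming = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*CustomerID:', rawtxt, re.DOTALL)

# The lookahead only checks for "CustomerID:" (or end of input) without
# consuming it, so the next search can start at that header.
lookahead = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)

print(consuming)  # [('1111', 'text1'), ('3333', 'text3')]
print(lookahead)  # all four records
```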
Upvotes: 2
Reputation: 22282
Actually, there is no need for a regex here:
>>> with open('file') as f:
... rawtxt = [i.strip() for i in f if i != '\n']
...
>>> l = []
>>> for i in [rawtxt[i:i+2] for i in range(0, len(rawtxt), 2)]:
... l.append((i[0][11:-1], i[1]))
...
>>> l
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]
If you need 1111, 2222, etc. to be int, use l.append((int(i[0][11:-1]), i[1])) instead of l.append((i[0][11:-1], i[1])).
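The pairing above assumes exactly one line of text per ID. Since the question says each text spans multiple lines, a regex-free variation might group lines under the most recent header instead. This is only a sketch: the sample data, the PREFIX constant and the group_records helper are made up for illustration.

```python
# Hypothetical sample where one record has several lines of text.
rawtxt = """CustomerID:1111,
first line of text1
second line of text1
CustomerID:2222,
text2
"""

PREFIX = "CustomerID:"

def group_records(text):
    """Collect (id, text) tuples, joining all lines that follow each header."""
    records = []
    current_id, lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(PREFIX):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                records.append((current_id, "\n".join(lines)))
            current_id = line[len(PREFIX):].rstrip(",")
            lines = []
        elif current_id is not None:
            lines.append(line)
    if current_id is not None:
        records.append((current_id, "\n".join(lines)))
    return records

result = group_records(rawtxt)
print(result)
```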
Upvotes: 2