mfg_2018

Reputation: 65

Regex to get the text between two strings in Python

I have some text like this:

CustomerID:1111,

text1

CustomerID:2222,

text2

CustomerID:3333,

text3

CustomerID:4444,

text4

CustomerID:5555,

text5

Each text has multiple lines.

I want to store the customer id and the text for each id in tuples (e.g. (1111, text1), (2222, text2), etc).

First, I use the expression below:

re.findall('CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)

However, I only get (1111, text1), (3333, text3), (5555, text5).....

Upvotes: 4

Views: 112

Answers (5)

Muposat

Reputation: 1506

re.findall is not the best tool for this pattern: matches cannot overlap, and the trailing CustomerID: is consumed as part of each match, so the next record's header is no longer available and every other record is skipped.

A tool practically created for this is re.split. The parentheses capture the id number so that it is kept in the result, while the literal "CustomerID:" is dropped. A second line stitches the tokens into tuples the way you wanted:

import re
toks = re.split(r'CustomerID:(\d{4}),\n', t)  # t holds the raw text
list(zip(toks[1::2], toks[2::2]))  # list() so that Python 3 shows the pairs

EDIT: corrected index in zip(). Sample output after correction:

[('1111', 'text1\n'),
 ('2222', 'text2\n'),
 ('3333', 'text3\n'),
 ('4444', 'text4\n'),
 ('5555', 'text5')]
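
Since the question says each text spans several lines, here is a minimal sketch of the same re.split idea on multi-line records. The sample string and the names rawtxt and records are made up for illustration; the pattern uses \d+ instead of \d{4} and strips surrounding whitespace, which is an assumption about what you want to keep.

```python
import re

# Hypothetical sample: each record's text may span more than one line.
rawtxt = (
    "CustomerID:1111,\n\nfirst line\nsecond line\n\n"
    "CustomerID:2222,\n\nonly line\n\n"
    "CustomerID:3333,\n\nlast record"
)

# The capturing group makes re.split keep each id in the result:
# ['', '1111', <text>, '2222', <text>, ...]
toks = re.split(r'CustomerID:(\d+),', rawtxt)

# Pair the ids (odd positions) with the texts (even positions),
# trimming the blank lines around each text.
records = [(cid, body.strip()) for cid, body in zip(toks[1::2], toks[2::2])]
print(records)
```

The first token is whatever precedes the first header (an empty string here), which is why the slices start at 1 and 2.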

Upvotes: 0

Learner

Reputation: 5292

Another simple option:

>>> re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]

Edit: if needed (for less regular data), use filter. On Python 3, wrap it in list() to see the result:

list(filter(lambda x: len(x) > 1, re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)))


Upvotes: 1

dawg

Reputation: 103744

Given:

>>> txt='''\
... CustomerID:1111,
... 
... text1
... 
... CustomerID:2222,
... 
... text2
... 
... CustomerID:3333,
... 
... text3
... 
... CustomerID:4444,
... 
... text4
... 
... CustomerID:5555,
... 
... text5'''

You can do:

>>> [re.findall(r'^(\d+),\s+(.+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1')], [('2222', 'text2')], [('3333', 'text3')], [('4444', 'text4')], [('5555', 'text5')]]

If it is multiline text, you can do:

>>> [re.findall(r'^(\d+),\s+([\s\S]+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1\n\n')], [('2222', 'text2\n\n')], [('3333', 'text3\n\n')], [('4444', 'text4\n\n')], [('5555', 'text5')]]
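
If you would rather have a flat list of tuples than a list of one-element lists, one possible variation is sketched below. It swaps re.findall for re.match (an assumption on my part, since each block holds exactly one record) and strips the trailing blank lines; the short txt sample and the name pairs are only for illustration.

```python
import re

# Shortened, hypothetical version of the sample text above.
txt = "CustomerID:1111,\n\ntext1\n\nCustomerID:2222,\n\ntext2"

pairs = []
for block in txt.split('CustomerID:'):
    # re.match anchors at the start of the block, like ^ in the findall version.
    m = re.match(r'(\d+),\s+([\s\S]+)', block)
    if m:  # the empty block before the first header does not match
        pairs.append((m.group(1), m.group(2).strip()))

print(pairs)
```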

Upvotes: 1

vks

Reputation: 67968

re.findall(r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)

findall returns only the captured groups. Use a lookahead to stop the non-greedy quantifier. It is also recommended to write regexes as raw strings (the r prefix). Without the lookahead, the CustomerID: of the next record is consumed by the current match, so that record can no longer be matched. Matches cannot overlap, so the boundary has to be checked with a lookahead, which does not consume any of the string.
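
To make the difference concrete, here is a small sketch that runs a consuming pattern and the lookahead pattern on the same input. The three-record sample and the names consuming and lookahead are made up for illustration.

```python
import re

# Hypothetical three-record sample in the same layout as the question.
rawtxt = (
    "CustomerID:1111,\n\ntext1\n\n"
    "CustomerID:2222,\n\ntext2\n\n"
    "CustomerID:3333,\n\ntext3"
)

# The trailing CustomerID: is part of the match, so the header of the
# next record is used up and that record cannot start a new match.
consuming = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*CustomerID:', rawtxt, re.DOTALL)

# The lookahead only checks that CustomerID: (or the end of the string)
# follows; nothing is consumed, so the next match can start there.
lookahead = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)

print(consuming)
print(lookahead)
```

With this input the consuming pattern finds only the first record, while the lookahead version finds all three.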

Upvotes: 2

Remi Guan

Reputation: 22282

Actually, there is no need for regex here:

>>> with open('file') as f:
...     rawtxt = [i.strip() for i in f if i != '\n']
...     
>>> l = []
>>> for i in [rawtxt[i:i+2] for i in range(0, len(rawtxt), 2)]:
...     l.append((i[0][11:-1], i[1]))
...     
... 
>>> l
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]
>>> 

If you need 1111, 2222, etc. to be ints, use l.append((int(i[0][11:-1]), i[1])) instead of l.append((i[0][11:-1], i[1])).
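
The pairing above assumes exactly one text line per id. If the text really spans several lines, a rough sketch of a regex-free alternative is to group lines under the most recent header. The function name parse_records and the sample string are hypothetical, and blank lines inside a text are dropped, which may or may not be what you want.

```python
def parse_records(lines):
    """Group lines under the most recent 'CustomerID:<id>,' header line."""
    prefix = 'CustomerID:'
    records = []  # list of (id, [text lines]), built up line by line
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith(prefix):
            # Drop the prefix and the trailing comma, then start a new record.
            cid = line[len(prefix):].rstrip(',')
            records.append((int(cid), []))
        elif records:
            # Any other line belongs to the current record's text.
            records[-1][1].append(line)
    return [(cid, '\n'.join(body)) for cid, body in records]


# Hypothetical sample; with a file you could pass the open file object instead.
sample = "CustomerID:1111,\n\nline a\nline b\n\nCustomerID:2222,\n\nline c\n"
result = parse_records(sample.splitlines())
print(result)
```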

Upvotes: 2
