Reputation: 20420
I'm trying to extract some data from a webpage using Python 2.7.5.
code:
import re
p = re.compile(r'.*<section\s*id="(.+)">(.+)</section>.*')
str = 'df <section id="1">2</section> fdd <section id="3">4</section> fd'
m = p.findall(str)
for eachentry in m:
    print 'id=[{}], text=[{}]'.format(eachentry[0], eachentry[1])
output:
id=[3], text=[4]
Why is it extracting only the last occurrence? If I remove the last occurrence, the first one is found.
Upvotes: 3
Views: 1900
Reputation: 2382
Your regular expression needs to be changed as follows:
p = re.compile(r'<section\s*id="(.+?)">(.+?)</section>')
Upvotes: 1
Reputation: 239493
The .* at the beginning is greedy, so it consumes everything up to the last occurrence. In fact, all the .* and .+ quantifiers in the expression are greedy. So, we make them non-greedy with ?, like this
p = re.compile(r'.*?<section\s*id="(.+?)">(.+?)</section>.*?')
And the output becomes
id=[1], text=[2]
id=[3], text=[4]
In fact, you can drop the first and last .* patterns and keep it simple like this
p = re.compile(r'<section\s*id="(.+?)">(.+?)</section>')
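To see the difference side by side, here is a small sketch that runs the original greedy pattern and the simplified non-greedy one against the sample string from the question. The names `text`, `greedy`, and `lazy` are my own (I used `text` rather than `str` to avoid shadowing the built-in), and `print(...)` with a single argument is written so it behaves the same on Python 2.7 and Python 3.

```python
import re

# Sample input from the question.
text = 'df <section id="1">2</section> fdd <section id="3">4</section> fd'

# Original pattern: the leading .* is greedy, so it swallows the first
# <section> and only backtracks far enough to match the last one.
greedy = re.compile(r'.*<section\s*id="(.+)">(.+)</section>.*')

# Simplified pattern: no surrounding .*, and both groups are non-greedy.
lazy = re.compile(r'<section\s*id="(.+?)">(.+?)</section>')

greedy_matches = greedy.findall(text)
lazy_matches = lazy.findall(text)

print(greedy_matches)  # [('3', '4')]
print(lazy_matches)    # [('1', '2'), ('3', '4')]
```

Note that `findall` scans for non-overlapping matches from left to right, which is why the surrounding .* is not needed: once the greedy version matches the entire string in one go, there is nothing left for a second match.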
Upvotes: 6