Reputation: 523
I have this text:
<div class="additional-details">
<div class="mark-container">
<input type="checkbox" id="comp-80174649" value="80174649"
data-heading-code="2550"/>
<label for="comp-80174649">???</label>
<a href="#" class="compare-link" id="compare-link-1"
data-compare="/80174649/2550/"
data-drop-down-id="compare-content-1"
data-drop-down-content-id="compare-content"
data-drop-down-class="drop-down-compare"
etc...
data-compare="/8131239/2550/"
I am trying to scrape the value inside each data-compare="HERE" attribute (there are multiple matches).
I know how to do this in C# using a MatchCollection, but in Python I am confused by re.search and re.match, and I've also noticed that the regex that works in C# does not work in Python.
Could somebody explain how to get this done?
Upvotes: 2
Views: 681
Reputation: 1836
re.findall returns all the matches as a list.
>>> import re
>>> s = '<div cla' # whole string here
>>> result = re.findall(r'data-compare="([\d/]+)"', s)  # raw string keeps \d intact
>>> print(result)
['/80174649/2550/', '/8131239/2550/']
Explanation
The desired output, such as '/80174649/2550/', contains only digits and forward slashes, so the pattern targets just those characters.
In ([\d/]+), the character class [\d/] matches either a digit (\d) or a forward slash (/).
The + means the preceding class [\d/] must occur one or more times, since each value contains several digits and slashes.
The enclosing parentheses form a capture group: when the pattern contains a single group, re.findall returns only the text captured by [\d/]+ rather than the whole match.
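Since the question also mentions re.search, re.match and C#'s MatchCollection, here is a minimal self-contained sketch of how they compare. The html string is assembled from the fragments in the question; the second link's id (compare-link-2) is a made-up placeholder, because the original markup is elided at that point.

```python
import re

# Sample markup based on the question; "compare-link-2" is a hypothetical id
# standing in for the part of the page that was elided with "etc...".
html = '''
<a href="#" class="compare-link" id="compare-link-1"
   data-compare="/80174649/2550/"
   data-drop-down-id="compare-content-1">
<a href="#" class="compare-link" id="compare-link-2"
   data-compare="/8131239/2550/">
'''

pattern = re.compile(r'data-compare="([\d/]+)"')

# findall: the captured group of every match, as a list of strings.
values = pattern.findall(html)
print(values)

# finditer: one match object per match, the closest analogue to
# looping over a C# MatchCollection.
positions = [(m.group(1), m.start(1)) for m in pattern.finditer(html)]

# search: only the first match, found anywhere in the string.
first = pattern.search(html)

# match: succeeds only if the pattern matches at the very start of the
# string, so it returns None here.
at_start = pattern.match(html)
```

For HTML that is less regular than this, an HTML parser is usually a more robust choice than a regex, but for a fixed attribute format like this one the pattern above should be enough.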
Upvotes: 1