Reputation: 65
I am just starting to use regex and am trying to use it to parse some data from an HTML table. I want to grab everything between the <tr > and </tr> tags, and then write a similar regex to build a JSON array.
I tried the pattern below, but it only matches the first group and none of the rest.
<tr >(.*?)</tr>
How do I make that find all matches between those tags?
Upvotes: 1
Views: 932
Reputation: 41838
Although using regex for this job is a bad idea (there are many ways for things to go wrong), your pattern is basically correct.
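As one illustration of how things can go wrong, here is a minimal sketch with a made-up sample string (not the asker's data) in which the opening tags vary slightly. The literal `<tr >` in the pattern only matches the first row:

```python
import re

# Hypothetical sample: three rows whose opening tags differ slightly.
html = '<tr >one</tr><tr>two</tr><tr class="odd">three</tr>'

# The pattern from the question, compiled with DOTALL so "." also matches newlines.
pattern = re.compile(r"<tr >(.*?)</tr>", re.DOTALL)

rows = pattern.findall(html)
print(rows)  # only the row whose tag is literally "<tr >" is found
```

If the real table is this irregular, a proper HTML parser is likely to be more robust than adjusting the pattern case by case.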
Returning All Matches with Python
The question then becomes how to return all matches or capture groups in Python. There are two basic ways:
With finditer
# regex is a compiled pattern (re.compile), subject is the HTML string
for match in regex.finditer(subject):
    print("The Overall Match: ", match.group(0))
    print("Group 1: ", match.group(1))
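For a self-contained version of the finditer approach, the sketch below assumes a small sample string for `subject` and compiles the question's pattern with `re.DOTALL`; both are illustrative choices, not something stated above.

```python
import re

# Sample input, assumed for illustration; the second row spans two lines.
subject = "<tr ><td>1</td></tr>\n<tr ><td>2</td>\n</tr>"

# DOTALL lets "." match the newline inside the second row.
regex = re.compile(r"<tr >(.*?)</tr>", re.DOTALL)

inner = []
for match in regex.finditer(subject):
    # group(0) is the whole match, group(1) is the text between the tags.
    print("Overall:", repr(match.group(0)), "at", match.span())
    inner.append(match.group(1))

print(inner)
```

Each item yielded by finditer is a match object, so the overall match, the groups, and the positions are all available without changing the pattern.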
With findall
findall is a bit strange: when the pattern contains capture groups, it returns only the groups, not the overall match. To access both the capture groups and the overall match, you have to wrap your original regex in parentheses (so that the overall match is captured too). In your case, if you wanted to access both the outside of the tags and the inside (which you captured with Group 1), your regex would become (<tr >(.*?)</tr>). Then you do:
matches = regex.findall(subject)
for match in matches:
    print("The Overall Match: ", match[0])
    print("Group 1: ", match[1])
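To make the difference concrete, this short sketch (with an assumed two-row sample string) compares what findall returns for the original pattern and for the wrapped one:

```python
import re

# Sample input, assumed for illustration.
subject = "<tr >a</tr><tr >b</tr>"

# One capture group: findall returns a list of strings (the group only).
one_group = re.findall(r"<tr >(.*?)</tr>", subject)

# Two capture groups: findall returns a list of tuples, one item per group.
two_groups = re.findall(r"(<tr >(.*?)</tr>)", subject)

print(one_group)
print(two_groups)
```

With the wrapped pattern, index 0 of each tuple is the overall match and index 1 is the inner text, which is why the loop above reads match[0] and match[1].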
Upvotes: 1
Reputation: 394945
It works for me. Perhaps you need to use findall, or perhaps you're not using a raw string?
import re
txt = '''<tr >foo</tr><tr >bar
</tr>
<tr >baz</tr>'''
# Be sure to use the DOTALL flag so the newlines are matched by the dot as well.
re.findall(r'<tr >(.*?)</tr>', txt, re.DOTALL)
returns
['foo', 'bar\n', 'baz']
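The question also mentions building a JSON array from the rows. A rough sketch of that second step follows; it assumes the cells are plain `<td>` tags with no attributes, and the sample data and the names `row_re` and `cell_re` are made up for illustration.

```python
import json
import re

# Hypothetical table rows; the second row is split across two lines.
txt = "<tr ><td>Ann</td><td>30</td></tr>\n<tr ><td>Bob</td>\n<td>25</td></tr>"

row_re = re.compile(r"<tr >(.*?)</tr>", re.DOTALL)
cell_re = re.compile(r"<td>(.*?)</td>", re.DOTALL)  # assumes bare <td> tags

# First pass finds the rows, second pass finds the cells inside each row.
rows = [
    [cell.strip() for cell in cell_re.findall(row)]
    for row in row_re.findall(txt)
]

as_json = json.dumps(rows)
print(as_json)
```

This yields a JSON array of arrays, one inner array per table row. If the real markup has attributes on the tags or nested tables, the patterns would need to change, or a parser would be the safer choice.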
Upvotes: 0