Koosc2

Reputation: 65

Find all matches between two strings with regex

I am new to regex and am trying to use it to parse data from an HTML table. I want to grab everything between the <tr > and </tr> tags, and then write a similar regex to build a JSON array.

I tried the pattern below, but it matches only the first group and none of the rest.

<tr >(.*?)</tr>

How do I make that find all matches between those tags?

Upvotes: 1

Views: 932

Answers (2)

zx81

Reputation: 41838

Although using regex for this job is a bad idea (there are many ways for things to go wrong), your pattern is basically correct.
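For comparison, here is a minimal sketch of the parser-based route, using the standard library's html.parser instead of a regex. The class name RowCollector and the sample string are made up for illustration. The point is only that a parser tolerates variations in how the tag is written (case, attributes, spacing) that the literal pattern <tr > would miss.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the text found inside each <tr>...</tr> element."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one string per completed row
        self._current = None  # text pieces of the row being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "tr":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append("".join(self._current))
            self._current = None


# Hypothetical sample input; note the inconsistent tag spelling.
sample = '<tr >foo</tr><TR class="x">bar</TR><tr>baz</tr>'
collector = RowCollector()
collector.feed(sample)
collector.close()
print(collector.rows)  # ['foo', 'bar', 'baz']
```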

Returning All Matches with Python

The question then becomes how to return all matches or capture groups in Python. There are two basic ways:

  1. finditer
  2. findall

With finditer

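import re
subject = '<tr >foo</tr><tr >bar</tr>'  # sample input; use your own HTML here
regex = re.compile(r'<tr >(.*?)</tr>', re.DOTALL)  # DOTALL lets . match newlines
for match in regex.finditer(subject):
    print("The Overall Match: ", match.group(0))
    print("Group 1: ", match.group(1))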

With findall

findall is a bit strange. When the pattern contains capture groups, it returns only the groups, not the overall match. To access both, you have to wrap your original regex in parentheses so that the overall match is captured too. In your case, if you wanted to access both the whole element including the tags and the inside (which you captured with Group 1), your regex would become: (<tr >(.*?)</tr>). With two groups, findall returns a list of tuples, one per match. Then you do:

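import re
subject = '<tr >foo</tr><tr >bar</tr>'  # sample input; use your own HTML here
regex = re.compile(r'(<tr >(.*?)</tr>)', re.DOTALL)
matches = regex.findall(subject)  # list of (overall match, inner text) tuples
for match in matches:  # an empty list simply skips the loop
    print("The Overall Match: ", match[0])
    print("Group 1: ", match[1])  # the original Group 1 (Group 2 once wrapped)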
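To make the findall behaviour concrete, here is a small sketch comparing the three cases: no groups, one group, and the wrapped two-group pattern. The subject string and variable names are invented for the example.

```python
import re

subject = "<tr >a</tr><tr >b</tr>"  # made-up sample input

# No capture groups: findall returns the overall matches.
no_groups = re.findall(r"<tr >.*?</tr>", subject)

# One capture group: findall returns only that group's text.
inner_only = re.findall(r"<tr >(.*?)</tr>", subject)

# Outer group added: findall returns (overall, inner) tuples.
both = re.findall(r"(<tr >(.*?)</tr>)", subject)

print(no_groups)   # ['<tr >a</tr>', '<tr >b</tr>']
print(inner_only)  # ['a', 'b']
print(both)        # [('<tr >a</tr>', 'a'), ('<tr >b</tr>', 'b')]
```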

Upvotes: 1

Aaron Hall

Reputation: 394945

It works for me. Perhaps you need to use findall, or perhaps you're not using a raw string?

import re

txt = '''<tr >foo</tr><tr >bar

</tr>

<tr >baz</tr>'''

# Be sure to use the DOTALL flag so the newlines are matched by the dot as well.
re.findall(r'<tr >(.*?)</tr>', txt, re.DOTALL)

returns

['foo', 'bar\n\n', 'baz']
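To tie this back to the goal of building a JSON array, here is a rough sketch that applies the same findall approach a second time to the cells of each row. It assumes the cells are plain <td> tags with no attributes, which the question does not state, and the sample HTML is made up.

```python
import json
import re

# Hypothetical table fragment; the <td> cell markup is an assumption.
html = '''<tr ><td>1</td><td>foo</td></tr>
<tr ><td>2</td>
<td>bar</td></tr>'''

row_re = re.compile(r'<tr >(.*?)</tr>', re.DOTALL)
cell_re = re.compile(r'<td>(.*?)</td>', re.DOTALL)

# One inner list per row, one stripped string per cell.
table = [
    [cell.strip() for cell in cell_re.findall(row)]
    for row in row_re.findall(html)
]

as_json = json.dumps(table)
print(as_json)  # [["1", "foo"], ["2", "bar"]]
```

This only holds up while the markup stays as regular as the sample. Nested tables or attributes on the tags would break both patterns.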

Upvotes: 0
