LaGuille

Reputation: 1728

Combining two REGEX in Python for compiling

I'm using REGEX to compile a list of strings from an HTML document in Python. The strings are either found inside a td tag (<td>SOME OF THE STRINGS COULD BE HERE</td>) or inside a div tag (<div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>).

Since the order of the strings inside the final list should correspond to the order in which they appear inside the HTML document, I am looking for a REGEX that will allow me to compile all of these strings considering both possible cases.

I know how to do it individually with something that looks like:

FindStrings = re.compile(r'(?<=<td>)(.*?)(?=</td>)')
MyList = FindStrings.findall(str(mydocument))

for the first case, but would like to know the most efficient way to combine both cases inside a unique REGEX.

Upvotes: 1

Views: 2046

Answers (2)

Federico Piazza

Reputation: 31035

You can combine the two patterns with regex alternation (|). By the way, you don't need lookarounds here; plain capture groups are enough.

You can use this regex:

<td>(.+?)</td>|<div.*?>(.+?)</div>

Match information

MATCH 1
1.  [4-37]  `SOME OF THE STRINGS COULD BE HERE`
MATCH 2
2.  [94-125]    `SOME STRINGS COULD ALSO BE HERE`

Code:

>>> import re
>>> s = """<td>SOME OF THE STRINGS COULD BE HERE</td>
... <div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>
... """
>>> m = re.findall(r'<td>(.+?)</td>|<div.*?>(.+?)</div>', s)
>>> m
[('SOME OF THE STRINGS COULD BE HERE', ''), ('', 'SOME STRINGS COULD ALSO BE HERE')]
>>> [s for x in m for s in x if s]
['SOME OF THE STRINGS COULD BE HERE', 'SOME STRINGS COULD ALSO BE HERE']
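Since re.findall returns one tuple per match, with an empty string for the alternative that did not take part, the flat list can also be built directly from match objects. A minimal sketch of that idea, assuming the same pattern as above; the helper name extract_strings is hypothetical and only used for illustration:

```python
import re

# Same alternation as above: group 1 captures <td> content, group 2 <div> content.
PATTERN = re.compile(r'<td>(.+?)</td>|<div.*?>(.+?)</div>')

def extract_strings(html):
    """Return the captured strings in document order (hypothetical helper)."""
    # For each match exactly one group is set and the other is None.
    # Because (.+?) needs at least one character, the set group is never empty,
    # so `or` safely picks whichever one matched.
    return [m.group(1) or m.group(2) for m in PATTERN.finditer(html)]

s = """<td>SOME OF THE STRINGS COULD BE HERE</td>
<div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>
"""
print(extract_strings(s))
# ['SOME OF THE STRINGS COULD BE HERE', 'SOME STRINGS COULD ALSO BE HERE']
```

Matches are yielded left to right, so the order of the result follows the order of the tags in the document.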

Upvotes: 1

vks

Reputation: 67978

<td[^>]*>((?:(?!<\/td>).)*)<\/td>|<div[^>]*>((?:(?!<\/div>).)*)<\/div>

You can try this. See the demo:

http://regex101.com/r/mD7gK4/11
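For reference, this pattern could be used from Python roughly as follows. This is only a sketch under a few assumptions: the re.DOTALL flag is my addition so that content spanning several lines is captured, the helper name extract_text_blocks is hypothetical, and the sample input is made up. Because the groups use `*`, a tag with empty content yields an empty string, so the sketch selects the group through lastindex rather than by truthiness.

```python
import re

# The pattern from this answer; the \/ escapes are harmless in Python.
# re.DOTALL is an assumption added here so that "." also matches newlines.
PATTERN = re.compile(
    r'<td[^>]*>((?:(?!<\/td>).)*)<\/td>|<div[^>]*>((?:(?!<\/div>).)*)<\/div>',
    re.DOTALL,
)

def extract_text_blocks(html):
    """Return <td> and <div> contents in document order (hypothetical helper)."""
    # lastindex is the number of the group that took part in the match:
    # 1 for a <td> match, 2 for a <div> match, even when the content is empty.
    return [m.group(m.lastindex) for m in PATTERN.finditer(html)]

# Made-up sample: attributes on <td>, multi-line <div> content, an empty <td>.
html = '<td class="a">first</td>\n<div style="margin:0;">second\nline</div><td></td>'
print(extract_text_blocks(html))
# ['first', 'second\nline', '']
```

Compared with `<td>` and `<div.*?>`, the `[^>]*` form also accepts attributes on the td tag and cannot run past the end of the opening tag.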

Upvotes: 0
