Reputation: 1728
I'm using REGEX to compile a list of strings from an HTML document in Python.
The strings are found either inside a td tag (<td>SOME OF THE STRINGS COULD BE HERE</td>) or inside a div tag (<div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>).
Since the order of the strings inside the final list should correspond to the order in which they appear inside the HTML document, I am looking for a REGEX that will allow me to compile all of these strings considering both possible cases.
I know how to do it individually with something that looks like:
FindStrings = re.compile(r'(?<=<td>)(.*?)(?=</td>)')
MyList = FindStrings.findall(str(mydocument))
for the first case, but I would like to know the most efficient way to combine both cases into a single regex.
Upvotes: 1
Views: 2046
Reputation: 31035
You can combine the two patterns with regex alternation (|). By the way, you don't need lookarounds here; capture groups are enough.
You can use this regex:
<td>(.+?)</td>|<div.*?>(.+?)</div>
Match information
MATCH 1
1. [4-37] `SOME OF THE STRINGS COULD BE HERE`
MATCH 2
2. [94-125] `SOME STRINGS COULD ALSO BE HERE`
Code:
>>> import re
>>> s = """<td>SOME OF THE STRINGS COULD BE HERE</td>
... <div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>
... """
>>> m = re.findall(r'<td>(.+?)</td>|<div.*?>(.+?)</div>', s)
>>> m
[('SOME OF THE STRINGS COULD BE HERE', ''), ('', 'SOME STRINGS COULD ALSO BE HERE')]
>>> [text for groups in m for text in groups if text]
['SOME OF THE STRINGS COULD BE HERE', 'SOME STRINGS COULD ALSO BE HERE']
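If the tuples with empty strings returned by findall are a nuisance, one possible variation (a sketch, not part of the original answer) is to alternate on the tag name only and use a backreference for the closing tag, so the text always lands in the same group. The names html, pattern and strings are just illustrative, and the sketch assumes the tags are not nested and the text does not span several lines.

```python
import re

html = """<td>SOME OF THE STRINGS COULD BE HERE</td>
<div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>
"""

# Group 1 captures the tag name (td or div); \1 requires the closing tag
# to use the same name. Group 2 is the text between the tags.
pattern = re.compile(r'<(td|div)\b[^>]*>(.+?)</\1>')

# finditer yields matches in document order, so the list keeps that order.
strings = [match.group(2) for match in pattern.finditer(html)]
print(strings)
```

Because the text is always in group 2, no filtering of empty strings is needed afterwards.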
Upvotes: 1
Reputation: 67978
<td[^>]*>((?:(?!<\/td>).)*)<\/td>|<div[^>]*>((?:(?!<\/div>).)*)<\/div>
You can try this. See the demo:
http://regex101.com/r/mD7gK4/11
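As a rough sketch of how the pattern above might be used from Python: the \/ escapes are not needed there, so they are dropped. The re.DOTALL flag is an assumption on my part that lets the text span several lines, and the sample html string and variable names are made up for illustration.

```python
import re

# Same idea as the pattern above: (?:(?!</td>).)* consumes one character at a
# time as long as the closing tag does not start at that position.
pattern = re.compile(
    r'<td[^>]*>((?:(?!</td>).)*)</td>|<div[^>]*>((?:(?!</div>).)*)</div>',
    re.DOTALL,  # assumption: let "." match newlines so multi-line cells work
)

# Hypothetical sample input, not taken from the question.
html = '<td class="a">first\nline</td><div>second</div>'

# Exactly one of the two groups participates in each match; pick that one.
strings = [
    m.group(1) if m.group(1) is not None else m.group(2)
    for m in pattern.finditer(html)
]
print(strings)
```

Unlike <td>(.+?)</td>, this version also accepts attributes on the td tag and tolerates empty cells, since the inner group may match zero characters.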
Upvotes: 0