Reputation: 1035

Confused on this regular expression pattern in Python

I want to find a 6-digit number in my webpage:

<td style="width:40px;">705214</td>

My code is:

s = f.read()
m = re.search(r'\A>\d{6}\Z<', s)
l = m.group(0)

Upvotes: 2

Answers (4)

Paul Karlin

Reputation: 840

I think you want something like this:

m = re.search(r'>(\d{6})<', s)
l = m.group(1)

The ( ) around \d{6} indicate a subgroup of the result.
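To make the difference concrete, here is a small sketch using the sample line from the question. It also checks the question's original pattern: \A and \Z anchor to the very start and very end of the whole string, so a pattern that requires a > at the start and a < after the end cannot match this input. The variable names are only for illustration.

```python
import re

s = '<td style="width:40px;">705214</td>'

# The question's pattern: \A = start of string, \Z = end of string,
# so nothing can follow \Z and the search finds nothing here.
original = re.search(r'\A>\d{6}\Z<', s)

m = re.search(r'>(\d{6})<', s)
whole = m.group(0)   # entire match, including the > and < delimiters
digits = m.group(1)  # only the parenthesized subgroup

print(original)
print(whole)
print(digits)
```

So group(0) still carries the surrounding > and <, while group(1) holds just the digits.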

If you want to find multiple instances of 6-digit substrings between > and < then try this:

s = '<tag1>111111</tag1> <tag2>222222</tag2>'
m = re.findall(r'>(\d{6})<', s)

In this case, m will be ['111111','222222'].

Upvotes: 1

Rich

Reputation: 183

You may want to allow for any whitespace (tabs, spaces, newlines) between the tags and the digits. \s* means zero or more whitespace characters.

s='<td style="width:40px;">\n\n705214\t\n</td>'
m=re.search(r'>\s*(\d{6})\s*<',s)
m.groups()
('705214',)

Parsing HTML is a blast. Usually you treat the file as one long line and strip the leading and trailing whitespace around the values inside the tags. Looking into an HTML table parsing module may help, especially if you need to parse several columns.

There is a Stack Overflow answer using lxml etree, and html.parser was also suggested. Food for thought. (Still learning what modules Python has to offer :) )
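As a rough sketch of the parser route, the standard-library html.parser module can collect the text of each td cell, after which a plain \d{6} check is enough. The CellCollector class and the sample table below are made up for illustration; they are not from the question, and real pages may need more careful handling (nested tags, malformed markup).

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: gathers the raw text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')  # start a new cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data  # text may arrive in several chunks


# Assumed sample input with whitespace around the number
s = '<table><tr><td style="width:40px;">\n\n705214\t\n</td><td>abc</td></tr></table>'

parser = CellCollector()
parser.feed(s)
parser.close()

# Keep only the cells whose stripped text is exactly six digits
six_digit = [c.strip() for c in parser.cells if re.fullmatch(r'\d{6}', c.strip())]
print(six_digit)
```

This way the regex only has to describe the value itself, not the markup around it.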

Upvotes: 1

Sufian Latif

Reputation: 13356

You can also use a look-ahead and a look-behind for the checking:

m = re.search(r'(?<=>)\d{6}(?=<)', s)
l = m.group(0)

This regex will match 6 digits that are preceded by a > and followed by a <. Because look-arounds do not consume characters, group(0) contains only the digits.
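A short sketch of the look-around version on sample input, assuming the same strings used elsewhere in this thread. Since the pattern has no capturing group, re.findall returns the whole matches, which here are just the digits.

```python
import re

s = '<td style="width:40px;">705214</td>'

# (?<=>) asserts a preceding '>', (?=<) asserts a following '<';
# neither is included in the match itself.
m = re.search(r'(?<=>)\d{6}(?=<)', s)
single = m.group(0) if m is not None else None

# With no capturing groups, findall returns the full matches.
many = re.findall(r'(?<=>)\d{6}(?=<)', '<tag1>111111</tag1> <tag2>222222</tag2>')

print(single)
print(many)
```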

Upvotes: 1

David Robinson

Reputation: 78630

If you just want to find 6 digits between a > and a < symbol, use the following regex:

import re
s = '<td style="width:40px;">705214</td>'
m = re.search(r'>(\d{6})<', s)
l = m.groups()[0]

Note the use of parentheses ( and ) to denote a capturing group.
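One thing to watch when reading from a real page: re.search returns None when nothing matches, so calling .groups() or .group() on the result would raise an AttributeError. A minimal guarded sketch, where find_six_digits is a hypothetical helper name and not part of the re module:

```python
import re


def find_six_digits(text):
    """Return the first 6-digit run between > and <, or None if absent."""
    m = re.search(r'>(\d{6})<', text)
    if m is None:
        return None  # no match: avoid calling .group() on None
    return m.group(1)


found = find_six_digits('<td style="width:40px;">705214</td>')
missing = find_six_digits('<td style="width:40px;">12345</td>')
print(found)
print(missing)
```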

Upvotes: 2
