Can't get a regex pattern to work in Python

Question

I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.


Demand No

I can get the first value by using

match_det = re.compile(r'(.+?)').findall(html_source_det)

That works because the value is on a single line. However, I also need the second value, which is on the line following the first one, and I cannot get that to work. I have tried the following, but I get no match:

match_det = re.compile('(.+?)
'
                       '').findall(html_source_det)

Perhaps it fails because the text spans multiple lines. I added " " at the end of the first line, thinking this would resolve it, but it did not.

What am I doing wrong?

The html_source is retrieved by downloading it (it is not a static HTML file as outlined above; I only put it here so you could see the text). Maybe this is not the best way of getting the source.

I am obtaining the html_source like this:

new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()

heinst · Accepted Answer

Please do not try to parse HTML with regex, as HTML is not a regular language. Instead, use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:

from bs4 import BeautifulSoup

html = '''
Demand No

'''

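# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
# Find the 65%-wide cell, then take the value of the first input that follows it
print(soup.find('td', attrs={'width': '65%'}).find_next('input')['value'])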

Or more simply:

print(soup.find('input', attrs={'name': 'T1'})['value'])
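
If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is only a sketch under assumptions: the original markup is not visible above, so the table layout, the second field name T2, the "Demand Date" label and both values are made up for illustration, and the code assumes Python 3 (in Python 2 the module is called HTMLParser).

```python
from html.parser import HTMLParser

# Hypothetical markup: the real HTML is not shown in the question, so the
# tag layout, the label "Demand Date", the name T2 and both values are
# assumptions. Only the 65% width and the name T1 come from the answer.
SAMPLE_HTML = '''
<tr>
  <td width="35%">Demand No</td>
  <td width="65%"><input type="text" name="T1" value="12345"></td>
</tr>
<tr>
  <td width="35%">Demand Date</td>
  <td width="65%"><input type="text" name="T2" value="2014-07-01"></td>
</tr>
'''


class InputValueCollector(HTMLParser):
    """Collect the name -> value pairs of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) tuples.
        if tag == 'input':
            attributes = dict(attrs)
            if 'name' in attributes:
                self.values[attributes['name']] = attributes.get('value')


def extract_input_values(html):
    """Return a dict mapping each input's name to its value attribute."""
    collector = InputValueCollector()
    collector.feed(html)
    return collector.values


print(extract_input_values(SAMPLE_HTML))
```

Because the parser works on tags rather than on lines, it does not matter whether the two inputs sit on the same line or on consecutive lines, which is the case the regex struggled with.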
