Reputation: 75
I am trying to write a regex that finds every name, URL and phone number in an HTML page, but I'm having trouble with the phone number part. I think the problem is that the pattern keeps searching until it finds a </strong>, and in the process it skips people instead of returning an empty string when a person has no phone number. Simply put, instead of a list like this: url1+name1+num1 | url2+name2+"" | url3+name3+num3
it returns a list like this: url1+name1+num1 | url2+name2+num3, with url3+name3 dropped in the process.
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):
I am searching for people in a single, very long line. A person may or may not have a URL or a phone number. An example of a person with a URL and a phone number:
<tr> <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html"> dr. Ivan Bratko akad. prof.</a></div></td> <td class="lablinksMail"><a href="javascript:void(cmPopup('sendMessage', '/si/ivan-bratko/mailer.html', true, 350, 350));"><img src="/Static/images/gui/mail.gif" height="8" width="11"></a></td> <td class="lablinksPhone"><div><strong>T:</strong> +386 1 4768 393 </div></td> </tr>
And an example of a person with no url or phone number
<tr> <td class="lablinksName"><div> dr. Branko Matjaž Jurič prof.</div></td> <td class="lablinksMail"><a href="javascript:void(cmPopup('sendMessage', '/si/branko-matjaz-juric/mailer.html', true, 350, 350));"><img src="/Static/images/gui/mail.gif" height="8" width="11"></a></td> <td class="lablinksPhone"><div> </div></td> </tr>
I hope I was clear enough, and that someone can help me.
Upvotes: 0
Views: 221
Reputation: 25824
The quick and dirty way to fix it:
Replace
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):
with
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page.replace("<tr>","\n")):
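The issue is that the .*? in .*?</strong> can match strings containing td class="lablinksMail, so it runs past the end of one person's row and on to the </strong> of a later row. It cannot match \n, so once every row starts on its own line the match stays inside a single row. Any time you use . in a regex (rather than [^<]), this kind of annoyance tends to happen.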
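To see the effect, here is a small sketch that runs the question's pattern on a one-line page before and after the replace. The rows are abridged (the mail cell is left out), the third person is made up for illustration, and the row() helper is hypothetical.

```python
import re

# Pattern from the question, unchanged.
PATTERN = r'Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?'


def row(name_cell, phone_cell):
    """Hypothetical helper: build one abridged table row (mail cell omitted)."""
    return ('<tr> <td class="lablinksName"><div>' + name_cell + '</div></td> '
            '<td class="lablinksPhone"><div>' + phone_cell + '</div></td> </tr>')


# Three people on a single line; the second has no URL and no phone number.
# The third entry is invented test data, not taken from the real page.
page = " ".join([
    row('<a href="/si/ivan-bratko/default.html"> dr. Ivan Bratko akad. prof.</a>',
        '<strong>T:</strong> +386 1 4768 393 '),
    row(' dr. Branko Matjaž Jurič prof.', ' '),
    row('<a href="/si/example-person/default.html"> dr. Example Person</a>',
        '<strong>T:</strong> +386 1 0000 000 '),
])

# Without newlines, .*? runs from the second row into the third row's </strong>.
broken = re.findall(PATTERN, page)

# With one row per line, . cannot cross the newline, so the optional
# phone group simply fails for the second row and yields ''.
fixed = re.findall(PATTERN, page.replace("<tr>", "\n"))

print(len(broken), "matches on the original page")
print(len(fixed), "matches after the replace")
for url, name, pnumber in fixed:
    print(repr(url), repr(name.strip()), repr(pnumber.strip()))
```

On the unmodified page the second person picks up the third person's number and the third person disappears; after the replace each person gets their own tuple.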
Upvotes: 0
Reputation: 56624
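import lxml.html

# Parse the page and select the rows of the contact table.
root = lxml.html.parse("http://my.example.com/page.html").getroot()
rows = root.xpath("//table[@id='contactinfo']/tr")
for r in rows:
    # The name is either plain text in the div or the text of the link inside it.
    nameText = r.xpath("td[@class='lablinksName']/div/text() | td[@class='lablinksName']/div/a/text()")
    name = u''.join(nameText).strip()
    # The URL is optional; fall back to an empty string.
    urls = r.xpath("td[@class='lablinksName']/div/a/@href")
    url = urls[0] if urls else ''
    # Only text nodes directly inside the div, so the "T:" label in <strong> is skipped.
    phoneText = r.xpath("td[@class='lablinksPhone']/div/text()")
    phone = u''.join(phoneText).strip()
    print(name, url, phone)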
For the purpose of this code, I assume <table id="contactinfo">{your table rows}</table>.
Upvotes: 1
Reputation: 77251
Looks like a job for Beautiful Soup.
I love the quote: "You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
Upvotes: 0
Reputation: 360
If you're having this kind of difficulty, it's usually a good sign you're using the wrong approach. In particular, if I were doing this via regexp, I would first split the page into rows, and I wouldn't even try to extract a phone number unless the row in question had the <td class="lablinksPhone"> tag in it.
Upvotes: 0
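A rough sketch of the row-at-a-time idea from the last answer, assuming every person starts with a <tr> and that the cells look like the two examples in the question. The names NAME_RE, PHONE_RE and parse_people are made up for this illustration, and the sample rows are abridged (the mail cell is left out).

```python
import re

# Both patterns only ever look inside one row, so nothing can leak across people.
NAME_RE = re.compile(r'lablinksName"><div>(?:<a href="/si([^">]*)">)?\s*([^<]*)')
PHONE_RE = re.compile(r'<td class="lablinksPhone"><div>(?:<strong>T:</strong>)?([^<]*)')


def parse_people(page):
    """Return a list of (url, name, phone) tuples, with '' for missing parts."""
    people = []
    # Everything before the first <tr> is not a row, so skip it.
    for chunk in page.split("<tr>")[1:]:
        name_match = NAME_RE.search(chunk)
        if name_match is None:
            continue  # not a person row
        # Only look for a number if this row has a phone cell at all.
        phone_match = PHONE_RE.search(chunk)
        phone = phone_match.group(1).strip() if phone_match else ""
        url = name_match.group(1) or ""
        people.append((url, name_match.group(2).strip(), phone))
    return people


# Abridged versions of the two rows from the question.
ROW_WITH_PHONE = (
    '<tr> <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html">'
    ' dr. Ivan Bratko akad. prof.</a></div></td> '
    '<td class="lablinksPhone"><div><strong>T:</strong> +386 1 4768 393 </div></td> </tr>'
)
ROW_WITHOUT_PHONE = (
    '<tr> <td class="lablinksName"><div> dr. Branko Matjaž Jurič prof.</div></td> '
    '<td class="lablinksPhone"><div> </div></td> </tr>'
)

for person in parse_people(ROW_WITH_PHONE + " " + ROW_WITHOUT_PHONE):
    print(person)
```

Because each search is confined to a single row, a missing number can only ever produce an empty string for that person.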