Hauba

Reputation: 75

Python regex help

I am trying to write a regex that finds every name, URL and phone number in an HTML page, but I'm having trouble with the phone number part. I think the problem is that it searches until it finds a </strong>, and in doing so it skips people instead of returning an empty string when a person has no phone number. Simply put, instead of a list like url1+name1+num1 | url2+name2+"" | url3+name3+num3, it returns a list like url1+name1+num1 | url2+name2+num3, with url3+name3 lost in the process.

for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):

I am searching for people in a single very long line. A person may or may not have a URL or a phone number. An example of a person with a URL and a phone number:

 <tr>  <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html"> dr. Ivan Bratko  akad. prof.</a></div></td>  <td class="lablinksMail"><a href="javascript:void(cmPopup('sendMessage', '/si/ivan-bratko/mailer.html', true, 350, 350));"><img src="/Static/images/gui/mail.gif" height="8" width="11"></a></td> <td class="lablinksPhone"><div><strong>T:</strong> +386  1 4768 393 </div></td> </tr>

And an example of a person with no URL or phone number:

 <tr>  <td class="lablinksName"><div> dr. Branko Matjaž  Jurič   prof.</div></td>  <td class="lablinksMail"><a href="javascript:void(cmPopup('sendMessage', '/si/branko-matjaz-juric/mailer.html', true, 350, 350));"><img src="/Static/images/gui/mail.gif" height="8" width="11"></a></td> <td class="lablinksPhone"><div> </div></td> </tr>

I hope I was clear enough. Can anyone help me?

Upvotes: 0

Views: 221

Answers (4)

Brian

Reputation: 25824

The quick and dirty way to fix it:

Replace

for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):

with

for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page.replace("<tr>","\n")):

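The issue is that the .*? in .*?</strong> can match almost anything, including strings containing td class="lablinksMail and the start of the next row, so when a person has no phone number it runs on to the next person's </strong>. It cannot match \n (unless re.DOTALL is set), so putting each row on its own line stops the search at the end of the row. Any time you use . in a regex (rather than [^<]), this kind of annoyance tends to happen.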
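To see the difference, here is a minimal sketch that runs the question's pattern unchanged over three rows joined on one line, with and without the replace. The first two rows are trimmed from the question's samples; the third row ("Example Person" and its number) is a made-up placeholder, and extract is just a helper name chosen for this illustration.

```python
import re

# Pattern copied unchanged from the question.
PATTERN = 'Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?'

# Three rows on a single line. Rows 1 and 2 are trimmed from the
# question's samples; row 3 is a made-up placeholder.
page = (
    '<tr> <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html"> dr. Ivan Bratko</a></div></td>'
    ' <td class="lablinksPhone"><div><strong>T:</strong> +386  1 4768 393 </div></td> </tr>'
    '<tr> <td class="lablinksName"><div> dr. Branko Matjaž  Jurič   prof.</div></td>'
    ' <td class="lablinksPhone"><div> </div></td> </tr>'
    '<tr> <td class="lablinksName"><div><a href="/si/example-person/default.html"> Example Person</a></div></td>'
    ' <td class="lablinksPhone"><div><strong>T:</strong> +386  1 0000 000 </div></td> </tr>'
)


def extract(text):
    """Return (url, name, phone) tuples with surrounding whitespace stripped."""
    return [tuple(field.strip() for field in match)
            for match in re.findall(PATTERN, text)]


broken = extract(page)                       # everything on one line
fixed = extract(page.replace("<tr>", "\n"))  # one row per line

print(len(broken), len(fixed))
for url, name, pnumber in fixed:
    print(repr(url), repr(name), repr(pnumber))
```

With these three rows the unmodified call returns two tuples (the second person picks up the third person's number) and the version with the replace returns three, with an empty string for the missing number.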

Upvotes: 0

Hugh Bothwell

Reputation: 56624

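import lxml.html

# Parse the page and select the rows of the contact table.
root = lxml.html.parse("http://my.example.com/page.html").getroot()
rows = root.xpath("//table[@id='contactinfo']/tr")

for r in rows:
    # The name is either bare text in the div or the text of a link inside it.
    name_text = r.xpath("td[@class='lablinksName']/div/text() | td[@class='lablinksName']/div/a/text()")
    name = ''.join(name_text).strip()

    # The link is optional, so fall back to an empty string.
    urls = r.xpath("td[@class='lablinksName']/div/a/@href")
    url = urls[0] if urls else ''

    # text() skips the <strong>T:</strong> label and keeps only the number.
    phone_text = r.xpath("td[@class='lablinksPhone']/div/text()")
    phone = ''.join(phone_text).strip()

    print(name, url, phone)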

For the purpose of this code, I assume <table id="contactinfo">{your table rows}</table>.

Upvotes: 1

Paulo Scardine

Reputation: 77251

Looks like a job for Beautiful Soup.

I love the quote: "You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

Upvotes: 0

Jay Maynard K5ZC

Reputation: 360

If you're having this kind of difficulty, it's usually a good sign you're using the wrong approach. In particular, if I were doing this via regexp, I wouldn't even try unless the line in question had the "<td class="lablinksPhone">" tag in it.
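One way to read that advice, as a rough sketch only: split the page into rows first, then look for a number only inside a row that actually has the phone cell. The names NAME_RE, PHONE_RE and parse_rows are made up for this illustration, and the sample rows are the two from the question with the mail cell trimmed.

```python
import re

# Hypothetical patterns for this sketch, each applied to one row at a time.
NAME_RE = re.compile(r'lablinksName"><div>(?:<a href="/si([^">]*)">)?\s*([^<]*)')
PHONE_RE = re.compile(r'</strong>([^<]*)')


def parse_rows(page):
    """Return a list of (url, name, phone) tuples, one per person row."""
    people = []
    for row in page.split("<tr>"):
        name_match = NAME_RE.search(row)
        if name_match is None:
            continue  # not a person row, e.g. the text before the first <tr>
        url = name_match.group(1) or ""
        name = name_match.group(2).strip()
        phone = ""
        # Only try the phone pattern when the row has the phone cell at all.
        if 'class="lablinksPhone"' in row:
            phone_match = PHONE_RE.search(row)
            if phone_match:
                phone = phone_match.group(1).strip()
        people.append((url, name, phone))
    return people


# The two sample rows from the question, with the mail cell trimmed.
page = (
    ' <tr>  <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html"> dr. Ivan Bratko  akad. prof.</a></div></td>'
    ' <td class="lablinksPhone"><div><strong>T:</strong> +386  1 4768 393 </div></td> </tr>'
    ' <tr>  <td class="lablinksName"><div> dr. Branko Matjaž  Jurič   prof.</div></td>'
    ' <td class="lablinksPhone"><div> </div></td> </tr>'
)

for url, name, phone in parse_rows(page):
    print(repr(url), repr(name), repr(phone))
```

Because each search is confined to a single row, a missing number can never be filled in from the next person.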

Upvotes: 0
