SIM

Reputation: 22440

Unable to create appropriate selectors to parse some information

I have written a Python script that uses CSS selectors to parse a name and a phone number from a webpage. The script does not give me the results I expect; some information I don't want comes along as well. How can I fix my selectors so that they pick out only the name and the phone number and nothing else? For reference, I've pasted a link to the HTML at the bottom. Thanks in advance.

Here is what I've written:

from lxml.html import fromstring
root = fromstring(html)
for tree in root.cssselect(".cbFormTableEvenRow"):
    try:
        name = tree.cssselect(".cbFormDataCell span.cbFormData")[0].text
    except:
        name = ""
    try:
        phone = tree.cssselect(".cbFormLabel:contains('Phone Number')+td.cbFormDataCell .cbFormData")[0].text
    except:
        phone = ""
    print(name,phone)

Results I expect:

JAYMES CARTER (402)499-8846

Results I'm getting:

1840390831 
RESIDENTIAL 
JAYMES CARTER (402)499-8846
  
None 
My valuation jumped by almost $60,000 in one year. There are multiple comparable properties nearby that are much lower than my $194,300 evaluation, and a lot closer to my 2016 year evaluation of $134,400. 

Link to the html file:

https://www.dropbox.com/s/64apg5cjpssd3hb/html_table.html?dl=0

Upvotes: 0

Views: 51

Answers (1)

Bill Bell

Reputation: 21663

Find the tr element that holds the 'Phone Number' label cell: it is the parent of that td and the grandparent of the data spans. From there, select the td elements for the desired items and follow the hierarchy down from these to their texts.

>>> from lxml.html import fromstring
>>> root = fromstring(open('html_table.html').read())
>>> grand_parent = root.xpath('.//td[contains(text(),"Phone Number")]/..')[0]
>>> grand_parent.xpath('td[1]/span/text()')[0]
'JAYMES CARTER'
>>> grand_parent.xpath('td[5]/span/text()')[0]
'(402)499-8846'
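
The same idea can be wrapped in a small function. This is only a sketch: the linked file is not reproduced here, so SAMPLE_HTML below is a hypothetical stand-in whose layout is assumed from the class names in the question, and extract_name_and_phone is an illustrative name, not part of lxml. Instead of a fixed td[5], it takes the first td after the label cell, which may be less sensitive to the number of columns.

```python
from lxml.html import fromstring

# Hypothetical markup: the real html_table.html may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr class="cbFormTableEvenRow">
    <td class="cbFormDataCell"><span class="cbFormData">1840390831</span></td>
  </tr>
  <tr class="cbFormTableEvenRow">
    <td class="cbFormDataCell"><span class="cbFormData">JAYMES CARTER</span></td>
    <td class="cbFormLabel">Phone Number</td>
    <td class="cbFormDataCell"><span class="cbFormData">(402)499-8846</span></td>
  </tr>
</table>
"""

LABEL_CELL = 'td[contains(text(), "Phone Number")]'


def extract_name_and_phone(html):
    """Return (name, phone) from the row holding the label, or None."""
    root = fromstring(html)
    # Parent tr of the td whose text contains the label.
    rows = root.xpath('.//' + LABEL_CELL + '/..')
    if not rows:
        return None
    row = rows[0]
    # Assumption: the name sits in the first cell of that row.
    name = row.xpath('td[1]/span/text()')
    # The phone is taken from the first td following the label cell.
    phone = row.xpath(LABEL_CELL + '/following-sibling::td[1]/span/text()')
    if not name or not phone:
        return None
    return name[0].strip(), phone[0].strip()


print(extract_name_and_phone(SAMPLE_HTML))
```

Because only the row containing the label is consulted, the other cbFormTableEvenRow rows never reach the output.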

Addendum in response to comment:

>>> grand_parent.xpath('.//span[@class="cbFormData"]/text()')
['JAYMES CARTER', '\xa0', '(402)499-8846']
>>> items = grand_parent.xpath('.//span[@class="cbFormData"]/text()')
>>> [_.replace('\xa0', '').strip() for _ in items]
['JAYMES CARTER', '', '(402)499-8846']
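
The cleaned list still has an empty string where the non-breaking space was. If only the name and phone number are wanted, the blanks can be dropped as well. A minimal sketch in plain Python, where clean_items is a hypothetical helper name and the raw list is copied from the output above:

```python
def clean_items(items):
    """Strip non-breaking spaces and whitespace, then drop empty entries."""
    cleaned = (item.replace('\xa0', '').strip() for item in items)
    return [item for item in cleaned if item]


raw = ['JAYMES CARTER', '\xa0', '(402)499-8846']
name, phone = clean_items(raw)
print(name, phone)  # JAYMES CARTER (402)499-8846
```

The two-name unpacking assumes exactly two non-empty values remain; it raises ValueError otherwise.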

Upvotes: 1
