foosion

Reputation: 7918

Parsing xpath with python

I'm trying to parse a web page that contains this:

<table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;">
<tr>
 <td colspan="2"
     style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">13°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>

(It continues with more rows and ends with </table>.)

from lxml import html

# page holds the HTML source of the web page as a string
tree = html.fromstring(page)
table = tree.xpath('//table/tr')
for item in table:
    for elem in item.xpath('*'):
        if 'colspan' in html.tostring(elem, encoding='unicode'):
            print('*', elem.text)
        elif elem.text is not None:
            print(elem.text, end=' ')
        else:
            print()

The code above somewhat works, but it does not get the text following the <br /> tags, and it's far from elegant. How do I get the missing text? Any suggestions for improving the code would also be appreciated.

Upvotes: 1

Views: 2404

Answers (1)

alecxe

Reputation: 474161

How about using .text_content()?

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.
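As background on why elem.text misses that text: in the ElementTree model that lxml follows, an element's .text only covers the text before its first child, and whatever comes after a child such as <br /> is stored on that child as .tail. Here is a minimal sketch of this, using the standard library's xml.etree.ElementTree instead of lxml purely so that it runs without extra installs. The cell fragment is copied from the question, and itertext() stands in for .text_content() as a rough equivalent.

```python
import xml.etree.ElementTree as ET

# One cell from the question; it is well-formed, so the stdlib XML parser accepts it.
cell = ET.fromstring(
    '<td>Clear<br />Precip: 0 %<br />Wind: from the WSW at 6 mph</td>'
)

# .text stops at the first child element.
print(repr(cell.text))

# The text after each <br /> lives in that <br /> element's .tail.
tails = [br.tail for br in cell.findall('br')]
print(tails)

# itertext() walks every .text and .tail in document order.
full_text = ' '.join(cell.itertext())
print(full_text)
```

So the original loop would have needed to read the .tail of every <br /> by hand, which is what .text_content() saves you from.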

table = tree.xpath('//table/tr')
for item in table:
    # collapse runs of whitespace in the row's text into single spaces
    print(' '.join(item.text_content().split()))

The join() and split() calls collapse every run of whitespace (including the newlines and indentation in the source) into a single space.

It prints:

February 20, 2015
9:00 PM 14°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
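The whitespace cleanup can be checked on its own, without any parsing. This is a small sketch: normalize_ws is a hypothetical helper name, and the raw string only approximates what .text_content() might return for one of the detail rows.

```python
def normalize_ws(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    # split() with no arguments splits on any whitespace run and drops empty pieces;
    # join() then glues the pieces back together with single spaces.
    return ' '.join(text.split())


# Assumed raw row text, padded with newlines and indentation like the page source.
raw = "Clear\n  Precip:\n  0 %\n                    Wind:\n        from the WSW at 6 mph\n "
normalized = normalize_ws(raw)
print(normalized)
```

Because split() is called without a separator, tabs and newlines are handled the same way as spaces.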

Since you want to merge each time row with the Precip row that follows it, you can iterate over the tr elements while skipping those that contain "Precip" in their text. For every time row, take the next tr sibling to get the matching Precip row:

table = tree.xpath('//table/tr[not(contains(., "Precip"))]')
for item in table:
    text = ' '.join(item.text_content().split())
    if 'AM' in text or 'PM' in text:
        # a time row: append the text of the tr that immediately follows it
        text += ' ' + ' '.join(item.xpath('following-sibling::tr[1]')[0].text_content().split())

    print(text)

Prints:

February 20, 2015
9:00 PM 14°F Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F Clear Precip: 0 % Wind: from the WSW at 6 mph
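If you already have the cleaned-up row texts from the first snippet, the same merge could also be done in plain Python with a single pass, instead of the XPath filter plus following-sibling lookup. This is only a rough sketch of that alternative: merge_rows is a hypothetical helper, not part of lxml, and it assumes that every row containing AM or PM is directly followed by its detail row.

```python
def merge_rows(row_texts):
    """Merge each time row with the detail row that follows it."""
    merged = []
    rows = iter(row_texts)
    for text in rows:
        if 'AM' in text or 'PM' in text:
            # A time row: consume the next row (the Precip/Wind details) and append it.
            detail = next(rows, '')
            text = (text + ' ' + detail).strip()
        merged.append(text)
    return merged


# Row texts as printed by the first snippet.
row_texts = [
    'February 20, 2015',
    '9:00 PM 14°F',
    'Clear Precip: 0 % Wind: from the WSW at 6 mph',
    '10:00 PM 13°F',
    'Clear Precip: 0 % Wind: from the WSW at 6 mph',
]
for line in merge_rows(row_texts):
    print(line)
```

The XPath version is probably the better fit when you still have the tree at hand; this one is mostly handy for testing the pairing logic separately.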

Upvotes: 2
