goh

Reputation: 29511

scrapy xpath help needed

I'm new to xpath so please bear with me. Currently, I'm looking to use scrapy to scrape some content off some webpages, and the content looks something like this:

     <td colspan="3" valign="top" class="regular">Landsize: 84,000sq with an extensive shoreline 750m<br />
<br />
Call Or Email for more info<br />

. Full-length Olympicpool,children pool,jacuzzi<br />
\' Landscapesdkey bridges<br />
. 2 tennis courts<br />
. water features True seafront development with iconic design by architect Daniel Libeskind<br />
lconic residential, located less than\' 150 metres from the shoreline<br />
<br />
opposite the future integrated resort on sentosa Island.<br />

A part of keppel Bay world calss water front precinct with luxury homes.<br />
<br />
Call or email for more info </td>

Specifically, I'm using the following:

hxs.select('//tr[contains(td,"Description")]/following-sibling::tr[1]/td/text()').extract()

However, doing this will break the resulting item into a list due to the content being separated by <br>. If I exclude text() from the xpath, the <td> element is included in the resulting string which is not desirable.

Is there a way in xpath to ensure my resulting string is everything that's shown above but without the td tags? I hope I do not need to manually rejoin the list on <br/>.

Upvotes: 0

Views: 1117

Answers (3)

Dimitre Novatchev

Reputation: 243479

Judging from your comment to Evan's correct answer, you want to skip the newlines.

In this case, try:

normalize-space(//tr[contains(td,"Description")]/following-sibling::tr[1]/td)

Note:

  1. If the argument to normalize-space() selects more than one node, this function will return the result of processing only the first selected node.

  2. All leading and trailing white-space characters are deleted. All intermediate groups of adjacent white-space characters are replaced by a single space character.
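The whitespace rule in note 2 can be sketched in plain Python. This is only an illustration using the standard library, not Scrapy: the markup is a shortened, hypothetical stand-in for the page in the question, and normalize_space is a helper name made up here that imitates the XPath function on a Python string.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the <td> in the question.
SAMPLE = (
    "<td>Landsize: 84,000sq<br />\n"
    "<br />\n"
    "  Call Or Email   for more info<br />\n"
    "</td>"
)


def normalize_space(value):
    """Imitate XPath 1.0 normalize-space() on a Python string.

    XPath treats space, tab, carriage return and newline as whitespace.
    """
    # Collapse every run of XPath whitespace into one space ...
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    # ... then drop the single space that may remain at either end.
    return collapsed.strip(" ")


td = ET.fromstring(SAMPLE)
raw = "".join(td.itertext())   # string-value: all descendant text nodes joined
clean = normalize_space(raw)
print(clean)
```

Note that only whitespace already present in the text is collapsed. If two pieces of text were separated by a <br /> with no newline or space around it, they would run together, because the <br /> element itself contributes no text.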

Upvotes: 3

Steve Wellens

Reputation: 20620

You might find the HTML Agility Pack useful for parsing web pages.

Upvotes: 0

Evan Lenz

Reputation: 4126

Try wrapping your expression in a call to string(). It returns the string-value of a node, which for an element is the concatenation of all of its descendant text nodes.

string(//tr[contains(td,"Description")]/following-sibling::tr[1]/td)
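To show what that string-value looks like, here is a minimal sketch that uses only Python's standard library rather than Scrapy. The markup is a shortened, hypothetical stand-in for the page in the question, and string_value is a helper name made up for this example. ElementTree does not evaluate string() itself, so the helper imitates it by joining the descendant text nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the two table rows in the question.
SAMPLE = (
    "<table>"
    "<tr><td>Description</td></tr>"
    "<tr><td>Landsize: 84,000sq<br />\n"
    "<br />\n"
    "Call Or Email for more info<br />\n"
    "</td></tr>"
    "</table>"
)


def string_value(element):
    """Join every descendant text node in document order."""
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)
rows = root.findall("tr")
td = rows[1].find("td")        # the row that follows the "Description" row
text = string_value(td)
print(repr(text))
```

The result is a single string with no <td> or <br /> markup in it, but the newlines that followed each <br /> in the source are still there.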

Upvotes: 0
