Reputation: 29511
I'm new to XPath, so please bear with me. I'm using Scrapy to scrape content from some webpages, and the content looks something like this:
<td colspan="3" valign="top" class="regular">Landsize: 84,000sq with an extensive shoreline 750m<br />
<br />
Call Or Email for more info<br />
. Full-length Olympicpool,children pool,jacuzzi<br />
\' Landscapesdkey bridges<br />
. 2 tennis courts<br />
. water features True seafront development with iconic design by architect Daniel Libeskind<br />
lconic residential, located less than\' 150 metres from the shoreline<br />
<br />
opposite the future integrated resort on sentosa Island.<br />
A part of keppel Bay world calss water front precinct with luxury homes.<br />
<br />
Call or email for more info </td>
Specifically, I'm using the following:
hxs.select('//tr[contains(td,"Description")]/following-sibling::tr[1]/td/text()').extract()
However, this breaks the result into a list, because the content is separated by <br> elements. If I exclude text() from the XPath, the <td> element itself is included in the resulting string, which is not desirable.
Is there a way in XPath to ensure my resulting string is everything shown above, but without the td tags? I hope I don't need to manually join the list back together on <br/>.
Upvotes: 0
Views: 1117
Reputation: 243479
Judging from your comment to Evan's correct answer, you want to skip the newlines.
In this case, try:
normalize-space(//tr[contains(td,"Description")]/following-sibling::tr[1]/td)
Note:
If the argument to normalize-space() selects more than one node, this function will return the result of processing only the first selected node.
All leading and trailing white-space characters are deleted. All intermediate groups of adjacent white-space characters are replaced by a single space character.
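To make the whitespace rule concrete, here is a minimal Python sketch of what normalize-space() does to a string that has already been extracted. The helper name normalize_space and the sample text are my own, for illustration only; in practice the XPath engine does this for you.

```python
import re

# The four characters XPath 1.0 treats as white space: space, tab, CR, LF.
_XPATH_WS = " \t\r\n"


def normalize_space(text):
    """Approximate XPath normalize-space() for an already extracted string."""
    # Replace every run of white-space characters with a single space...
    collapsed = re.sub("[" + _XPATH_WS + "]+", " ", text)
    # ...then drop the leading and trailing space, if any.
    return collapsed.strip(" ")


# Hypothetical sample: a shortened version of the text inside the <td>.
sample = (
    "Landsize: 84,000sq with an extensive shoreline 750m\n"
    "\n"
    "Call Or Email for more info\n"
    ". 2 tennis courts \n"
)

if __name__ == "__main__":
    print(normalize_space(sample))
```

Each newline left behind by a <br> ends up as a single space, so the result is one line of text.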
Upvotes: 3
Reputation: 20620
You might find the HTML Agility Pack useful for parsing web pages.
Upvotes: 0
Reputation: 4126
Try wrapping your expression in a call to string(). It returns the string-value of a node; for an element, that is the concatenation of the string-values of all its descendant text nodes, in document order:
string(//tr[contains(td,"Description")]/following-sibling::tr[1]/td)
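As a rough illustration of the difference between selecting text() and taking the string-value, here is a small sketch using only Python's standard library. It does not evaluate XPath; it mimics the two results on a shortened, made-up copy of the <td>, and the helper names text_nodes and string_value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the <td> in the question.
TD = ET.fromstring(
    '<td class="regular">Landsize: 84,000sq<br />\n'
    '<br />\n'
    'Call or email for more info<br />\n'
    '. 2 tennis courts</td>'
)


def text_nodes(element):
    """Roughly what .../td/text() yields: one entry per text node."""
    return list(element.itertext())


def string_value(element):
    """Roughly what string(.../td) yields: all text nodes concatenated."""
    return "".join(element.itertext())


if __name__ == "__main__":
    print(text_nodes(TD))    # a list, split wherever a <br /> occurs
    print(string_value(TD))  # one string, with no td or br tags in it
```

The newlines that follow each <br /> are still part of the string-value, which is why normalize-space() is the better fit if you want them collapsed. Assuming a recent Scrapy version, passing the string(...) expression to response.xpath(...).get() should give you a single string in the same way.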
Upvotes: 0