Reputation: 1821
Disclaimer: new to scrapy.
I have a table with pretty irregular rows. The basic structure is:
<tr>
<td> some text </td>
<td> some other text </td>
<td> yet some text </td>
</tr>
but occasionally (a few hundred times) some rows are
<tr>
<td> <p> some text <p> </td>
<td> <div class="class-whateva"> <p> some other text </p></div> </td>
<td> <span id="strange-id">
<a href="somelink"> yet some text </a>
<span> </td>
</tr>
or other permutations of one or two nested "p", "div", and "span" elements, with or without newline characters.
I've taken care of the nested "span span" or "p div" or "div span" with conditional statements of the form:
    for row in allrows:
        if row.select('td[2]/text()'):
            item['seconditem'] = row.select('td[2]/text()').extract()
        elif row.select('td[2]/*/text()'):
            item['seconditem'] = row.select('td[2]/*/text()').extract()
        elif row.select('td[2]/*/*/text()'):
            item['seconditem'] = row.select('td[2]/*/*/text()').extract()
Now I have two questions:
(1) Is a conditional on td[2]/*/*/text() the right way to handle irregularly nested rows?
(2) I am still missing all the cases where there is a return (or newline) before the tag. So if the row is of the form:
<td><div>
<p>text </p>
</div></td>
All my XPath returns is ['\n ']. Is there a trick to catch what comes after the newline character?
Any tips appreciated. Thanks.
Upvotes: 2
Views: 3153
Reputation: 9522
You can use the string() function in an XPath expression to get all inner text nodes as one string:
# nested.html - your second html snippet
# $scrapy shell "nested.html"
In [1]: row = hxs.select('//tr')
In [2]: row.select('td[2]').select('string()').extract()
Out[2]: [u' some other text ']
In [3]: row.select('td[2]').select('string()').extract()[0]
Out[3]: u' some other text '
In [4]: row.select('td[3]').select('string()').extract()[0]
Out[4]: u' \r\n yet some text \r\n '
Or use //text() to get all inner text nodes:
In [5]: row.select('td[3]//text()').extract()
Out[5]: [u' \r\n ', u' yet some text ', u' \r\n ', u' ']
And ''.join(...) to combine them into a single string:
In [6]: ''.join(row.select('td[3]//text()').extract())
Out[6]: u' \r\n yet some text \r\n '
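The joined string still carries the surrounding whitespace and newlines, so a cleanup step is usually wanted. Below is a minimal sketch of the whole idea (join every descendant text node, then collapse whitespace) that runs without Scrapy. It assumes well-formed markup and uses the standard library's xml.etree.ElementTree in place of a Scrapy selector; SAMPLE and cell_text are hypothetical names made up for this illustration, and the sample table is a tidied version of the rows in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one regular row and one irregularly nested row.
# ElementTree needs well-formed markup, so every tag here is closed.
SAMPLE = """<table>
<tr>
<td> some text </td>
<td> some other text </td>
<td> yet some text </td>
</tr>
<tr>
<td> <p> some text </p> </td>
<td><div>
<p>some other text </p>
</div></td>
<td> <span id="strange-id">
<a href="somelink"> yet some text </a>
</span> </td>
</tr>
</table>"""


def cell_text(cell):
    """Join every descendant text node, then collapse runs of whitespace."""
    raw = "".join(cell.itertext())  # same idea as ''.join(td//text())
    return " ".join(raw.split())    # trims the ends and squeezes newlines/spaces


root = ET.fromstring(SAMPLE)
rows = [[cell_text(td) for td in tr.findall("td")] for tr in root.findall("tr")]
for row in rows:
    print(row)
```

Both rows come out as the same three cleaned strings, so no if/elif chain over nesting depth is needed. In a spider, the same `" ".join(raw.split())` step could be applied to the string returned by the string() or //text() calls above.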
Upvotes: 3