Reputation: 185
I'm scraping a page with Scrapy. I want the inner HTML of the td with the "text" class:
<tr valign="top">
  <td class="text" width="100%">
    <b>A bunch of HTML</b>
    <ul type="disc">
      <li>Some random text</li>
    </ul>
  </td>
</tr>
This is my Scrapy line:
for body in response.css('td.text'):
    yield {'body': body.extract()}
Which works - except it includes the surrounding td:
[
{"body": "<td class=\"text\" width=\"100%\"> <b>A bunch of HTML</b> <ul type=\"disc\"> <li>Some random text</li> </ul> </td>"}
]
This is what I want:
[
{"body": "<b>A bunch of HTML</b> <ul type=\"disc\"> <li>Some random text</li> </ul>"}
]
Halp? :)
Upvotes: 2
Views: 217
Reputation: 812
Try this selector:
response.css('td.text *')
The * will select all inner tags.
Upvotes: 1
Reputation: 185
Well, I found a solution, although I still think there must be a smarter way:
bodies = ''
# child::node() matches the td's element and text children, not the td itself
for body in response.xpath("//td[@class='text']/child::node()"):
    bodies += body.extract()
yield {'body': bodies}
Upvotes: 0