Benjamin Rasmussen

Reputation: 185

Scrapy - getting HTML without outer tag

I'm scraping a page with Scrapy. I want the inner HTML of the td with the "text" class, without the td tag itself:

<tr valign="top">
  <td class="text" width="100%">
    <b>A bunch of HTML</b>

    <ul type="disc">
      <li>Some random text</li>
    </ul>
  </td>
</tr>

This is my Scrapy line:

for body in response.css('td.text'):
  yield {'body': body.extract()}

Which works - except it includes the surrounding td:

[
  {"body": "<td class=\"text\" width=\"100%\"> <b>A bunch of HTML</b> <ul type=\"disc\"> <li>Some random text</li> </ul> </td>"}
]

This is what I want:

[
  {"body": "<b>A bunch of HTML</b> <ul type=\"disc\"> <li>Some random text</li> </ul>"}
]

Halp? :)

Upvotes: 2

Views: 217

Answers (2)

Mohamed Yasser

Reputation: 812

Try this selector:

response.css('td.text *')

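The * selects every element nested inside the td, at any depth. Note that each element comes back as a separate match (so the li appears both inside the ul and on its own), and bare text nodes are not included.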

Upvotes: 1

Benjamin Rasmussen

Reputation: 185

Well, I found a solution, although I still think there must be a smarter way:

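    # Concatenate every child node (elements and text) of the matching td,
    # which leaves out the td tag itself.
    bodies = ''
    for body in response.xpath("//td[@class='text']/child::node()"):
        bodies += body.extract()
    yield {'body': bodies}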

Upvotes: 0
