Massagran

Reputation: 1821

scrapy with newline characters and nested tags

Disclaimer: new to scrapy.

I have a table with fairly irregular rows. The basic structure is:

<tr>
 <td> some text </td>
 <td> some other text </td>
 <td> yet some text </td>
</tr>

but occasionally (a few hundred times) some rows look like this:

<tr>
 <td> <p> some text <p> </td>
 <td> <div class="class-whateva"> <p> some other text </p></div> </td>
 <td> <span id="strange-id"> 
  <a href="somelink"> yet some text </a> 
    <span> </td>
</tr>

or other permutations of one or two nested "p", "div", and "span" tags, with or without newline characters.

I've handled the nested "span span", "p div", and "div span" cases with conditional statements of the form:

for row in allrows:
    if row.select('td[2]/text()'):
        item['seconditem'] = row.select('td[2]/text()').extract()
    elif row.select('td[2]/*/text()'):
        item['seconditem'] = row.select('td[2]/*/text()').extract()
    elif row.select('td[2]/*/*/text()'):
        item['seconditem'] = row.select('td[2]/*/*/text()').extract()

Now I have two questions:

(1) Is a chain of conditionals ending in

td[2]/*/*/text()

the right way to handle irregularly nested rows?

(2) I am still missing all the cases where there is a newline (or carriage return) before the tag. For example, if the row is of the form:

   <td><div>
      <p>text </p>
   </div></td>

All my XPath returns is ['\n ']. Is there a trick to get the text that comes after the newline character?

Any tips appreciated. Thanks.

Upvotes: 2

Views: 3153

Answers (1)

reclosedev

Reputation: 9522

You can use the XPath string() function to get all of an element's inner text nodes as a single string:

# nested.html contains your second HTML snippet
# $ scrapy shell "nested.html"

In [1]: row = hxs.select('//tr')

In [2]: row.select('td[2]').select('string()').extract()
Out[2]: [u'   some other text  ']

In [3]: row.select('td[2]').select('string()').extract()[0]
Out[3]: u'   some other text  '

In [4]: row.select('td[3]').select('string()').extract()[0]
Out[4]: u'  \r\n   yet some text  \r\n     '
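
To see why the original expressions only returned whitespace, here is a minimal sketch outside Scrapy, using only the standard library's xml.etree.ElementTree on the cell from the question. It is an illustration rather than Scrapy's own API: itertext() is used here as a rough stand-in for collecting every descendant text node, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The cell from the question: the <div> starts with a newline before <p>.
cell = ET.fromstring("<td><div>\n      <p>text </p>\n   </div></td>")

div = cell.find("div")
# The div's own leading text node is only the newline plus indentation,
# which is what a one-level text() step ends up returning.
leading = div.text

# itertext() walks every descendant text node in document order.
pieces = list(cell.itertext())
full = "".join(pieces)

print(repr(leading))
print(pieces)
print(repr(full.strip()))
```

The real text lives one level deeper, inside the p element, so a fixed-depth expression finds the whitespace node first, while anything that gathers all descendants also picks up the text itself.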

Or use //text() to get all inner text nodes as a list:

In [5]: row.select('td[3]//text()').extract()
Out[5]: [u' \r\n  ', u' yet some text ', u' \r\n    ', u' ']

And use ''.join(...) to combine them into a single string:

In [6]: ''.join(row.select('td[3]//text()').extract())
Out[6]: u' \r\n   yet some text  \r\n     '
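
The joined string still carries the newlines and padding, so you will probably want to clean it up before storing it in the item. A small sketch in plain Python, where clean_text is a hypothetical helper name (not part of Scrapy) and the input list is the one shown in Out[5]:

```python
def clean_text(nodes):
    """Join extracted text nodes and collapse every run of whitespace."""
    # str.split() with no arguments splits on any whitespace, including
    # \r and \n, and drops the empty pieces at both ends.
    return " ".join("".join(nodes).split())


# Text nodes as returned by the td[3]//text() query above.
nodes = [' \r\n  ', ' yet some text ', ' \r\n    ', ' ']
result = clean_text(nodes)
print(result)
```

XPath 1.0 also defines normalize-space(), which trims and collapses whitespace in the same way, so it may be possible to do this inside the expression instead of in Python.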

Upvotes: 3
