M.Ridha

Reputation: 553

Using scrapy to extract multiple data in table td elements

I am new to scrapy and try to use it to extract the following data "name", "address", "state", "postal_code" from the sample html code below:

<div id="superheroes">
<table width="100%" border="0">
  <tr>
  <td valign="top">
  <h2>Superheroes in New York</h2>
  <hr/>
  </td>
  </tr>
  <tr valign="top">
    <td width="75%">                    
      <h2>Peter Parker</h2>
      <hr />
      <table width="100%">
        <tr valign="top">
          <td width="13%" height="70" valign="top"><img src="/img/spidey.jpg"/></td>
          <td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
            <strong>State:</strong>New York<br/>
            <strong>Postal Code:</strong>12345<br/>
            <strong>Telephone:</strong> 555-123-4567</td>
        </tr>
        <tr>
          <td height="18" valign="top">&nbsp;</td>
          <td align="right" valign="top"><a href="spiderman"><strong>Read More</strong></a></td>
        </tr>
      </table>
      <h2>Tony Stark</h2>
      <hr />
      <table width="100%" border="0" cellpadding="2" cellspacing="2" valign="top">
        <tr valign="top">
          <td width="13%" height="70" valign="top"><img src="/img/ironman.jpg"/></td>
          <td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
            <strong>State:</strong> New York<br/>
            <strong>Postal Code:</strong> 54321<br/>
            <strong>Telephone:</strong> 555-987-6543</td>
        </tr>
        <tr>
          <td height="18" valign="top">&nbsp;</td>
          <td align="right" valign="top"><a href="iron_man"><strong>Read More</strong></a></td>
        </tr>
      </table>
    </td>
    <td width="25%">
       <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
    </td>
  </tr>
</table>
</div>

My superheroes.py contains the following code:

from scrapy.spider import CrawlSpider, Rule
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from superheroes.items import Superheroes

items = []

class MySpider(CrawlSpider):
  name = "superheroes"
  allowed_domains = ["www.somedomain.com"]
  start_urls = ["http://www.somedomain.com/ny"]
  rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

   def parse_item(self, response):
     sel = Selector(response)
     tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
     for table in tables:
        item = Superheroes()
        item['name'] = table.xpath('h2/text()').extract()
        item['address'] = table.xpath('/tr[1]/td[2]/strong[1]/text()').extract()
        item['state'] = table.xpath('/tr[1]/td[2]/strong[2]/text()').extract()
        item['postal_code'] = table.xpath('/tr[1]/td[2]/strong[3]/text()').extract()
        items.append(item)
     return items

And my items.py contains:

import scrapy
class Superheroes(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    state = scrapy.Field()
    postal_code = scrapy.Field()    

When I ran "scrapy runspider superheroes.py -o super_db -t csv", the output file was empty.

Could anyone help me with any error in my code above?

Thanks so much for your help!

Upvotes: 0

Views: 2233

Answers (2)

Greg Sadetsky

Reputation: 5092

There were two issues with your code. First, your parse_item method did not seem to be indented (at least, that's how it looks in your question), and thus would not be included in the MySpider class. Every line in superheroes.py starting at def parse_item(self, response): needs to have two spaces in front of it.

The second problem is that rules states that parse_item should be called for every link (i.e., everything SgmlLinkExtractor matches) found in the page. You can see in the output that it tries to fetch /iron_man and /spiderman -- those are the pages whose responses would be passed to parse_item.

To process the start_urls with your parse_item, you need to rename it to parse_start_url. If there is only one page to process, you can even get rid of the rules! (see the documentation about parse_start_url).

Your updated class looks like this (note that I also moved items inside of the method; there is no need to declare it as a global):

class MySpider(CrawlSpider):
  name = "superheroes"
  allowed_domains = ["localhost"]
  start_urls = ["http://localhost:8000/page.html"]

  # indentation!
  def parse_start_url(self, response):
    sel = Selector(response)
    headers = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]/h2')
    for header in headers:
      item = Superheroes()

      item['name'] = header.xpath('text()')[0].extract()

      table = header.xpath('following-sibling::table')
      item['address'] = table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].extract().strip()
      item['state'] = table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].extract().strip()
      item['postal_code'] = table.xpath('tr[1]/td[2]/strong[3]/following-sibling::text()')[0].extract().strip()

      yield item

Edit: Thanks to @Daniil Mashkin for pointing out that the original xpath expressions did not work. I corrected them in the code above. Cheers!
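As a quick way to verify the corrected expressions outside of a crawl, here is a standalone sketch using lxml in place of Scrapy's Selector (Scrapy's selectors are built on lxml, so the XPath behaves the same); the markup is a trimmed, well-formed version of the question's sample, so the exact attributes differ from the real page:

```python
from lxml import etree

# Trimmed, well-formed version of the sample page from the question.
html = """\
<div id="superheroes">
  <table>
    <tr><td><h2>Superheroes in New York</h2></td></tr>
    <tr><td>
      <h2>Peter Parker</h2>
      <table>
        <tr>
          <td><img src="/img/spidey.jpg"/></td>
          <td><strong>Address:</strong> New York City<br/>
              <strong>State:</strong> New York<br/>
              <strong>Postal Code:</strong> 12345<br/>
              <strong>Telephone:</strong> 555-123-4567</td>
        </tr>
      </table>
    </td></tr>
  </table>
</div>
"""

root = etree.fromstring(html)
items = []
# Same pattern as the answer: find each <h2>, then the <table> that follows it.
for header in root.xpath('//div[@id="superheroes"]/table/tr[2]/td[1]/h2'):
    table = header.xpath('following-sibling::table')[0]
    items.append({
        'name': header.xpath('text()')[0],
        # The value is the text node right after each <strong> label.
        'address': table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].strip(),
        'state': table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].strip(),
        'postal_code': table.xpath('tr[1]/td[2]/strong[3]/following-sibling::text()')[0].strip(),
    })

print(items)
```

The key detail is following-sibling::text(): strong[1]/text() would give you the label ("Address:"), whereas the first following text sibling is the value itself.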

Upvotes: 1

Danil

Reputation: 5181

You should change the XPath expressions inside the for loop and yield each item instead of returning an array:

def parse_item(self, response):
    sel = Selector(response)
    tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
    for name, data in zip(tables.xpath('./h2/text()'), tables.xpath('./table')):
        item = Superheroes()
        item['name'] = name.extract()
        item['address'] = data.xpath('.//strong[1]/following-sibling::text()[1]').extract()
        item['state'] = data.xpath('.//strong[2]/following-sibling::text()[1]').extract()
        item['postal_code'] = data.xpath('.//strong[3]/following-sibling::text()[1]').extract()
        yield item

Upvotes: 1
