Reputation: 553
I am new to Scrapy and am trying to use it to extract the following data: "name", "address", "state", and "postal_code" from the sample HTML code below:
<div id="superheroes">
<table width="100%" border="0">
<tr>
<td valign="top">
<h2>Superheroes in New York</h2>
<hr/>
</td>
</tr>
<tr valign="top">
<td width="75%">
<h2>Peter Parker</h2>
<hr />
<table width="100%">
<tr valign="top">
<td width="13%" height="70" valign="top"><img src="/img/spidey.jpg"/></td>
<td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
<strong>State:</strong>New York<br/>
<strong>Postal Code:</strong>12345<br/>
<strong>Telephone:</strong> 555-123-4567</td>
</tr>
<tr>
<td height="18" valign="top"> </td>
<td align="right" valign="top"><a href="spiderman"><strong>Read More</strong></a></td>
</tr>
</table>
<h2>Tony Stark</h2>
<hr />
<table width="100%" border="0" cellpadding="2" cellspacing="2" valign="top">
<tr valign="top">
<td width="13%" height="70" valign="top"><img src="/img/ironman.jpg"/></td>
<td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
<strong>State:</strong> New York<br/>
<strong>Postal Code:</strong> 54321<br/>
<strong>Telephone:</strong> 555-987-6543</td>
</tr>
<tr>
<td height="18" valign="top"> </td>
<td align="right" valign="top"><a href="iron_man"><strong>Read More</strong></a></td>
</tr>
</table>
</td>
<td width="25%">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
</td>
</tr>
</table>
</div>
My superheroes.py contains the following code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from superheroes.items import Superheroes
items = []
class MySpider(CrawlSpider):
  name = "superheroes"
  allowed_domains = ["www.somedomain.com"]
  start_urls = ["http://www.somedomain.com/ny"]
  rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

def parse_item(self, response):
  sel = Selector(response)
  tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
  for table in tables:
    item = Superheroes()
    item['name'] = table.xpath('h2/text()').extract()
    item['address'] = table.xpath('/tr[1]/td[2]/strong[1]/text()').extract()
    item['state'] = table.xpath('/tr[1]/td[2]/strong[2]/text()').extract()
    item['postal_code'] = table.xpath('/tr[1]/td[2]/strong[3]/text()').extract()
    items.append(item)
  return items
And my items.py contains:
import scrapy
class Superheroes(scrapy.Item):
  name = scrapy.Field()
  address = scrapy.Field()
  state = scrapy.Field()
  postal_code = scrapy.Field()
When I run "scrapy runspider superheroes.py -o super_db -t csv", the output file is empty.
Could anyone point out the errors in my code above?
Thanks so much for your help!
Upvotes: 0
Views: 2233
Reputation: 5092
There were two issues with your code. First, your parse_item method did not seem to be indented (at least, that's how it looks in your question), and thus would not be included in the MySpider class. Every line in superheroes.py starting at def parse_item(self, response): needs to have two spaces in front of it.
The second problem is that rules states that parse_item should be called for every link found in the page (i.e., every link matched by the SgmlLinkExtractor). You can see in the output that it tries to fetch /iron_man and /spiderman -- those are the pages whose responses would be passed to parse_item.
To process the start_urls with your parse_item, you need to rename it to parse_start_url. If there is only one page to process, you can even get rid of the rules entirely! (See the documentation about parse_start_url.)
Your updated class looks like this (note that I also moved items inside of the method; there is no need to declare it as a global):
class MySpider(CrawlSpider):
  name = "superheroes"
  allowed_domains = ["localhost"]
  start_urls = ["http://localhost:8000/page.html"]

  # indentation!
  def parse_start_url(self, response):
    sel = Selector(response)
    headers = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]/h2')
    for header in headers:
      item = Superheroes()
      item['name'] = header.xpath('text()')[0].extract()
      table = header.xpath('following-sibling::table')
      item['address'] = table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].extract().strip()
      item['state'] = table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].extract().strip()
      item['postal_code'] = table.xpath('tr[1]/td[2]/strong[3]/following-sibling::text()')[0].extract().strip()
      yield item
Edit: Thanks to @Daniil Mashkin for pointing out that the original xpath expressions did not work. I corrected them in the code above. Cheers!
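The following-sibling::text() expressions can be sanity-checked outside Scrapy. Here is a minimal sketch using lxml (an assumption -- the spider itself uses Scrapy's Selector, which accepts the same XPath) run against a trimmed copy of the sample HTML from the question:

```python
# Sanity-check the following-sibling::text() XPath outside Scrapy.
# Assumes lxml is installed; SAMPLE is a trimmed copy of the question's HTML.
from lxml import html

SAMPLE = """
<div id="superheroes">
<table>
<tr><td><h2>Superheroes in New York</h2></td></tr>
<tr><td>
<h2>Peter Parker</h2>
<table>
<tr><td><img src="/img/spidey.jpg"/></td>
<td><strong>Address:</strong> New York City<br/>
<strong>State:</strong>New York<br/>
<strong>Postal Code:</strong>12345</td></tr>
</table>
</td></tr>
</table>
</div>
"""

doc = html.fromstring(SAMPLE)
header = doc.xpath('//div[@id="superheroes"]/table/tr[2]/td[1]/h2')[0]
table = header.xpath('following-sibling::table')[0]
# The value we want is the text node immediately after each <strong> label,
# which is exactly what following-sibling::text() selects.
address = table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].strip()
state = table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].strip()
print(address, state)
```

Selecting strong[1]/text() instead would return the label "Address:" rather than the value, which is why the sibling text node is the right target.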
Upvotes: 1
Reputation: 5181
You should change the XPath expressions in your for loop and yield each item instead of returning an array:
def parse_item(self, response):
  sel = Selector(response)
  tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
  for name, data in zip(tables.xpath('./h2/text()'), tables.xpath('./table')):
    item = Superheroes()
    item['name'] = name.extract()
    # take the text node after each <strong> label, not the label text itself
    item['address'] = data.xpath('.//strong[1]/following-sibling::text()[1]').extract()
    item['state'] = data.xpath('.//strong[2]/following-sibling::text()[1]').extract()
    item['postal_code'] = data.xpath('.//strong[3]/following-sibling::text()[1]').extract()
    yield item
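The zip() pairing works because both node lists come back in document order, so the n-th h2 is matched with the n-th table. A minimal sketch with lxml (an assumption; Scrapy's Selector returns nodes in the same order) against a stripped-down copy of the sample HTML:

```python
# Demonstrate zip() pairing each hero's <h2> with the <table> that follows it.
# Assumes lxml is installed; SAMPLE is a stripped-down copy of the question's HTML.
from lxml import html

SAMPLE = """
<div id="superheroes">
<table>
<tr><td><h2>Superheroes in New York</h2></td></tr>
<tr><td>
<h2>Peter Parker</h2><table><tr><td>spidey</td></tr></table>
<h2>Tony Stark</h2><table><tr><td>ironman</td></tr></table>
</td></tr>
</table>
</div>
"""

cell = html.fromstring(SAMPLE).xpath(
    '//div[@id="superheroes"]/table/tr[2]/td[1]')[0]
# Both ./h2 and ./table return their matches in document order, so zip()
# lines up each header with the table directly below it.
pairs = [(h2.text, tbl.xpath('.//td/text()')[0])
         for h2, tbl in zip(cell.xpath('./h2'), cell.xpath('./table'))]
print(pairs)  # [('Peter Parker', 'spidey'), ('Tony Stark', 'ironman')]
```

If a hero were ever missing its table, the lists would fall out of step silently, which is why the first answer's following-sibling::table approach is the more robust of the two.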
Upvotes: 1