Reputation: 1432
I'm using Scrapy to scrape a website, extract data from various pages, and store the scraped data in a list. I am trying to scrape a name, URL, and location from the first page, then follow that scraped URL to its page and scrape the DOCTYPE there. See below for the code I have crafted; I have been following this documentation closely, but I am getting strange results.
If I don't try to use a second method within my ExampleSpider, I get back a list of 3000+ results, which is exactly what I want...minus that essential second piece of data. When I try to include this method, all I get back are the starting URLs (i.e. http://www.example.com, http://www.example1.com, etc.).
Any suggestions as to what I am doing wrong here?
import scrapy
from scrapy.contrib.loader import ItemLoader
from example.items import Item
from scrapy.http import Request
import re
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["list of about 15 different websites (ie. 'http://example.com', 'http://example1.com')"]

    # # Working method to scrape all of the data I need (except DOCTYPE from second page)
    # def parse(self, response):
    #     for sel in response.xpath('//table[@class="rightLinks"]/tr'):
    #         item = Item()
    #         item['company_name'] = sel.xpath('td[1]/a/text()').extract()
    #         item['website'] = sel.xpath('td[1]/a/@href').extract()
    #         item['location'] = sel.xpath('td[2]/text()').extract()
    #         yield item

    def parse(self, response):
        for sel in response.xpath('//table[@class="rightLinks"]/tr'):
            item = PharmaItem()
            item['company_name'] = sel.xpath('td[1]/a/text()').extract()
            item['website'] = sel.xpath('td[1]/a/@href').extract()
            item['location'] = sel.xpath('td[2]/text()').extract()
            # Setting up a new request to pass to the get_DT method,
            # also passing along the 'item' meta data.
            # Converting website from list item to string:
            website = ''.join(item['website'])
            request = scrapy.Request(website, callback=self.get_DT)
            request.meta['item'] = item
            return request

    # Get DOCTYPE from each page
    def get_DT(self, response):
        item = response.meta['item']
        item['website'] = response.url
        dtype = re.search("<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE)
        item['DOCTYPE'] = dtype
        yield item
UPDATE: These are the two final functions that worked. I took offwhitelotus's suggestion and tried it, but it didn't work, as it kept returning the parent page's DOCTYPE rather than the traversed page's DOCTYPE.
def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tr'):
        item = PharmaItem()
        item['company_name'] = sel.xpath('td[1]/a/text()').extract()
        website = sel.xpath('td[1]/a/@href').extract()[0]
        item['location'] = sel.xpath('td[2]/text()').extract()
        # Setting up a new request to pass to the get_DT method,
        # also passing along the 'item' meta data
        request = scrapy.Request(website, callback=self.get_DT)
        request.meta['item'] = item
        yield request

# Takes in the websites crawled by the previous method and finds their DOCTYPEs
def get_DT(self, response):
    item = response.meta['item']
    item['DOCTYPE'] = response.selector._root.getroottree().docinfo.doctype
    item['website'] = response.url
    yield item
Upvotes: 3
Views: 4453
Reputation: 1079
I agree with Lawrence about fixing that line.
Also, I'm not sure why you have that callback function. You can get the doctype easily with:

    import re
    import requests

    html = requests.get('http://something.com').text
    dtype = re.search(r"<!\s*doctype\s*(.*?)>", html, re.IGNORECASE)

I've never used scrapy.Request, so I just used good old requests here.
Upvotes: 1
Reputation: 8614
You are running a loop, but calling return inside it, which stops the loop after the first link instead of going through all of them. Use yield instead in the parse() function.
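The difference is easy to see outside Scrapy: a return hands back the first value and exits, while yield turns the function into a generator that produces every value. A minimal sketch with a made-up list of links:

    def parse_with_return(links):
        for link in links:
            return link  # exits on the first iteration

    def parse_with_yield(links):
        for link in links:
            yield link  # produces every item, one per iteration

    links = ["http://a.example", "http://b.example", "http://c.example"]
    print(parse_with_return(links))       # -> "http://a.example"
    print(list(parse_with_yield(links)))  # -> all three links

This is why the original spider only ever scheduled one follow-up request per start URL.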
Other than that, I don't get this part:
#converting website from list item to string
website = ''.join(item['website'])
That just seems wrong. If there are multiple URLs there, this will concatenate them into one very bad, invalid URL. And if there is just one of them, then you should collect it by taking the first and only list element (note the [0] at the end):

    item['website'] = sel.xpath('td[1]/a/@href').extract()[0]
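To make the failure mode concrete, here is what ''.join does when the XPath happens to match more than one href (the list below is a hypothetical extract() result):

    hrefs = ["http://example.com", "http://example1.com"]

    joined = ''.join(hrefs)
    print(joined)    # -> "http://example.comhttp://example1.com", not a valid URL

    first = hrefs[0]
    print(first)     # -> "http://example.com"

With a single-element list the join happens to work, which is why the bug is easy to miss.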
Also, I'm not sure why you are setting item['website'] in the parse() function, since you are going to override it in the get_DT function anyway. You should just use a temporary variable, like so:
for sel in response.xpath('//table[@class="rightLinks"]/tr'):
    item = PharmaItem()
    item['company_name'] = sel.xpath('td[1]/a/text()').extract()
    item['location'] = sel.xpath('td[2]/text()').extract()
    website = sel.xpath('td[1]/a/@href').extract()[0]  # first and only href, as a string
    request = scrapy.Request(website, callback=self.get_DT)
    request.meta['item'] = item
    yield request
Upvotes: 1