Joey Orlando

Reputation: 1432

Scrapy - Parsing Data from Multiple Pages

I'm using Scrapy to scrape a website, extract data from various pages, and store the scraped data in a list. From the first page I scrape a name, URL, and location; I then follow that scraped URL and scrape the DOCTYPE of the page it points to. The code I have crafted is below. I have been following the documentation closely, but I am getting strange results.

If I don't try to use a second method within my ExampleSpider, I get back a list of 3000+ results, which is exactly what I want, minus that essential second piece of data. When I try to include this method, all I get back are the starting URLs (i.e. http://www.example.com, http://www.example1.com, etc.).

Any suggestions as to what I am doing wrong here?

import scrapy
from scrapy.contrib.loader import ItemLoader
from example.items import PharmaItem
from scrapy.http import Request
import re

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["list of about 15 different websites (ie. 'http://example.com', 'http://example1.com')"]

    # #Working method to scrape all of the data I need (except DOCTYPE from second page)
    # def parse(self, response):
    #   for sel in response.xpath('//table[@class="rightLinks"]/tr'):
    #     item = Item()
    #     item['company_name'] = sel.xpath('td[1]/a/text()').extract()
    #     item['website'] = sel.xpath('td[1]/a/@href').extract()
    #     item['location'] = sel.xpath('td[2]/text()').extract()
    #     yield item


    def parse(self, response):
      for sel in response.xpath('//table[@class="rightLinks"]/tr'):
        item = PharmaItem()
        item['company_name'] = sel.xpath('td[1]/a/text()').extract()
        item['website'] = sel.xpath('td[1]/a/@href').extract()
        item['location'] = sel.xpath('td[2]/text()').extract()
        #Setting up a new request to pass to the get_DT method, also passing along the 'item' class meta data

        #converting website from list item to string
        website = ''.join(item['website'])
        request = scrapy.Request(website, callback=self.get_DT)
        request.meta['item'] = item

        return request

    #Get DOCTYPE from each page
    def get_DT(self, response):
        item = response.meta['item']
        item['website'] = response.url
        dtype = re.search(r"<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE)
        item['DOCTYPE'] = dtype

        yield item

UPDATE: These are the two final functions that worked. I tried offwhitelotus's suggestion, but it didn't work as it kept returning the parent page's DOCTYPE rather than that of the traversed page.

  def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tr'):
      item = PharmaItem()
      item['company_name'] = sel.xpath('td[1]/a/text()').extract()
      website = sel.xpath('td[1]/a/@href').extract()[0]
      item['location'] = sel.xpath('td[2]/text()').extract()

      # Setting up a new request to pass to the get_DT method, also passing along the 'item' class meta data
      request = scrapy.Request(website, callback=self.get_DT)
      request.meta['item'] = item
      yield request

  #Takes in the websites that were crawled from previous method and finds DOCTYPES
  def get_DT(self, response):
    item = response.meta['item']
    item['DOCTYPE'] = response.selector._root.getroottree().docinfo.doctype
    item['website'] = response.url

    yield item
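As an aside, `response.selector._root` is a private lxml handle that may change between Scrapy versions; the regex from the original get_DT does the same job on the followed page's body without touching private attributes. A minimal, Scrapy-free sketch (the extract_doctype helper name is my own):

```python
import re

# Same pattern as in the original get_DT, as a raw string
DOCTYPE_RE = re.compile(r"<!\s*doctype\s*(.*?)>", re.IGNORECASE)

def extract_doctype(body):
    # Returns the doctype content (e.g. 'html'), or None if the page has no doctype
    match = DOCTYPE_RE.search(body)
    return match.group(1).strip() if match else None

print(extract_doctype('<!DOCTYPE html><html><body>hi</body></html>'))  # html
```

Inside get_DT this would be called on `response.body` (decoded to text first on Python 3).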

Upvotes: 3

Views: 4453

Answers (2)

offwhitelotus

Reputation: 1079

I agree with Lawrence about fixing that line.

Also, I'm not sure why you have that callback function. You can get the doctype easily with:

import re
import requests

html = requests.get('http://something.com').text
dtype = re.search(r"<!\s*doctype\s*(.*?)>", html, re.IGNORECASE)

I've never used scrapy.Request, so I just used good old requests here.

Upvotes: 1

bosnjak

Reputation: 8614

You are running a loop but calling return inside it, which exits the function on the first link instead of going through all of them. Use yield instead in the parse() function.
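The difference is easy to see in a Scrapy-free sketch (the function names here are made up for illustration):

```python
links = ['http://a.example', 'http://b.example', 'http://c.example']

def with_return(links):
    # return exits the function on the first iteration,
    # so only the first link ever comes back
    for link in links:
        return link

def with_yield(links):
    # yield turns the function into a generator
    # that produces every link in turn
    for link in links:
        yield link

print(with_return(links))       # only the first link
print(list(with_yield(links)))  # all three links
```

Scrapy iterates over whatever parse() produces, so a generator that yields a Request per row is exactly what it expects.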

Other than that, I don't get this part:

#converting website from list item to string
website = ''.join(item['website'])

That just seems wrong. If there are multiple URLs in there, joining them produces a single invalid URL. And if there is just one, you should collect it by taking the first (and only) list element (note the [0] at the end):

item['website'] = sel.xpath('td[1]/a/@href').extract()[0]

Also, I'm not sure why you are setting the item['website'] in the parse() function, since you are going to override it in the get_DT function anyway. You should just use a temporary variable, like so:

for sel in response.xpath('//table[@class="rightLinks"]/tr'):
    item = PharmaItem()
    item['company_name'] = sel.xpath('td[1]/a/text()').extract()
    item['location'] = sel.xpath('td[2]/text()').extract()
    website = sel.xpath('td[1]/a/@href').extract()[0]
    request = scrapy.Request(website, callback=self.get_DT)
    request.meta['item'] = item
    yield request

Upvotes: 1
