DV Hughes

Reputation: 305

Why is Scrapy not crawling or parsing?

I am attempting to scrape the Library of Congress/Thomas website. This Python script is intended to access a sample of 40 bills from their site (identifiers 1-40 in the URLs). I want to parse the body of each piece of legislation, search the body/content, extract links to potential multiple versions, & follow them.

Once on the version page(s) I want to parse the body of each version, search the body/content, extract links to potential sections, & follow them.

Once on the section page(s) I want to parse the body of each section of a bill.
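
Presumably that third hop needs its own rule pointing at my parse_sections callback (nothing in the code below routes there yet); a rough sketch, where the restrict_xpaths value is just a placeholder since I haven't pinned down the section-link markup:

# Hypothetical third rule; the XPath below is a placeholder, not the real Thomas markup:
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="content"]//a',)),
     callback='parse_sections', follow=False)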

I believe there is some issue with the Rules/LinkExtractor segment of my code. The Python code executes and crawls the start URLs, but does not parse them or perform any of the subsequent tasks.

Three issues:

  1. Some bills do not have multiple versions (and ergo no links in the body portion of the page).
  2. Some bills do not have linked sections because they are so short, while some are nothing but links to sections.
  3. Some section links do not contain only section-specific content; most of their content is a redundant inclusion of prior or subsequent sections.

My question, again: why is Scrapy not crawling or parsing?

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class BillItem(Item):
    title = Field()
    body = Field()

class VersionItem(Item):
    title = Field()
    body = Field()

class SectionItem(Item):
    body = Field()

class Lrn2CrawlSpider(CrawlSpider):
    name = "lrn2crawl"
    allowed_domains = ["thomas.loc.gov"]
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(000001,00040,00001) ### Sample of 40 bills; Total range of bills is 1-5767

    ]

rules = (
    # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
    # and follow them (since no callback means follow=True by default).
    # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
    Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

    # Extract links in the body of a bill-version & follow them.
    # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
)

def parse_bills(self, response):
    hxs = HtmlXPathSelector(response)
    bills = hxs.select('//div[@id="content"]')
    scraped_bills = []
    for bill in bills:
        scraped_bill = BillItem() ### Bill object defined previously
        scraped_bill['title'] = bill.select('p/text()').extract()
        scraped_bill['body'] = response.body
        scraped_bills.append(scraped_bill)
    return scraped_bills

def parse_versions(self, response):
    hxs = HtmlXPathSelector(response)
    versions = hxs.select('//div[@id="content"]')
    scraped_versions = []
    for version in versions:
        scraped_version = VersionItem() ### Version object defined previously
        scraped_version['title'] = version.select('center/b/text()').extract()
        scraped_version['body'] = response.body
        scraped_versions.append(scraped_version)
    return scraped_versions

def parse_sections(self, response):
    hxs = HtmlXPathSelector(response)
    sections = hxs.select('//div[@id="content"]')
    scraped_sections = []
    for section in sections:
        scraped_section = SectionItem() ### Section object defined previously
        scraped_section['body'] = response.body
        scraped_sections.append(scraped_section)
    return scraped_sections

spider = Lrn2CrawlSpider()

Upvotes: 1

Views: 1884

Answers (2)

Damián Castro

Reputation: 337

Just for the record, the problem with your script is that the variable rules is not inside the scope of Lrn2CrawlSpider, because it doesn't share the class's indentation. When alecxe fixed the indentation, rules became an attribute of the class. The inherited method __init__() then reads that attribute, compiles the rules, and enforces them:

def __init__(self, *a, **kw):
    super(CrawlSpider, self).__init__(*a, **kw)
    self._compile_rules()
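
A minimal illustration of the scoping point, using stripped-down hypothetical spiders rather than the original code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Module-level: invisible to the spider classes below.
rules = (Rule(SgmlLinkExtractor(allow=(r'/query/',)), follow=True),)

class BrokenSpider(CrawlSpider):
    name = "broken"
    # No `rules` attribute in the class body, so __init__() finds
    # CrawlSpider's empty default (rules = ()), compiles nothing, and
    # only the start_urls responses are ever fetched.

class FixedSpider(CrawlSpider):
    name = "fixed"
    # Indented into the class body, `rules` is picked up and compiled.
    rules = (Rule(SgmlLinkExtractor(allow=(r'/query/',)), follow=True),)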

Removing the last line (spider = Lrn2CrawlSpider()) had nothing to do with the fix.

Upvotes: 1

alecxe

Reputation: 473863

I've just fixed the indentation, removed the spider = Lrn2CrawlSpider() line at the end of the script, and ran the spider via scrapy runspider lrn2crawl.py. It scrapes, follows links, and returns items - your rules work.

Here's what I'm running:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class BillItem(Item):
    title = Field()
    body = Field()

class VersionItem(Item):
    title = Field()
    body = Field()

class SectionItem(Item):
    body = Field()

class Lrn2CrawlSpider(CrawlSpider):
    name = "lrn2crawl"
    allowed_domains = ["thomas.loc.gov"]
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(000001,00040,00001) ### Sample of 40 bills; Total range of bills is 1-5767

    ]

    rules = (
        # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
        # and follow them (since no callback means follow=True by default).
        # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
        Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

        # Extract links in the body of a bill-version & follow them.
        # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
    )

    def parse_bills(self, response):
        hxs = HtmlXPathSelector(response)
        bills = hxs.select('//div[@id="content"]')
        scraped_bills = []
        for bill in bills:
            scraped_bill = BillItem() ### Bill object defined previously
            scraped_bill['title'] = bill.select('p/text()').extract()
            scraped_bill['body'] = response.body
            scraped_bills.append(scraped_bill)
        return scraped_bills

    def parse_versions(self, response):
        hxs = HtmlXPathSelector(response)
        versions = hxs.select('//div[@id="content"]')
        scraped_versions = []
        for version in versions:
            scraped_version = VersionItem() ### Version object defined previously
            scraped_version['title'] = version.select('center/b/text()').extract()
            scraped_version['body'] = response.body
            scraped_versions.append(scraped_version)
        return scraped_versions

    def parse_sections(self, response):
        hxs = HtmlXPathSelector(response)
        sections = hxs.select('//div[@id="content"]')
        scraped_sections = []
        for section in sections:
            scraped_section = SectionItem() ### Section object defined previously
            scraped_section['body'] = response.body
            scraped_sections.append(scraped_section)
        return scraped_sections
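
To check the scraped items on disk rather than just in the console log, the feed-export flags work with runspider as well; on contrib-era Scrapy releases like the one used here, that would be something like:

scrapy runspider lrn2crawl.py -o items.json -t json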

Hope that helps.

Upvotes: 0
