Tony

Reputation: 394

Scrapy: crawling to deeper levels using XPath

I am trying to scrape information that is accessible from this search address: http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd+and+state%3DMN&first=2008&last=2011

I want Scrapy to follow the links listed there (let's start with a single link; I will work out how to generate the row numbers afterwards). The XPath for one such link is:

/html/body/div/table/tbody/tr[29]/td[3]/a[2]

After following that link, I want Scrapy to crawl to the XML files available on the next page. The XPath for that link is the same on every page:

//*[@id="formDiv"]/div/table/tbody/tr[3]/td[3]/a

Finally, I want Scrapy to scrape some data from the XML page.

I launch Scrapy with: scrapy crawl DFORM -o items.json -t json. All I get in my JSON file is: "[".

items.py

from scrapy.item import Item, Field

class SecformD(Item):
    company = Field()
    filling_date = Field()
    types_of_securities = Field()
    offering_amount = Field()
    sold_amount = Field()
    remaining = Field()
    investors_accredited = Field()
    investors_non_accredited = Field()

*Formds_Crawler.py*

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from formds.items import SecformD


class SecDform(CrawlSpider):
    name = "DFORM"
    allowed_domains = ["sec.gov"]
    start_urls = [
        "http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd+and+state%3DMN&first=2008&last=2011"
        ]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/div/table/tbody/tr[27]/td[3]/a[2]')), callback='parse_formd', follow= True),)


    def parse_formd(self, response):
        xxs = XmlXPathSelector(response)
        hxs = HtmlXPathSelector(response)

        sites = xxs.select('//*[@id="formDiv"]/div/table/tbody/tr[3]/td[3]/a')
        items = []
        for site in sites:
            item = SecformD()
            item['company'] = site.select('//*[@id="collapsible1"]/div[1]/div[2]/div[2]/span[2]/text()').extract()
            item['filling_date'] = site.select('//*[@id="collapsible40"]/div[1]/div[2]/div[5]/span[2]/text()').extract()
            item['types_of_securities'] = site.select('//*[@id="collapsible37"]/div[1]/div[2]/div[1]/span[2]/text()').extract()
            item['offering_amount'] = site.select('//*[@id="collapsible39"]/div[1]/div[2]/div[1]/span[2]/text()').extract()
            item['sold_amount'] = site.select('//*[@id="collapsible39"]/div[1]/div[2]/div[2]/span[2]/text()').extract()
            item['remaining'] = site.select('//*[@id="collapsible39"]/div[1]/div[2]/div[3]/span[2]/text()').extract()
            item['investors_accredited'] = site.select('//*[@id="collapsible40"]/div[1]/div[2]/div[2]/span[2]/text()').extract()
            item['investors_non_accredited'] = site.select('//*[@id="collapsible40"]/div[1]/div[2]/div[1]/span[2]/text()').extract()

            items.append(item)
        return items


***Here is the log:***
USComputer:formds psykoboy$ scrapy crawl DFORM -o items.json -t json
2013-07-18 21:18:37-0500 [scrapy] INFO: Scrapy 0.16.4 started (bot: formds)
2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-18 21:18:38-0500 [DFORM] INFO: Spider opened
2013-07-18 21:18:38-0500 [DFORM] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-18 21:18:38-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-18 21:18:38-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-18 21:18:42-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=81&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:43-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd+and+state%3DMN&first=2008&last=2011> (referer: None)
2013-07-18 21:18:44-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=161&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:45-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=241&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:45-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=321&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:46-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=401&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:46-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=481&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:47-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=561&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:47-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=641&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:48-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=721&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:48-0500 [DFORM] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd%20and%20state%3DMN&start=721&count=80&first=2008&last=2011> (referer: None)
2013-07-18 21:18:48-0500 [DFORM] INFO: Closing spider (finished)
2013-07-18 21:18:48-0500 [DFORM] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 3419,
     'downloader/request_count': 11,
     'downloader/request_method_count/GET': 11,
     'downloader/response_bytes': 68182,
     'downloader/response_count': 11,
     'downloader/response_status_count/200': 11,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 19, 2, 18, 48, 189346),
     'log_count/DEBUG': 17,
     'log_count/INFO': 4,
     'response_received_count': 11,
     'scheduler/dequeued': 11,
     'scheduler/dequeued/memory': 11,
     'scheduler/enqueued': 11,
     'scheduler/enqueued/memory': 11,
     'start_time': datetime.datetime(2013, 7, 19, 2, 18, 38, 701571)}
2013-07-18 21:18:48-0500 [DFORM] INFO: Spider closed (finished)

Upvotes: 0

Views: 1840

Answers (1)

paul trmbrth

Reputation: 20748

If you remove the tbody/ step, your first two XPath expressions work in the Scrapy shell:

paul@wheezy:~$ scrapy shell 'http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3Dd+and+state%3DMN&first=2008&last=2011'
...
In [1]: hxs.select('/html/body/div/table/tr[27]/td[3]/a[2]/@href').extract()
Out[1]: [u'/Archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm']
In [2]: next = hxs.select('/html/body/div/table/tr[27]/td[3]/a[2]/@href').extract()[0]
In [3]: import urlparse
In [4]: next_url = urlparse.urljoin(response.url, next)
In [5]: next_url
Out[5]: u'http://www.sec.gov/Archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm'
In [6]: fetch(next_url)
2013-07-19 09:42:58+0200 [default] DEBUG: Crawled (200) <GET http://www.sec.gov/Archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm> (referer: None)
...
In [8]: hxs.select('//*[@id="formDiv"]/div/table/tr[3]/td[3]/a')
Out[8]: [<HtmlXPathSelector xpath='//*[@id="formDiv"]/div/table/tr[3]/td[3]/a' data=u'<a href="/Archives/edgar/data/1490747/00'>]

But the

sites = xxs.select('//*[@id="formDiv"]/div/table/tbody/tr[3]/td[3]/a')
items = []
for site in sites:
    ... extract item values

part is not what you meant.

You want to follow the links to the XML documents and parse them, so you need to tell Scrapy to fetch those pages. sites = xxs.select('//*[@id="formDiv"]/div/table/tbody/tr[3]/td[3]/a') does not do that: it only returns the a tags, it does not issue a request to fetch each document.

You would need something like:

import urlparse
from scrapy.http import Request
...
class SecDform(CrawlSpider):
    ...
    def parse_formd(self, response):
        hxs = HtmlXPathSelector(response)
        # extract the href of each document link (note: no tbody/ in the XPath)
        sites = hxs.select('//*[@id="formDiv"]/div/table/tr[3]/td[3]/a/@href').extract()
        for site in sites:
            # resolve the relative link and schedule a request for the XML document
            yield Request(url=urlparse.urljoin(response.url, site),
                          callback=self.parse_xml_document)

And define a new parse_xml_document() callback method that contains your Item extraction logic for these XML documents.

Your XPath expressions for the item fields come from the Chrome or Firebug inspector, right? ("collapsible1" etc.). You need to work on the XML structure directly, not on the HTML the browser builds to display it. I only did the "company" field to illustrate:

    def parse_xml_document(self, response):
        xxs = XmlXPathSelector(response)
        item = SecformD()
        # relative XPath, evaluated from the root of the XML document
        item["company"] = xxs.select('./primaryIssuer/entityName/text()').extract()[0]
        ...
        return item
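For the remaining fields, here is a minimal sketch of how the rest of the callback might look. The element paths below (offeringData/offeringSalesAmounts, offeringData/investors, etc.) are my assumptions about the Form D XML layout, not verified against these filings; check each one in the shell before relying on it:

    def parse_xml_document(self, response):
        xxs = XmlXPathSelector(response)
        item = SecformD()
        item["company"] = xxs.select('./primaryIssuer/entityName/text()').extract()[0]
        # ASSUMED element names below -- verify them in scrapy shell against a real filing
        item["offering_amount"] = xxs.select(
            './offeringData/offeringSalesAmounts/totalOfferingAmount/text()').extract()
        item["sold_amount"] = xxs.select(
            './offeringData/offeringSalesAmounts/totalAmountSold/text()').extract()
        item["remaining"] = xxs.select(
            './offeringData/offeringSalesAmounts/totalRemaining/text()').extract()
        item["investors_non_accredited"] = xxs.select(
            './offeringData/investors/numberNonAccreditedInvestors/text()').extract()
        return item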

A good way to work on the XPath expressions for your items is to use scrapy shell <url_of_xml_document>, as I do below for "company" (see also http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell):

paul@wheezy:~$ scrapy shell http://www.sec.gov/Archives/edgar/data/1490747/000149074710000001/primary_doc.xml
In [6]: xxs.select('./primaryIssuer')
Out[6]: [<XmlXPathSelector xpath='./primaryIssuer' data=u'<primaryIssuer>\n        <cik>0001490747<'>]

In [7]: xxs.select('./primaryIssuer/entityName')
Out[7]: [<XmlXPathSelector xpath='./primaryIssuer/entityName' data=u'<entityName>AEI CREDIT TENANT FUND 35 LP'>]

In [8]: xxs.select('./primaryIssuer/entityName/text()')
Out[8]: [<XmlXPathSelector xpath='./primaryIssuer/entityName/text()' data=u'AEI CREDIT TENANT FUND 35 LP'>]

In [9]: xxs.select('./primaryIssuer/entityName/text()').extract()
Out[9]: [u'AEI CREDIT TENANT FUND 35 LP']


Edit: I updated the gist with Rules() to follow the [NEXT] pages and the document links in all rows: https://gist.github.com/redapple/02a55aa6aaac0df2fb75
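The gist is the authoritative version; as a rough idea, rules in that direction could look something like this (the allow patterns are my guesses at the URL shapes, not the gist's exact code):

rules = (
    # follow the [NEXT] pagination links on the search result pages
    Rule(SgmlLinkExtractor(allow=(r'srch-edgar\?.*start=\d+',)), follow=True),
    # follow every filing-index link and parse it with parse_formd
    Rule(SgmlLinkExtractor(allow=(r'/Archives/edgar/data/',)), callback='parse_formd', follow=True),
)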

Upvotes: 2
