Reputation: 1
What I am trying to do is scrape company information (thisisavailable.eu.pn/company.html) and add all of the board members, with their respective data from separate pages, to the board dict.
So ideally the data I get back from the sample pages would be:
{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "[email protected]",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}
I have searched Google and SO (like here and here, and the Scrapy docs, etc.) but have not been able to find a solution for a problem exactly like this.
What I have been able to cobble together:
items.py:
import scrapy

class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()

class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
spiders/example.py:
import scrapy
from proov.items import company_item, person_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)

    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []
        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print(person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/' + person_row.xpath("a/@href").extract_first(),
                                     callback=self.parse_person)
            request.meta['Person'] = Person
        return request
        board.append(Person)
        Company['board'] = board
        return Company

    def parse_person(self, response):
        print('PERSON!!!!!!!!!!!')
        print(response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person
UPDATE: As Rafael noticed, the initial problem was that allowed_domains was wrong. I have commented it out for the time being, and now when I run it I get (added *'s to URLs due to low rep):
scrapy crawl example
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://*thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://thisisavailable.eu.pn/person2.html> (referer: http://*thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
2017-03-07 09:41:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://*thisisavailable.eu.pn/person2.html>
{'code': u'222222222', 'name': u'Kaspar K\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 936, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 3, 'downloader/response_bytes': 1476, 'downloader/response_count': 3, 'downloader/response_status_count/200': 2, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000), 'item_scraped_count': 1, 'log_count/DEBUG': 5, 'log_count/INFO': 7, 'request_depth_max': 1, 'response_received_count': 3, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)
and if run with "-o file.json", the file content is:
[ {"code": "222222222", "name": "Ralph Pike"} ]
So I've gotten a bit further, but I am still at a loss as to how to make it work.
Can somebody help me make this work?
Upvotes: 0
Views: 842
Reputation: 5240
Your problem isn't related to having multiple items, even though that will come up in the future.
Your problem is explained in the output:
2017-03-06 10:44:33 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'kidplay-wingsuit.c9users.io': <GET http://thisisavailable.eu.pn/scrapy/person2.html>
It means that the request is going to a domain outside of your allowed_domains list.
Your allowed_domains is wrong. It should be:
allowed_domains = ["thisisavailable.eu.pn"]
Note: Instead of using a different item for Person, just use board as a field in Company and assign a dict or list to it while scraping.
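For illustration, here is a minimal sketch of that approach (untested against the real pages; the CompanySpider name, the next_person helper, the list-of-dicts board structure, and the meta keys are my own naming, and it assumes the project module is called proov as in the log above). It builds one Company item, visits the person pages one at a time by passing the item along in request.meta, and only yields the item once every board member has been appended:

import scrapy
from proov.items import company_item  # assumes the project is named 'proov' as in the log

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        company = company_item()
        company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        company['board'] = []
        # Relative links to the individual person pages.
        person_urls = response.xpath("//table[@id='board']/tbody/tr/td[1]/a/@href").extract()
        yield self.next_person(company, person_urls, response)

    def next_person(self, company, person_urls, response):
        # Return a Request for the next unvisited person page, or the
        # finished item once there are no person pages left.
        if person_urls:
            request = scrapy.Request(response.urljoin(person_urls.pop(0)),
                                     callback=self.parse_person)
            request.meta['company'] = company
            request.meta['person_urls'] = person_urls
            return request
        return company

    def parse_person(self, response):
        company = response.meta['company']
        # Append this board member to the company item carried in meta.
        company['board'].append({
            'name': response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first(),
            'code': response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first(),
        })
        yield self.next_person(company, response.meta['person_urls'], response)

Chaining the requests like this (instead of returning a single request from inside the loop) keeps one Company item per company and avoids merging partial results afterwards; phone and email could be filled in parse the same way as name and code.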
Upvotes: 2