Reputation: 93
I'd like to scrape a movie forum, which has a structure like:
Page 1
    Thread 1 in Page 1
    Thread 2 in Page 1
    ...
Page 2
    Thread 1 in Page 2
    Thread 2 in Page 2
    ...
The pages and threads have very different HTML, so I have written XPath expressions to extract the information I need from each.
In the parse()
method of my spider, I used an example from the documentation to go through each page:
page_links = ['page_1', 'page_2', ...]
for page_link in page_links:
    if page_link is not None:
        page_link = response.urljoin(page_link)
        yield scrapy.Request(page_link, callback=self.parse)
So I can get the URL of every thread on every page.
I suppose the next thing I should do is to get the response of each thread and run a function to parse those responses. But since I'm new to OOP, I'm quite confused about what I should do.
I have a list thread_links
that stores the URLs of threads, and I'm trying to do something like:
thread_links = ['thread_1', 'thread_2', ...]
for thread_link in thread_links:
    yield scrapy.Request(thread_link)
but how can I pass these responses to a function like parse_thread(self, response)?
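Is the idea something like the sketch below? (The spider name, URLs, and XPath expressions here are just placeholders for illustration; I haven't been able to test this.)
import scrapy


class ForumSpider(scrapy.Spider):
    name = 'forum'  # placeholder name
    start_urls = ['https://example.com/forum-1.html']  # placeholder URL

    def parse(self, response):
        # collect thread URLs from the current page and request each one,
        # telling Scrapy to send the response to parse_thread
        for thread_link in response.xpath("//a[@class='thread']/@href").getall():
            yield scrapy.Request(response.urljoin(thread_link),
                                 callback=self.parse_thread)

    def parse_thread(self, response):
        # each thread's response ends up here
        yield {'title': response.xpath('//title/text()').get()}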
Update: Here is my code:
# -*- coding: utf-8 -*-
import scrapy


class ShtSpider(scrapy.Spider):
    name = 'sht'
    allowed_domains = ['AAABBBCCC.com']
    start_urls = [
        'https://AAABBBCCC/forum-103-1.html',
    ]
    thread_links = []

    def parse(self, response):
        temp = response.selector.xpath("//div[@class='pg']/a[@class='last']/@href").get()
        total_num_pages = int(temp.split('.')[0].split('-')[-1])
        for page_i in range(total_num_pages):
            page_link = temp.split('.')[0].rsplit('-', 1)[0] + '-' + str(page_i) + '.html'
            if page_link is not None:
                page_link = response.urljoin(page_link)
                print(page_link)
                yield scrapy.Request(page_link, callback=self.parse)

        self.thread_links.extend(response.selector.xpath(
            "//tbody[contains(@id,'normalthread')]//td[@class='icn']//a/@href").getall())
        for thread_link in self.thread_links:
            thread_link = response.urljoin(thread_link)
            print(thread_link)
            yield scrapy.Request(url=thread_link, callback=self.parse_thread)

    def parse_thread(self, response):
        def extract_thread_data(xpath_expression):
            return response.selector.xpath(xpath_expression).getall()

        yield {
            'movie_number_and_title': extract_thread_data("//span[@id='thread_subject']/text()"),
            'movie_pics_links': extract_thread_data("//td[@class='t_f']//img/@file"),
            'magnet_link': extract_thread_data("//div[@class='blockcode']/div//li/text()"),
            'torrent_link': extract_thread_data("//p[@class='attnm']/a/@href"),
            'torrent_name': extract_thread_data("//p[@class='attnm']/a/text()"),
        }
I'm using print() to check page_link and thread_link; they seem to be working well, and the URLs of all pages and threads show up correctly, but the program stopped after crawling only one page. Here is the information from the console:
2020-07-18 10:54:30 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2020-07-18 10:54:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-18 10:54:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 690,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 17304,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 18, 2, 54, 30, 777513),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'memusage/max': 53985280,
'memusage/startup': 48422912,
'offsite/domains': 1,
'offsite/filtered': 919087,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 7, 18, 2, 52, 54, 509604)}
2020-07-18 10:54:30 [scrapy.core.engine] INFO: Spider closed (finished)
Update: Here is the example from the documentation:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)


def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
and if I understand it correctly, it will log that it has visited the URL http://www.example.com/some_page.html.
Here is my spider. I have just created a project named SMZDM and generated a spider using scrapy genspider smzdm https://www.smzdm.com:
# -*- coding: utf-8 -*-
import scrapy


class SmzdmSpider(scrapy.Spider):
    name = 'smzdm'
    allowed_domains = ['https://www.smzdm.com']
    start_urls = ['https://www.smzdm.com/fenlei/diannaozhengji/']

    def parse(self, response):
        return scrapy.Request("https://www.smzdm.com/fenlei/diannaozhengji/",
                              callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info("Visited %s", response.url)
        print(f'Crawled {response.url}')
I have hardcoded https://www.smzdm.com/fenlei/diannaozhengji/
in the parse method and just want to get it working.
But when I run it with scrapy crawl smzdm, nothing shows up in the terminal. It seems the parse_page method was never executed.
(Crawler) zheng@Macbook_Pro spiders % scrapy crawl smzdm
2020-07-29 17:10:03 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: SMZDM)
2020-07-29 17:10:03 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Jan 8 2020, 13:42:34) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Darwin-19.6.0-x86_64-i386-64bit
2020-07-29 17:10:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'SMZDM', 'NEWSPIDER_MODULE': 'SMZDM.spiders', 'SPIDER_MODULES': ['SMZDM.spiders']}
2020-07-29 17:10:03 [scrapy.extensions.telnet] INFO: Telnet Password: e3b9631aa810732d
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Spider opened
2020-07-29 17:10:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-29 17:10:03 [py.warnings] WARNING: /Applications/anaconda3/envs/Crawler/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.smzdm.com in allowed_domains.
warnings.warn(message, URLWarning)
2020-07-29 17:10:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-29 17:10:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.smzdm.com/fenlei/diannaozhengji/> (referer: None)
2020-07-29 17:10:03 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.smzdm.com': <GET https://www.smzdm.com/fenlei/diannaozhengji/>
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-29 17:10:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 321,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 40270,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 29, 9, 10, 3, 836061),
'log_count/DEBUG': 2,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'memusage/max': 48422912,
'memusage/startup': 48422912,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 29, 9, 10, 3, 381441)}
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Spider closed (finished)
Upvotes: 1
Views: 544
Reputation: 2335
A full code example would help me direct you towards the best way to achieve what you want, but it sounds like you're on the right track.
I think you just need to add a callback to a parse_thread function:
thread_links = ['thread1', 'thread2']

def parse(self, response):
    for thread_link in self.thread_links:
        yield scrapy.Request(url=thread_link, callback=self.parse_thread)

def parse_thread(self, response):
    print(response.text)
Here we're taking the links from the thread_links list. NOTE you have to use self.thread_links; that's because you're defining the thread_links list OUTSIDE the function. It's what's called a class variable and needs to be accessed inside the function as self.VARIABLE.
We then add a callback to parse_thread; again, note how we're using self.parse_thread here. Scrapy makes a request and delivers the response to the parse_thread function. Here I've just printed that response out.
Since you've provided some code, and assuming you've already checked that the page and thread links are being printed correctly, here's where I think you may be going wrong.
def parse_thread(self, response):
    def extract_thread_data(xpath_expression):
        return response.selector.xpath(xpath_expression).getall()
Change this to
def parse_thread(self, response):
    yield response.xpath(xpath_expression).getall()
I'm not sure because I can't test the code, but a nested function is probably going to cause Scrapy a bit of trouble.
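For example, your parse_thread could be flattened into a single function like this (an untested sketch that just reuses the XPath expressions from your own code):
def parse_thread(self, response):
    # no nested helper; call .xpath().getall() directly for each field
    yield {
        'movie_number_and_title': response.xpath("//span[@id='thread_subject']/text()").getall(),
        'movie_pics_links': response.xpath("//td[@class='t_f']//img/@file").getall(),
        'magnet_link': response.xpath("//div[@class='blockcode']/div//li/text()").getall(),
        'torrent_link': response.xpath("//p[@class='attnm']/a/@href").getall(),
        'torrent_name': response.xpath("//p[@class='attnm']/a/text()").getall(),
    }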
You need to include these in your settings.py
USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36'
ROBOTSTXT_OBEY = False
The site is quite anti-scraping: its robots.txt says it doesn't want you to crawl it. To work around this we set ROBOTSTXT_OBEY = False.
In addition, you haven't defined a user agent for the HTTP requests Scrapy sends; it could be any user agent, and I've given an example of one that worked for me. Without it, the site detects that the request isn't coming from a browser, and Scrapy never gets to scrape the URL.
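If you'd rather not edit the project-wide settings.py, the same two settings can also be set on the spider itself via custom_settings (a sketch using the same values as above):
import scrapy


class ShtSpider(scrapy.Spider):
    name = 'sht'
    # per-spider overrides, equivalent to putting these in settings.py
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/84.0.4147.89 Mobile Safari/537.36'),
    }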
Upvotes: 2
Reputation: 93
I have finished my program now and I'd like to summarize two useful tips:
1. Try commenting out allowed_domains when debugging.
2. I'm not sure why, but using scrapy.Request has been problematic for me; when following links, just use response.follow (see the sketch below).
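For example, a rough sketch of how the parse method can queue follow-ups with response.follow (it accepts relative URLs, so no urljoin is needed; the XPaths are the ones from my spider above, with the pagination one simplified):
def parse(self, response):
    # follow pagination links back into parse
    for page_link in response.xpath("//div[@class='pg']/a/@href").getall():
        yield response.follow(page_link, callback=self.parse)

    # follow each thread link into parse_thread
    thread_links = response.xpath(
        "//tbody[contains(@id,'normalthread')]//td[@class='icn']//a/@href").getall()
    for thread_link in thread_links:
        yield response.follow(thread_link, callback=self.parse_thread)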
Upvotes: 0