Reputation: 101
I want to Scrape Comments from https://m.youtube.com
When I tried to scrape https://m.youtube.com, it first redirected me to https://www.youtube.com. I've configured my spider to not obey robots.txt, disabled cookies, and set meta={'dont_redirect': True}. Now it no longer redirects me to https://www.youtube.com, but I get the response "Ignoring response <303 https://m.youtube.com/view_comment?v=xHkL9PU7o9k&gl=US&hl=en&client=mv-google>: HTTP status code is not handled or not allowed". How can I solve this?
My spider code is below:
import scrapy


class CommentsSpider(scrapy.Spider):
    name = 'comments'
    allowed_domains = ['m.youtube.com']
    start_urls = [
        'https://m.youtube.com/view_comment?v=xHkL9PU7o9k&gl=US&hl=en&client=mvgoogle'
    ]

    def start_requests(self):
        # Ask the redirect middleware not to follow 3xx responses
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'dont_redirect': True})

    def parse(self, response):
        x = response.xpath('/html/body/div[4]/div[2]/text()').extract()
        y = response.xpath('/html/body/div[4]/div[3]/div[2]/text()').extract()
        yield {'Comments': (x, y)}
Output:
2019-07-18 16:07:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-07-18 16:07:24 [scrapy.core.engine] DEBUG: Crawled (303) <GET https://m.youtube.com/view_comment?v=xHkL9PU7o9k&gl=US&hl=en&client=mv-google> (referer: None)
2019-07-18 16:07:24 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <303 https://m.youtube.com/view_comment?v=xHkL9PU7o9k&gl=US&hl=en&client=mv-google>: HTTP status code is not handled or not allowed
2019-07-18 16:07:24 [scrapy.core.engine] INFO: Closing spider (finished)
Upvotes: 0
Views: 1651
Reputation: 2116
I would try to use a user-agent string of a mobile browser to avoid getting redirected:
    # Inside the spider class
    USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1'
    headers = {'User-Agent': USER_AGENT}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers)
Upvotes: 1
Reputation: 977
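According to the Scrapy documentation, you can use the handle_httpstatus_list spider attribute so that responses with the listed status codes are passed to your callback instead of being ignored.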
In your case:
import scrapy


class CommentsSpider(scrapy.Spider):
    name = 'comments'
    allowed_domains = ['m.youtube.com']
    start_urls = [
        'https://m.youtube.com/view_comment?v=xHkL9PU7o9k&gl=US&hl=en&client=mvgoogle'
    ]
    # Let 303 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [303]
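To see why the 303 was being dropped, here is a rough sketch of the decision Scrapy's HTTP error filter makes, as I understand it. This is a simplified model in plain Python, not Scrapy's actual source; the function name is_response_passed_to_spider is made up for illustration.

```python
from typing import Iterable


def is_response_passed_to_spider(
    status: int,
    handle_httpstatus_list: Iterable[int] = (),
    handle_httpstatus_all: bool = False,
) -> bool:
    """Simplified model (an assumption, not Scrapy code) of the filter.

    Returns True if a response with this status would reach the callback.
    """
    # Successful (2xx) responses are always passed through
    if 200 <= status < 300:
        return True
    # Opting in to every status code bypasses the filter entirely
    if handle_httpstatus_all:
        return True
    # Otherwise only explicitly listed status codes are passed through
    return status in set(handle_httpstatus_list)


if __name__ == "__main__":
    # Without the attribute, the 303 from the question is filtered out
    print(is_response_passed_to_spider(303))
    # With handle_httpstatus_list = [303], it reaches parse()
    print(is_response_passed_to_spider(303, [303]))
```

Note that with dont_redirect set, the 303 response that reaches parse() is the redirect itself, so its body will most likely not contain the comments; the redirect target is in the Location header.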
Upvotes: 2