Reputation: 129
I'm trying to scrape the Google Play Store using Scrapy. By default I get only 50 links, although I can see 257 links in total. I tried sending custom request headers and also a FormRequest, but both methods failed. Here is the error I'm receiving:
2020-10-30 18:16:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://play.google.com/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j>: HTTP status code is not handled or not allowed
The page at https://play.google.com/store/search?q=quotes&c=apps lists 257 apps, but I get only 50 by default. The code I tried is given below. Please help.
from scrapy import Spider
from scrapy.http import Request, FormRequest
from scrapy.utils.response import open_in_browser


class PlaySpider(Spider):
    name = 'play'
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/search?q=quotes&c=apps']

    # def parse(self, response):
    #     data = {
    #         'f.req': '%5B9%2C1%2C1.25%2C%5Bnull%2C1350%2C2400%5D%2C%5Bnull%2C327%2C1344%5D%2C%5Btrue%2Ctrue%2Ctrue%2Ctrue%5D%2C%5Bfalse%2C2%2C2%5D%5D&',
    #         'at': 'AE2DSODV9YrtVLLv1YugtW097VJD%3A1604056672307&'
    #     }
    #     yield FormRequest(
    #         url='https://play.google.com/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j',
    #         formdata=data,
    #         callback=self.parse_play
    #     )
    #
    # def parse_play(self, response):
    #     open_in_browser(response)

    def parse(self, response):
        url = 'https://play.google.com/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j'
        headers = {
            'authority': 'play.google.com',
            'method': 'POST',
            'path': '/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j',
            'scheme': 'https',
            'accept': '*/*',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9,mt;q=0.8,fr;q=0.7,ru;q=0.6,bn;q=0.5,de;q=0.4',
            'content-length': '182',
            'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
            'cookie': 'SID=2we9KP-jDu8bZ3iag5AcctRssfi1KPyfUWFYpxI2W0TxFwyqOaoCBO3CvfCHuoK60oQS7w.; __Secure-3PSID=2we9KP-jDu8bZ3iag5AcctRssfi1KPyfUWFYpxI2W0TxFwyqO8JF9jd1Qit9bAylaNfesQ.; HSID=AL01amZ-pbltbWMV7; SSID=AfC24bLLuvHWYWazZ; APISID=UWuhW7qZn0Yg6zUk/A21wIyYSi2J4KvZtL; SAPISID=IAzttnMi1S3MdJDv/A4ybYudwcofhwA8gA; __Secure-3PAPISID=IAzttnMi1S3MdJDv/A4ybYudwcofhwA8gA; OTZ=5687668_32_32__32_; NID=204=fblQ_6pXpYwCNy6yN1zQ2EoRT9VaU0_WOpdIxFmAzAKtr0QKP4hwzIj8yU0s2AyeWTWCc9m7tWkeVjTwKXgp4e4cLKB7UGNyuUIJAbmirj9hT3hXFQ4wUvXa-NCgJIJ-38ZiAyfOJSZsVJEVcWodA1nUQzPfaH06WU2SIlwd1M8qK-GEp1MD569Xth3e3BeB8qt9-vIVSibpZc_aVbOKp38p4yshqvBv5LbPajmcuKkP-1QsY3Uwe_b546Ei60KN8eJ44guVRZ6dBZI; 1P_JAR=2020-10-30-11; PLAY_ACTIVE_ACCOUNT=ICrt_XL61NBE_S0rhk8RpG0k65e0XwQVdDlvB6kxiQ8=suvashish.halder@gmail.com; OGPC=19009731-1:19008539-5:19010599-2:19015969-1:19011552-1:; OGP=-19009731:-19015969:-19011552:-19010599:-19008539:; SIDCC=AJi4QfE5wzww8DWa6SMq2omQvpVRaI_7hUuhZGfaOHbga3NwN7OcIiMv9ILYSMgKrY1i4pNwKA; __Secure-3PSIDCC=AJi4QfFU93FMessFFLviRRPm3buykQeAylLNYGhgFVIrdIde1InWntlWllI0sA3h6dr6EDMgmGQ',
            'origin': 'https://play.google.com',
            'referer': 'https://play.google.com',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4287.0 Safari/537.36 Edg/88.0.673.0',
            'x-same-domain': '1',
        }
        yield Request(url=url,
                      method='POST',
                      dont_filter=True,
                      headers=headers,
                      callback=self.parse_play)

    def parse_play(self, response):
        open_in_browser(response)
Upvotes: 2
Views: 734
Reputation: 99
You get only 50 because the rest of the content is loaded dynamically via JavaScript, and Scrapy does not execute JavaScript. To see this for yourself, disable JavaScript in your browser and reload the page.
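One way to check this without a browser is to count the app links present in the raw HTML that the server returns. The sketch below uses only the standard library. The `/store/apps/details?id=` prefix, the `AppLinkCounter` class, the `count_app_links` helper and the sample markup are illustrative assumptions, not something taken from the Play Store itself, so adjust them to the markup you actually receive.

```python
from html.parser import HTMLParser


class AppLinkCounter(HTMLParser):
    """Collect unique app ids from <a href> values in static HTML."""

    # Assumed href prefix for app detail pages; verify it against the real page.
    APP_PATH = "/store/apps/details?id="

    def __init__(self):
        super().__init__()
        self.app_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if self.APP_PATH in href:
            # Keep only the package id so repeated links to one app count once.
            app_id = href.split(self.APP_PATH, 1)[1].split("&", 1)[0]
            self.app_ids.add(app_id)


def count_app_links(html):
    """Return the number of distinct apps linked from the given HTML string."""
    parser = AppLinkCounter()
    parser.feed(html)
    return len(parser.app_ids)


# Made-up sample markup: two links to one app, one to another, one unrelated.
sample = """
<a href="/store/apps/details?id=com.example.quotes">Quotes</a>
<a href="/store/apps/details?id=com.example.quotes">Quotes (icon)</a>
<a href="/store/apps/details?id=com.example.daily&amp;hl=en">Daily</a>
<a href="/store/search?q=quotes">Search</a>
"""
print(count_app_links(sample))  # 2
```

Inside a spider's `parse` method you could call `count_app_links(response.text)`. If the count stays at about 50, the remaining results are not in the initial HTML and are being added later by JavaScript.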
Upvotes: 1