
Reputation: 129

How to scrape the Google Play Store using Scrapy

I'm trying to scrape the Google Play Store using Scrapy. By default I get only 50 links, although I can see 257 links in total on the page. I tried sending the request headers and also a FormRequest, but both methods failed. Here is the error I'm receiving:

2020-10-30 18:16:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://play.google.com/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j>: HTTP status code is not handled or not allowed

The target URL, https://play.google.com/store/search?q=quotes&c=apps, lists 257 results, but I get only 50 by default. The code I tried is given below. Please help.

from scrapy import Spider
from scrapy.http import Request, FormRequest
from scrapy.utils.response import open_in_browser


class PlaySpider(Spider):
    name = 'play'
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/search?q=quotes&c=apps']

    # def parse(self, response):
    #     data = {
    #         'f.req': '%5B9%2C1%2C1.25%2C%5Bnull%2C1350%2C2400%5D%2C%5Bnull%2C327%2C1344%5D%2C%5Btrue%2Ctrue%2Ctrue%2Ctrue%5D%2C%5Bfalse%2C2%2C2%5D%5D&',
    #         'at': 'AE2DSODV9YrtVLLv1YugtW097VJD%3A1604056672307&'
    #     }
    #     yield FormRequest(
    #         url='https://play.google.com/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j',
    #         formdata=data,
    #         callback=self.parse_play
    #     )
    #
    # def parse_play(self, response):
    #     open_in_browser(response)


    def parse(self, response):
        url = 'https://play.google.com/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j'
        headers = {
            'authority': 'play.google.com',
            'method': 'POST',
            'path': '/_/PlayStoreUi/browserinfo?f.sid=-3103376089553482051&bl=boq_playuiserver_20201027.06_p0&hl=en&authuser=0&soc-app=121&soc-platform=1&soc-device=1&_reqid=4962278&rt=j',
            'scheme': 'https',
            'accept': '*/*',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9,mt;q=0.8,fr;q=0.7,ru;q=0.6,bn;q=0.5,de;q=0.4',
            'content-length': '182',
            'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
            'cookie': 'SID=2we9KP-jDu8bZ3iag5AcctRssfi1KPyfUWFYpxI2W0TxFwyqOaoCBO3CvfCHuoK60oQS7w.; __Secure-3PSID=2we9KP-jDu8bZ3iag5AcctRssfi1KPyfUWFYpxI2W0TxFwyqO8JF9jd1Qit9bAylaNfesQ.; HSID=AL01amZ-pbltbWMV7; SSID=AfC24bLLuvHWYWazZ; APISID=UWuhW7qZn0Yg6zUk/A21wIyYSi2J4KvZtL; SAPISID=IAzttnMi1S3MdJDv/A4ybYudwcofhwA8gA; __Secure-3PAPISID=IAzttnMi1S3MdJDv/A4ybYudwcofhwA8gA; OTZ=5687668_32_32__32_; NID=204=fblQ_6pXpYwCNy6yN1zQ2EoRT9VaU0_WOpdIxFmAzAKtr0QKP4hwzIj8yU0s2AyeWTWCc9m7tWkeVjTwKXgp4e4cLKB7UGNyuUIJAbmirj9hT3hXFQ4wUvXa-NCgJIJ-38ZiAyfOJSZsVJEVcWodA1nUQzPfaH06WU2SIlwd1M8qK-GEp1MD569Xth3e3BeB8qt9-vIVSibpZc_aVbOKp38p4yshqvBv5LbPajmcuKkP-1QsY3Uwe_b546Ei60KN8eJ44guVRZ6dBZI; 1P_JAR=2020-10-30-11; PLAY_ACTIVE_ACCOUNT=ICrt_XL61NBE_S0rhk8RpG0k65e0XwQVdDlvB6kxiQ8=suvashish.halder@gmail.com; OGPC=19009731-1:19008539-5:19010599-2:19015969-1:19011552-1:; OGP=-19009731:-19015969:-19011552:-19010599:-19008539:; SIDCC=AJi4QfE5wzww8DWa6SMq2omQvpVRaI_7hUuhZGfaOHbga3NwN7OcIiMv9ILYSMgKrY1i4pNwKA; __Secure-3PSIDCC=AJi4QfFU93FMessFFLviRRPm3buykQeAylLNYGhgFVIrdIde1InWntlWllI0sA3h6dr6EDMgmGQ',
            'origin': 'https://play.google.com',
            'referer': 'https://play.google.com',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4287.0 Safari/537.36 Edg/88.0.673.0',
            'x-same-domain': '1',
        }

        yield Request(url=url,
                      method='POST',
                      dont_filter=True,
                      headers=headers,
                      callback=self.parse_play)

    def parse_play(self, response):
        open_in_browser(response)

Upvotes: 2

Views: 734

Answers (1)

Jonas

Reputation: 99

You get only 50 because the rest of the content is loaded dynamically via JavaScript. To see what Scrapy sees, disable JavaScript in your browser and reload the page.
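
To check how many app links are really present in the HTML that arrives without JavaScript, something like the following could be used. It is only a minimal sketch built on the standard library: `AppLinkParser` and `extract_app_ids` are hypothetical names, and it assumes app links are `<a>` tags pointing at `/store/apps/details?id=...`, which is the usual Play Store URL shape but may not match the current markup.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class AppLinkParser(HTMLParser):
    """Collects unique app ids from <a href=".../store/apps/details?id=..."> tags."""

    def __init__(self):
        super().__init__()
        self.app_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        # Assumption: app detail pages live under this path.
        if parsed.path != "/store/apps/details":
            return
        ids = parse_qs(parsed.query).get("id", [])
        # Keep the first occurrence only, so repeated links are not double counted.
        if ids and ids[0] not in self.app_ids:
            self.app_ids.append(ids[0])


def extract_app_ids(html):
    """Return the app ids found in the given HTML, in document order."""
    parser = AppLinkParser()
    parser.feed(html)
    return parser.app_ids
```

Inside a spider callback you could call `extract_app_ids(response.text)` and log its length. If the number matches what you see in the browser with JavaScript disabled, the missing results are not in the initial HTML at all.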

Upvotes: 1
