userHG

Reputation: 589

Scraping Android Store

I'm trying to scrape Android Store pages with Beautiful Soup in order to have a file that contains a list of packages. Here is my code:

from requests import get
from bs4 import BeautifulSoup
import json
import time

url = 'https://play.google.com/store/apps/collection/topselling_free'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
file = open("applications.txt","w+")
for i in range(0,60):
    # only 60 cards are present in the initial HTML; a larger range raises "IndexError: list index out of range"
    print(app_container[i].div['data-docid'])
    file.write(app_container[i].div['data-docid'] + "\n")

file.close()

The problem is that I can only collect 60 package names because the JavaScript isn't loaded, and to load more apps I have to scroll down. How can I reproduce this behaviour in Python to get more than 60 results?

Upvotes: 1

Views: 263

Answers (2)

backtrack

Reputation: 8154

My suggestion is to use Scrapy with Splash:

http://splash.readthedocs.io/en/stable/scripting-tutorial.html

Splash is a headless browser, so you can render JavaScript and execute scripts against the page. Here is a code sample:

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end        
    return splash:html()
end

To render this script, use the 'execute' endpoint instead of the render.html endpoint:

script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute', 
                            args={'wait':2, 'lua_source': script}, ...)
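
To put these pieces together, a minimal spider might look roughly like the sketch below. The spider name is a placeholder, the data-docid selector follows the markup from the question, and it assumes scrapy-splash is installed and enabled in the project settings:

import scrapy
from scrapy_splash import SplashRequest

# the Lua scrolling script shown above
script = """<Lua script> """

class PlayStoreSpider(scrapy.Spider):
    # spider name and start URL are illustrative
    name = 'playstore'
    start_urls = ['https://play.google.com/store/apps/collection/topselling_free']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='execute',
                                args={'wait': 2, 'lua_source': script})

    def parse(self, response):
        # response.text is the rendered HTML after the scrolling script has run;
        # the selector mirrors the data-docid attribute used in the question
        for docid in response.css('div.card div::attr(data-docid)').getall():
            yield {'package': docid}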

I am using Scrapy for crawling, and I believe you will need to run the crawl periodically. You can use Scrapyd to run the Scrapy spider on a schedule.

I got the script above from here
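
For the Scrapyd part, a rough illustration of scheduling a run of a deployed spider through its schedule.json API (the host, project, and spider names are placeholders):

import requests

# ask a local Scrapyd instance to schedule one run of the spider
# (host, project and spider names are placeholders)
response = requests.post('http://localhost:6800/schedule.json',
                         data={'project': 'playstore_project', 'spider': 'playstore'})
print(response.json())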

Upvotes: 1

Charles Landau

Reputation: 4275

Would you consider a more fully featured scraper? Scrapy is purpose-built for the job: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016

Selenium is like driving a browser with code: if you can do it in person, you can probably do it in Selenium, as in the sketch below (see also: scrape websites with infinite scrolling).
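
A rough sketch of that approach, reusing the Beautiful Soup parsing from the question (it assumes a Chrome driver is available, and the scroll count and delay are arbitrary):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'https://play.google.com/store/apps/collection/topselling_free'
driver = webdriver.Chrome()  # assumes chromedriver is installed
driver.get(url)

# scroll to the bottom a few times so the JavaScript loads more cards
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

# hand the rendered page source to Beautiful Soup, as in the question
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
print(len(app_container))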

Others have concluded that bs4 and requests are not enough for the task: How to load all entries in an infinite scroll at once to parse the HTML in python

Also note that scraping can be a bit of a grey area and that you should always try to be aware and respectful of site policies. Their TOS and robots.txt are always good places to peruse.
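
For instance, Python's standard library can check robots.txt before you crawl (the user agent here is just a generic wildcard):

from urllib.robotparser import RobotFileParser

# fetch and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://play.google.com/robots.txt')
rp.read()

# check whether a generic user agent may fetch the page from the question
url = 'https://play.google.com/store/apps/collection/topselling_free'
print(rp.can_fetch('*', url))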

Upvotes: 1
