Reputation: 589
I'm trying to scrape Google Play Store pages with Beautiful Soup in order to produce a file that contains a list of package names. Here is my code:
from requests import get
from bs4 import BeautifulSoup
import json
import time

url = 'https://play.google.com/store/apps/collection/topselling_free'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

# Each app card carries the package name in its data-docid attribute.
app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")

file = open("applications.txt", "w+")
for i in range(0, 60):
    # if range > 60: "IndexError: list index out of range"
    print(app_container[i].div['data-docid'])
    file.write(app_container[i].div['data-docid'] + "\n")
file.close()
The problem is that I can only collect 60 package names, because the JavaScript isn't executed: to load more apps on the page you have to scroll down. How can I reproduce this behaviour in Python to get more than 60 results?
Upvotes: 1
Views: 263
Reputation: 8154
My suggestion is to use Scrapy with Splash:
http://splash.readthedocs.io/en/stable/scripting-tutorial.html
Splash is a headless browser that can render JavaScript and execute scripts. Here is a code sample:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
To render this script, use the 'execute' endpoint instead of the render.html endpoint:
script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute',
                            args={'wait': 2, 'lua_source': script}, ...)
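For reference, a minimal spider sketch along these lines might look like the following (the spider class, its name, and the BeautifulSoup parsing are additions based on the question, not part of the original answer; it assumes a Splash instance is running and that the scrapy-splash middlewares and SPLASH_URL are configured in settings.py):

# Sketch only: assumes Splash is reachable (e.g. via Docker on localhost:8050)
# and scrapy-splash is set up in the project settings.
import scrapy
from bs4 import BeautifulSoup
from scrapy_splash import SplashRequest

# The Lua scrolling script from above, embedded as a string.
scroll_script = """
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
"""

class PlayStoreSpider(scrapy.Spider):
    name = 'playstore'  # hypothetical spider name
    start_urls = ['https://play.google.com/store/apps/collection/topselling_free']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='execute',
                                args={'wait': 2, 'lua_source': scroll_script})

    def parse(self, response):
        # The rendered HTML now includes the cards loaded while scrolling;
        # parse it the same way as in the question.
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', class_='card no-rationale square-cover apps small')
        for card in cards:
            yield {'package': card.div['data-docid']}

Running it with scrapy crawl playstore should then yield one item per package name, including the cards that only appear after scrolling.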
I am using Scrapy for crawling, and I believe you will need to run the crawl periodically. You can use Scrapyd to run the Scrapy spider.
I got the Lua scroll script from the Splash scripting tutorial linked above.
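If you go the Scrapyd route for periodic runs, one way to schedule the spider is to POST to Scrapyd's schedule.json endpoint; here is a small sketch (the project and spider names are placeholders, and Scrapyd is assumed to be listening on its default port 6800):

# Sketch only: assumes a Scrapyd instance on localhost:6800 and that the
# project has already been deployed (e.g. with scrapyd-client).
import requests

resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'myproject', 'spider': 'playstore'})
print(resp.json())  # Scrapyd returns a job id when the spider is scheduled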
Upvotes: 1
Reputation: 4275
Would you consider a more fully featured scraper? Scrapy is purpose-built for the job: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
Selenium is like driving a browser with code - if you can do it in person, you can probably do it in Selenium: scrape websites with infinite scrolling
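For illustration, a minimal Selenium sketch for this particular page might look like this (it assumes a Chrome driver is available; the number of scrolls and the wait time are arbitrary choices, and the CSS class and data-docid parsing come from the question):

# Sketch only: drive a real browser, scroll a few times so more cards load,
# then hand the rendered HTML to Beautiful Soup as in the question.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get('https://play.google.com/store/apps/collection/topselling_free')

for _ in range(10):  # number of scrolls is an arbitrary choice
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the page time to load the next batch of apps

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

cards = soup.find_all('div', class_='card no-rationale square-cover apps small')
with open('applications.txt', 'w') as f:
    for card in cards:
        f.write(card.div['data-docid'] + '\n')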
Others have concluded that bs4 and requests are not enough for the task: How to load all entries in an infinite scroll at once to parse the HTML in python
Also note that scraping can be a bit of a grey area and that you should always try to be aware and respectful of site policies. Their TOS and robots.txt are always good places to peruse.
Upvotes: 1