Reputation: 86
There are exactly 100 items per page. I'm assuming some kind of memory limit is causing the process to be killed. I also suspect that appending the items to a single list variable is not best practice for memory efficiency. Would opening a text file and writing to it as I go be better (see the sketch after my code below)? I did a test with 10 pages and it built the list successfully in about 12 seconds. When I try with 9500 pages, however, the process gets killed automatically after about an hour.
import requests
from bs4 import BeautifulSoup
import timeit

def lol_scrape():
    start = timeit.default_timer()
    summoners_listed = []
    for i in range(9500):
        URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i + 1)
        user_agent = {"user-agent": "..."}  # actual user-agent string omitted
        page = requests.get(URL, headers=user_agent)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find('tbody')
        summoners = results.find_all('tr')
        for i in range(len(summoners)):
            name = summoners[i].find('strong')
            summoners_listed.append(name.string)
    stop = timeit.default_timer()
    print('Time: ', stop - start)
    return summoners_listed
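For reference, here is a minimal sketch of the "write to a text file as I go" idea I'm asking about, assuming the same page structure; the function name, the output path, and the user-agent placeholder are just for illustration:

import requests
from bs4 import BeautifulSoup

def lol_scrape_to_file(pages=9500, out_path="summoners.txt"):
    # Write each page's names straight to disk so nothing large stays in memory
    with open(out_path, "w", encoding="utf-8") as f:
        for i in range(pages):
            URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i + 1)
            user_agent = {"user-agent": "..."}  # placeholder
            page = requests.get(URL, headers=user_agent)
            soup = BeautifulSoup(page.content, "html.parser")
            rows = soup.find('tbody').find_all('tr')
            for row in rows:
                name = row.find('strong')
                if name is not None and name.string:
                    f.write(name.string + "\n")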
Upvotes: 1
Views: 536
Reputation: 86
Credit to @1extralime
All I did was write a CSV for every page instead of continually appending to one super long list.
import requests
from bs4 import BeautifulSoup
import timeit
import pandas as pd

def lol_scrape():
    start = timeit.default_timer()
    for i in range(6500):
        # Moved variable inside loop to reset it every iteration
        summoners_listed = []
        URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i + 1)
        user_agent = {"user-agent": "..."}  # actual user-agent string omitted
        page = requests.get(URL, headers=user_agent)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find('tbody')
        summoners = results.find_all('tr')
        for x in range(len(summoners)):
            name = summoners[x].find('strong')
            summoners_listed.append(name.string)
        # Make a new df with the list values then save to a new csv
        df = pd.DataFrame(summoners_listed)
        df.to_csv('all_summoners/summoners_page' + str(i + 1))
    stop = timeit.default_timer()
    print('Time: ', stop - start)
Also, as a note to my future self or anyone else reading: this method is way superior because, if the process had failed at any time, I would still have had all the successful CSVs saved and could have just restarted where it left off.
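To make that restart concrete, here is a small sketch (the helper name and OUT_DIR are just for illustration) of how a rerun could skip pages that already have a CSV on disk:

import os

OUT_DIR = "all_summoners"  # same folder the per-page CSVs are written to

def pages_to_scrape(total_pages=6500):
    # Yield only the page numbers that don't already have a saved CSV,
    # so a rerun picks up where the last run stopped.
    for page in range(1, total_pages + 1):
        if not os.path.exists(os.path.join(OUT_DIR, "summoners_page" + str(page))):
            yield page

The scraping loop would then iterate over pages_to_scrape() instead of range(6500).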
Upvotes: 2