anuar

Reputation: 86

Scraping large amount of data with beautifulsoup, process being killed?

There are exactly 100 items per page. I'm assuming some kind of memory limit is causing the process to be killed. I also suspect that appending every item to a single list variable is not best practice for memory efficiency. Would opening a text file and writing to it as I go be better? A test with 10 pages builds the list successfully in about 12 seconds, but when I try with 9500 pages the process gets killed automatically after about an hour.

import requests
from bs4 import BeautifulSoup
import timeit

def lol_scrape():
  start = timeit.default_timer()

  summoners_listed = []
  for i in range(9500):
    URL = "https://www.op.gg/leaderboards/tier?region=na&page="+str(i+1)
    user_agent = {"User-Agent": "Mozilla/5.0"}  # placeholder user-agent string
    page = requests.get(URL, headers = user_agent)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('tbody')
    summoners = results.find_all('tr')
    for i in range(len(summoners)):
      name = summoners[i].find('strong')
      summoners_listed.append(name.string)
    
  stop = timeit.default_timer()

  print('Time: ', stop - start)
  return summoners_listed
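
For comparison, here is a sketch of the text-file idea from the question: write each name to disk as it is scraped instead of accumulating one big list. The User-Agent value and the output path are placeholders, not values from the original script.

import requests
from bs4 import BeautifulSoup

def lol_scrape_to_file(pages=9500, path='summoners.txt'):
  user_agent = {"User-Agent": "Mozilla/5.0"}  # placeholder user-agent string
  with open(path, 'w', encoding='utf-8') as f:
    for i in range(pages):
      URL = "https://www.op.gg/leaderboards/tier?region=na&page="+str(i+1)
      page = requests.get(URL, headers=user_agent)
      soup = BeautifulSoup(page.content, "html.parser")
      results = soup.find('tbody')
      for row in results.find_all('tr'):
        name = row.find('strong')
        if name is not None:
          # Write immediately so nothing accumulates in memory
          f.write(str(name.string) + '\n')

This keeps memory usage roughly flat no matter how many pages are scraped, since each name is flushed to disk rather than retained in a list.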

Upvotes: 1

Views: 536

Answers (1)

anuar

Reputation: 86

Credit to @1extralime

All I did was make a CSV for every page instead of continually appending to one super long list.

import requests
from bs4 import BeautifulSoup
import timeit
import pandas as pd

def lol_scrape():
  start = timeit.default_timer()

  for i in range(6500):
    # Moved variable inside loop to reset it every iteration
    summoners_listed = []
    URL = "https://www.op.gg/leaderboards/tier?region=na&page="+str(i+1)
    user_agent = {"User-Agent": "Mozilla/5.0"}  # placeholder user-agent string
    page = requests.get(URL, headers = user_agent)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('tbody')
    summoners = results.find_all('tr')
    for x in range(len(summoners)):
      name = summoners[x].find('strong')
      summoners_listed.append(name.string)
    
    # Make a new df with the list values then save to a new csv
    df = pd.DataFrame(summoners_listed)
    df.to_csv('all_summoners/summoners_page'+str(i+1))  
    
  stop = timeit.default_timer()

  print('Time: ', stop - start)

Also, as a note to my future self or anyone else reading: this method is way superior because, had the process failed at any time, all the successful CSVs were already saved and I could just restart where it left off.
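
A minimal sketch of that restart idea, assuming the same all_summoners/ output paths as above: check whether a page's CSV already exists and skip the request if it does.

import os

def already_scraped(page_number, out_dir='all_summoners'):
  # True if this page's CSV was already saved by a previous run
  return os.path.exists(os.path.join(out_dir, 'summoners_page' + str(page_number)))

# Inside the scraping loop, before requests.get:
#   if already_scraped(i + 1):
#     continue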

Upvotes: 2
