Reputation: 767
I'm scraping a webpage using multithreading and random proxies. My home PC handles this fine with however many processes are required (in the current code I've set it to 100); RAM usage sits around 2.5 GB. However, when I run this on my CentOS VPS I get a generic 'Killed' message and the program terminates. With 100 processes running I get the 'Killed' error very, very quickly. I reduced it to a more reasonable 8 and still got the same error, but after a much longer period. Based on a bit of research, I'm assuming the 'Killed' error is related to memory usage. Without multithreading, the error does not occur.
So, what can I do to optimise my code to still run quickly, but not use so much memory? Is my best bet to just reduce the number of processes even further? And can I monitor my memory usage from within Python while the program is running?
Edit: I just realised my VPS has 256 MB of RAM vs 24 GB on my desktop, which was something I didn't consider when writing the code originally.
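Edit 2: For the last question (monitoring memory from inside the script), a minimal sketch using the standard-library resource module (assuming Linux, where ru_maxrss is reported in kilobytes):

import resource

def peak_memory_mb():
    #Peak resident set size of this process: kilobytes on Linux, bytes on OS X
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print 'Peak memory usage: {:.1f} MB'.format(peak_memory_mb())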
import random
import sys

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

# working_proxies and user_agents are lists defined elsewhere in the script

#Request soup of url, using random proxy / user agent - try different combinations until valid results are returned
def getsoup(url):
    attempts = 0
    while True:
        try:
            proxy = random.choice(working_proxies)
            headers = {'user-agent': random.choice(user_agents)}
            proxy_dict = {'http': 'http://' + proxy}
            r = requests.get(url, headers=headers, proxies=proxy_dict, timeout=5)
            soup = BeautifulSoup(r.text, "html5lib")  #"html.parser"
            totalpages = int(soup.find("div", class_="pagination").text.split(' of ', 1)[1].split('\n', 1)[0])  #Looks for totalpages to verify proper page load
            currentpage = int(soup.find("div", class_="pagination").text.split('Page ', 1)[1].split(' of', 1)[0])
            if totalpages < 5000:  #One particular proxy wasn't returning pagelimit=60 or offset requests properly ..
                break
        except Exception as e:
            # print 'Error! Proxy: {}, Error msg: {}'.format(proxy, e)
            attempts = attempts + 1
            if attempts > 30:
                print 'Too many attempts .. something is wrong!'
                sys.exit()
    return (soup, totalpages, currentpage)

#Scrape one page of ads, connecting via random proxy/user agent
def scrape_url(url):
    soup, totalpages, currentpage = getsoup(url)

    #Extract ads from page soup
    ###[A bunch of code to extract individual ads from the page..]

    # print 'Success! Scraped page #{} of {} pages.'.format(currentpage, totalpages)
    sys.stdout.flush()
    return ads

def scrapeall():
    global currentpage, totalpages, offset

    url = "url"
    _, totalpages, _ = getsoup(url + "0")
    url_list = [url + str(60*i) for i in range(totalpages)]

    # Make the pool of workers
    pool = ThreadPool(100)
    # Open the urls in their own threads and return the results
    results = pool.map(scrape_url, url_list)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()

    flatten_results = [item for sublist in results for item in sublist]  #Flattens the list of lists returned by multithreading
    return flatten_results

adscrape = scrapeall()
Upvotes: 2
Views: 1218
Reputation: 2253
BeautifulSoup is a pure-Python library, and on a mid-range website it will eat a lot of memory. If it's an option, try replacing it with lxml, which is faster and written in C. It might still run out of memory if your pages are large, though.
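For example, here is a rough sketch of the pagination check done with lxml instead of BeautifulSoup (the "pagination" div and the "Page X of Y" text format are assumptions carried over from the code in the question):

import lxml.html

def get_pagination(html):
    #Parse the raw response text with lxml's C-based HTML parser
    tree = lxml.html.fromstring(html)
    #Same assumption as the original code: a <div class="pagination"> containing "Page X of Y"
    pagination_text = tree.find_class("pagination")[0].text_content()
    currentpage = int(pagination_text.split('Page ', 1)[1].split(' of', 1)[0])
    totalpages = int(pagination_text.split(' of ', 1)[1].split('\n', 1)[0])
    return currentpage, totalpages

In getsoup() you would then return the parsed tree (or just the extracted numbers) instead of a BeautifulSoup object, and do the ad extraction with lxml as well.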
As already suggested in the comments, you could use queue.Queue to store responses. A better version would be to save the responses to disk, store the filenames in a queue, and parse them in a separate process. For that you can use the multiprocessing library. If parsing runs out of memory and gets killed, the fetching continues. This pattern is known as "fork and die" and is a common workaround for Python using too much memory.
You also need a way to see which responses failed to parse.
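A minimal sketch of that pattern, with download() and extract_ads() as hypothetical stand-ins for the request and ad-extraction code from the question:

import os
import tempfile
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool

def fetch_to_disk(url):
    #Download the page and write it straight to a temp file, so only a filename is held in memory
    fd, path = tempfile.mkstemp(suffix='.html')
    with os.fdopen(fd, 'w') as f:
        f.write(download(url))          #download() stands in for the requests/proxy code
    return path

def parse_and_record(path, results, failures):
    try:
        with open(path) as f:
            results.extend(extract_ads(f.read()))   #extract_ads() stands in for the parsing code
    except Exception:
        failures.append(path)           #remember which responses failed to parse

def scrapeall(urls):
    #Fetch with a small thread pool; responses go to disk instead of staying in RAM
    pool = ThreadPool(8)
    paths = pool.map(fetch_to_disk, urls)
    pool.close()
    pool.join()

    #Parse each file in its own process ("fork and die"): if a parser is
    #killed for using too much memory, the fetched files are still on disk
    manager = multiprocessing.Manager()
    results, failures = manager.list(), manager.list()
    for path in paths:
        p = multiprocessing.Process(target=parse_and_record, args=(path, results, failures))
        p.start()
        p.join()
        if p.exitcode != 0 and path not in failures:
            failures.append(path)       #the parser process died outright

    return list(results), list(failures)

The failures list then tells you which saved responses need to be re-parsed (or re-fetched) later.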
Upvotes: 3