Reputation: 767
I'm scraping a webpage using multithreading and random proxies. My home PC handles this fine with however many processes are required (in the current code I've set it to 100); RAM usage sits around 2.5 GB. However, when I run this on my CentOS VPS I get a generic 'Killed' message and the program terminates. With 100 processes running I get the 'Killed' error very, very quickly. I reduced it to a more reasonable 8 and still got the same error, but after a much longer period. Based on a bit of research, I'm assuming the 'Killed' error is related to memory usage. Without multithreading, the error does not occur.
So, what can I do to optimise my code to still run quickly, but not use so much memory? Is my best bet to just reduce the number of processes even further? And can I monitor my memory usage from within Python while the program is running?
Edit: I just realised my VPS has 256 MB of RAM vs 24 GB on my desktop, which was something I didn't consider when writing the code originally.
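Edit 2: For the last question (monitoring memory from inside the script), a minimal sketch using the standard-library resource module (assuming Linux, where ru_maxrss is reported in kilobytes):

import resource

def peak_memory_mb():
    #Peak resident set size of this process: kilobytes on Linux, bytes on OS X
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print 'Peak memory usage: {:.1f} MB'.format(peak_memory_mb())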
import random
import sys

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

# working_proxies and user_agents are lists defined elsewhere in the script

#Request soup of url, using random proxy / user agent - try different combinations until valid results are returned
def getsoup(url):
    attempts = 0
    while True:
        try:
            proxy = random.choice(working_proxies)
            headers = {'user-agent': random.choice(user_agents)}
            proxy_dict = {'http': 'http://' + proxy}
            r = requests.get(url, headers=headers, proxies=proxy_dict, timeout=5)
            soup = BeautifulSoup(r.text, "html5lib")  #"html.parser"
            totalpages = int(soup.find("div", class_="pagination").text.split(' of ', 1)[1].split('\n', 1)[0])  #Looks for totalpages to verify proper page load
            currentpage = int(soup.find("div", class_="pagination").text.split('Page ', 1)[1].split(' of', 1)[0])
            if totalpages < 5000:  #One particular proxy wasn't returning pagelimit=60 or offset requests properly ..
                break
        except Exception as e:
            # print 'Error! Proxy: {}, Error msg: {}'.format(proxy, e)
            attempts = attempts + 1
            if attempts > 30:
                print 'Too many attempts .. something is wrong!'
                sys.exit()
    return (soup, totalpages, currentpage)

#Scrape one page of ads, connecting via random proxy/user agent
def scrape_url(url):
    soup, totalpages, currentpage = getsoup(url)

    #Extract ads from page soup
    ###[A bunch of code to extract individual ads from the page..]

    # print 'Success! Scraped page #{} of {} pages.'.format(currentpage, totalpages)
    sys.stdout.flush()
    return ads

def scrapeall():
    global currentpage, totalpages, offset

    url = "url"
    _, totalpages, _ = getsoup(url + "0")
    url_list = [url + str(60*i) for i in range(totalpages)]

    # Make the pool of workers
    pool = ThreadPool(100)
    # Open the urls in their own threads and return the results
    results = pool.map(scrape_url, url_list)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()

    flatten_results = [item for sublist in results for item in sublist]  #Flattens the list of lists returned by multithreading
    return flatten_results

adscrape = scrapeall()
Upvotes: 2
Views: 1218
Reputation: 2253
BeautifulSoup is a pure-Python library, and on a mid-range website it will eat a lot of memory. If it's an option, try replacing it with lxml, which is faster and written in C. It might still run out of memory if your pages are large, though.
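For example, here is a rough sketch of the pagination check done with lxml instead of BeautifulSoup (the "pagination" div and the "Page X of Y" text format are assumptions carried over from the code in the question):

import lxml.html

def get_pagination(html):
    #Parse the raw response text with lxml's C-based HTML parser
    tree = lxml.html.fromstring(html)
    #Same assumption as the original code: a <div class="pagination"> containing "Page X of Y"
    pagination_text = tree.find_class("pagination")[0].text_content()
    currentpage = int(pagination_text.split('Page ', 1)[1].split(' of', 1)[0])
    totalpages = int(pagination_text.split(' of ', 1)[1].split('\n', 1)[0])
    return currentpage, totalpages

In getsoup() you would then return the parsed tree (or just the extracted numbers) instead of a BeautifulSoup object, and do the ad extraction with lxml as well.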
As already suggested in the comments, you could use queue.Queue to store responses. A better version would be to save the responses to disk, store the filenames in a queue, and parse them in a separate process. For that you can use the multiprocessing library. If parsing runs out of memory and gets killed, the fetching continues. This pattern is known as "fork and die" and is a common workaround for Python using too much memory.
You also need a way to see which responses failed to parse.
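A minimal sketch of that pattern, with download() and extract_ads() as hypothetical stand-ins for the request and ad-extraction code from the question:

import os
import tempfile
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool

def fetch_to_disk(url):
    #Download the page and write it straight to a temp file, so only a filename is held in memory
    fd, path = tempfile.mkstemp(suffix='.html')
    with os.fdopen(fd, 'w') as f:
        f.write(download(url))          #download() stands in for the requests/proxy code
    return path

def parse_and_record(path, results, failures):
    try:
        with open(path) as f:
            results.extend(extract_ads(f.read()))   #extract_ads() stands in for the parsing code
    except Exception:
        failures.append(path)           #remember which responses failed to parse

def scrapeall(urls):
    #Fetch with a small thread pool; responses go to disk instead of staying in RAM
    pool = ThreadPool(8)
    paths = pool.map(fetch_to_disk, urls)
    pool.close()
    pool.join()

    #Parse each file in its own process ("fork and die"): if a parser is
    #killed for using too much memory, the fetched files are still on disk
    manager = multiprocessing.Manager()
    results, failures = manager.list(), manager.list()
    for path in paths:
        p = multiprocessing.Process(target=parse_and_record, args=(path, results, failures))
        p.start()
        p.join()
        if p.exitcode != 0 and path not in failures:
            failures.append(path)       #the parser process died outright

    return list(results), list(failures)

The failures list then tells you which saved responses need to be re-parsed (or re-fetched) later.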
Upvotes: 3