Reputation: 91
I have this code which scrapes usernames:
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import itertools

def fetch_and_parse_names(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")
    return (a.string for a in soup.findAll(href=USERNAME_PATTERN))

def get_names(urls):
    # Create a concurrent executor
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Apply the fetch-and-parse function concurrently with executor.map,
        # and join the results together
        return itertools.chain.from_iterable(executor.map(fetch_and_parse_names, urls))

def get_url(region, page):
    return 'http://lolprofile.net/leaderboards/%s/%d' % (region, page)
Then all the names get collected like this:
urls = [get_url(region, i) for i in range(start, end + 1)]
names = (name.lower() for name in get_names(urls) if is_valid_name(name))
After an hour of running I get memory allocation errors. Obviously I know why this happens, but how can I fix it? I was thinking of just getting the usernames from a single page (or a few pages) at a time, writing them to a file immediately, clearing the list, and repeating, but I didn't know how to implement this cleanly.
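Roughly the kind of thing I mean, as an untested sketch (it reuses get_url, get_names and is_valid_name from above):

def dump_names(region, start, end, out_path, batch_size=10):
    # Process a few pages at a time, write the valid names straight to disk,
    # and let each batch be garbage-collected before fetching the next one.
    with open(out_path, 'w') as out:
        for first in range(start, end + 1, batch_size):
            last = min(first + batch_size - 1, end)
            urls = [get_url(region, i) for i in range(first, last + 1)]
            for name in get_names(urls):
                if is_valid_name(name):
                    out.write(name.lower() + '\n')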
Upvotes: 1
Views: 145
Reputation: 69032
The code you use keeps all the downloaded documents in memory for two reasons:
1. You return a.string, which is not just a str but a bs4.element.NavigableString, and as such it keeps a reference to its parent and ultimately to the whole document tree.
2. You return a generator expression, which keeps a reference to the document (soup) until it is consumed.
One way to fix this would be to use:
return [str(a.string) for a in soup.findAll(href=USERNAME_PATTERN)]
This way no references to the soup objects are kept, the expression is executed immediately, and a list of plain str objects is returned.
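For illustration, a tiny standalone example of the difference (the HTML snippet here is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/user/foo">foo</a>', "lxml")
name = soup.find("a").string

print(type(name))        # <class 'bs4.element.NavigableString'>
print(type(str(name)))   # <class 'str'>

# The NavigableString still holds a reference back into the parsed tree,
# so keeping it alive keeps the whole document alive:
print(name.parent)       # <a href="/user/foo">foo</a>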
Upvotes: 2
Reputation: 331
You can use the Python resource library to increase the memory allocated to your process; since the threads of a process use their parent process's memory, they cannot allocate extra memory of their own.
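For reference, a minimal sketch of inspecting and raising the address-space limit with the resource module (Unix only; an unprivileged process can only raise the soft limit up to the existing hard limit):

import resource

# Current soft/hard limits on the process's total address space (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print(soft, hard)

# Raise the soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_AS, (hard, hard))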
Upvotes: 1