Reputation: 91
I have this code which scrapes usernames:
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import itertools

def fetch_and_parse_names(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")
    return (a.string for a in soup.findAll(href=USERNAME_PATTERN))

def get_names(urls):
    # Create a concurrent executor
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Apply the fetch-and-parse function concurrently with executor.map,
        # and join the results together
        return itertools.chain.from_iterable(executor.map(fetch_and_parse_names, urls))

def get_url(region, page):
    return 'http://lolprofile.net/leaderboards/%s/%d' % (region, page)
Then all the names get collected like this:
urls = [get_url(region, i) for i in range(start, end + 1)]
names = (name.lower() for name in get_names(urls) if is_valid_name(name))
After an hour of running I get memory allocation errors. Obviously I know why this happens, but how can I fix it? I was thinking of just getting the usernames from a single page (or a few pages) at a time, writing them to a file immediately, clearing the list, and repeating, but I didn't know how to implement this cleanly.
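Roughly the kind of thing I mean, as an untested sketch (it reuses get_url, get_names and is_valid_name from above):

def dump_names(region, start, end, out_path, batch_size=10):
    # Process a few pages at a time, write the valid names straight to disk,
    # and let each batch be garbage-collected before fetching the next one.
    with open(out_path, 'w') as out:
        for first in range(start, end + 1, batch_size):
            last = min(first + batch_size - 1, end)
            urls = [get_url(region, i) for i in range(first, last + 1)]
            for name in get_names(urls):
                if is_valid_name(name):
                    out.write(name.lower() + '\n')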
Upvotes: 1
Views: 145
Reputation: 69032
The code you use keeps all the downloaded documents in memory for two reasons:
1. You return a.string, which is not just a str but a bs4.element.NavigableString, and as such it keeps a reference to its parent and ultimately to the whole document tree.
2. You return a generator expression, which keeps a reference to the document (soup) until it is consumed.
One way to fix this would be to use:
return [str(a.string) for a in soup.findAll(href=USERNAME_PATTERN)]
This way no references to the soup objects are kept, the expression is executed immediately, and a list of plain str objects is returned.
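For illustration, a tiny standalone example of the difference (the HTML snippet here is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/user/foo">foo</a>', "lxml")
name = soup.find("a").string

print(type(name))        # <class 'bs4.element.NavigableString'>
print(type(str(name)))   # <class 'str'>

# The NavigableString still holds a reference back into the parsed tree,
# so keeping it alive keeps the whole document alive:
print(name.parent)       # <a href="/user/foo">foo</a>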
Upvotes: 2
Reputation: 331
You can use the Python resource library to increase the memory allocated to your process; since the threads of a process use their parent process's memory, they cannot allocate extra memory of their own.
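For reference, a minimal sketch of inspecting and raising the address-space limit with the resource module (Unix only; an unprivileged process can only raise the soft limit up to the existing hard limit):

import resource

# Current soft/hard limits on the process's total address space (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print(soft, hard)

# Raise the soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_AS, (hard, hard))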
Upvotes: 1