Reputation: 5629
I basically have the same problem as the guy here: Python high memory usage with BeautifulSoup
My BeautifulSoup objects are not garbage collected, resulting in significant RAM consumption. Here is the code I use ("entry" is an object I get from an RSS web page; it is basically an RSS article):
import arrow
import requests
from bs4 import BeautifulSoup

title = entry.title
date = arrow.get(entry.updated).format('YYYY-MM-DD')

try:
    url = entry.feedburner_origlink
except AttributeError:
    url = entry.link

abstract = None
graphical_abstract = None
author = None

soup = BeautifulSoup(entry.summary)

r = soup("img", align="center")
print(r)
if r:
    graphical_abstract = r[0]['src']

# 'response' is the article page fetched earlier (e.g. response = requests.get(url))
if response.status_code == requests.codes.ok:
    soup = BeautifulSoup(response.text)

    # Get the title (w/ html)
    title = soup("h2", attrs={"class": "alpH1"})
    if title:
        title = title[0].renderContents().decode().lstrip().rstrip()

    # Get the abstract (w/ html)
    r = soup("p", xmlns="http://www.rsc.org/schema/rscart38")
    if r:
        abstract = r[0].renderContents().decode()
        if abstract == "":
            abstract = None

    r = soup("meta", attrs={"name": "citation_author"})
    if r:
        author = [tag['content'] for tag in r]
        author = ", ".join(author)
So in the docs (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Improving%20Memory%20Usage%20with%20extract) they say the problem can come from the fact that, as long as you use a tag contained in the soup object, the whole soup object stays in memory. So I tried something like this (for every place I use a soup object in the previous example):
r = soup("img", align="center")[0].extract()
graphical_abstract = r['src']
But still, the memory is not freed when the program exits the scope.
So I'm looking for an efficient way to delete a soup object from memory. Do you have any ideas?
Upvotes: 0
Views: 1829
Reputation: 21
To avoid the large memory footprint of BeautifulSoup objects, try the SoupStrainer class.
It worked perfectly for me.
from bs4 import BeautifulSoup, SoupStrainer

only_span = SoupStrainer('span')
only_div = SoupStrainer('div')
only_h1 = SoupStrainer('h1')

soup_h1 = BeautifulSoup(response.text, 'lxml', parse_only=only_h1)
soup_span = BeautifulSoup(response.text, 'lxml', parse_only=only_span)
soup_div = BeautifulSoup(response.text, 'lxml', parse_only=only_div)

try:
    name = soup_h1.find('h1', id='itemTitle').find(text=True, recursive=False)
except:
    name = 'Noname'
try:
    price = soup_span.find('span', id='prcIsum').text.strip()
etc...
Even though we create three BeautifulSoup objects when using SoupStrainer, together they consume much less RAM than a single BeautifulSoup object parsed without SoupStrainer.
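If you want to check the difference yourself, here is a minimal sketch (not from the original answer) using the standard-library tracemalloc module; the synthetic html_text string is just a stand-in for your own response.text:

import tracemalloc
from bs4 import BeautifulSoup, SoupStrainer

# Any large HTML document will do; this one is synthetic
html_text = "<h1>title</h1>" + "<div><span>x</span></div>" * 50000

def peak_kib(parse_only=None):
    # Measure peak allocations while parsing once
    tracemalloc.start()
    BeautifulSoup(html_text, 'lxml', parse_only=parse_only)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024

print("full parse:", peak_kib(), "KiB")
print("h1 only:", peak_kib(SoupStrainer('h1')), "KiB")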
Upvotes: 2
Reputation: 1083
I have had a similar issue and found out that, despite my attention, I was still storing some BeautifulSoup NavigableString and/or ResultSet objects, which kept the soup in memory, as you already know. I'm not sure whether both lines are needed (I'll let you try), but I remember that extracting the text this way fixed the problem:
ls_result = [unicode(x) for x in soup_bloc.findAll(text = True)]
str_result = unicode(soup_bloc.text)
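In Python 3 the same idea would look roughly like this (a sketch of my own, using str() instead of unicode(); the soup.decompose() call is an extra precaution, not something the answer above mentions): copy everything you need out of the soup as plain built-in strings, so nothing keeps a reference into the parse tree.

from bs4 import BeautifulSoup

def extract_strings(html):
    soup = BeautifulSoup(html, 'lxml')
    # str() turns NavigableString objects into plain strings
    # that hold no reference back to the parse tree
    ls_result = [str(x) for x in soup.find_all(text=True)]
    str_result = str(soup.text)
    soup.decompose()  # optionally break the tree apart explicitly
    return ls_result, str_result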
Upvotes: 1