JPFrancoia

Reputation: 5629

Python high memory usage with BeautifulSoup: can't delete object

I basically have the same problem as the guy here: Python high memory usage with BeautifulSoup

My BeautifulSoup objects are not garbage collected, resulting in significant RAM consumption. Here is the code I use ("entry" is an object I get from an RSS web page; it is basically an RSS article).

import arrow
import requests
from bs4 import BeautifulSoup

title = entry.title
date = arrow.get(entry.updated).format('YYYY-MM-DD')

try:
    url = entry.feedburner_origlink
except AttributeError:
    url = entry.link

abstract = None
graphical_abstract = None
author = None

soup = BeautifulSoup(entry.summary)

r = soup("img", align="center")
print(r)
if r:
    graphical_abstract = r[0]['src']

response = requests.get(url)  # assumed fetch of the article page; not shown in the original snippet

if response.status_code == requests.codes.ok:
    soup = BeautifulSoup(response.text)

    # Get the title (w/ html)
    title = soup("h2", attrs={"class": "alpH1"})
    if title:
        title = title[0].renderContents().decode().strip()

    # Get the abstract (w/ html)
    r = soup("p", xmlns="http://www.rsc.org/schema/rscart38")
    if r:
        abstract = r[0].renderContents().decode()
        if abstract == "":
            abstract = None

    r = soup("meta", attrs={"name": "citation_author"})
    if r:
        author = [tag['content'] for tag in r]
        author = ", ".join(author)

So in the doc (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Improving%20Memory%20Usage%20with%20extract) they say the problem can come from the fact that, as long as you use a tag contained in the soup object, the whole soup object stays in memory. So I tried something like this (for every place I use a soup object in the previous example):

    r = soup("img", align="center")[0].extract()
    graphical_abstract = r['src']

But still, the memory is not freed when the program exits the scope.
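For reference, here is a minimal sketch of how I read the documentation's advice: copy every value I need into a plain string so that nothing still points into the parse tree, then drop the tree explicitly (the get_graphical_abstract helper and the soup.decompose() call are my own additions, not something the doc spells out):

from bs4 import BeautifulSoup

def get_graphical_abstract(summary):
    """Return only a plain string, keeping no tag from the tree."""
    soup = BeautifulSoup(summary, "html.parser")

    r = soup("img", align="center")
    # Attribute values are plain strings, so this keeps no reference
    # to the parse tree.
    src = r[0]["src"] if r else None

    # Break the tree's internal references so it can be reclaimed
    # even if a tag is accidentally kept around somewhere.
    soup.decompose()
    return src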

So, I'm looking for an efficient way to delete a soup object from memory. Do you have any ideas?

Upvotes: 0

Views: 1829

Answers (2)

babajoga

Reputation: 21

To avoid the large memory footprint of BeautifulSoup objects, try using the SoupStrainer class.

It worked perfectly for me.

from bs4 import BeautifulSoup, SoupStrainer

only_span = SoupStrainer('span')
only_div = SoupStrainer('div')
only_h1 = SoupStrainer('h1')

soup_h1 = BeautifulSoup(response.text, 'lxml', parse_only=only_h1)
soup_span = BeautifulSoup(response.text, 'lxml', parse_only=only_span)
soup_div = BeautifulSoup(response.text, 'lxml', parse_only=only_div)

try:
    name = soup_h1.find('h1', id='itemTitle').find(text=True, recursive=False)
except AttributeError:
    name = 'Noname'

try:
    price = soup_span.find('span', id='prcIsum').text.strip()
except AttributeError:
    price = None

etc...

Even though we create three BeautifulSoup objects when using SoupStrainer, this consumes much less RAM than parsing everything into a single BeautifulSoup object without SoupStrainer.
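Applied to the tags from the question, a rough sketch could look like this (entry and response are the objects from the question; whether the savings are enough in that setup would need to be measured):

from bs4 import BeautifulSoup, SoupStrainer

# Parse only the tags the question actually reads
only_img = SoupStrainer("img", align="center")
only_meta = SoupStrainer("meta", attrs={"name": "citation_author"})

soup_img = BeautifulSoup(entry.summary, "lxml", parse_only=only_img)
soup_meta = BeautifulSoup(response.text, "lxml", parse_only=only_meta)

img = soup_img.find("img", align="center")
graphical_abstract = img["src"] if img else None

author = ", ".join(tag["content"] for tag in soup_meta.find_all("meta"))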

Upvotes: 2

etna

Reputation: 1083

I had a similar issue and found that, despite my care, I was still storing some BeautifulSoup NavigableString and/or ResultSet objects, which kept the soup in memory, as you already noted. I'm not sure whether both lines below are needed (I'll let you try), but I remember that extracting the text this way fixed the problem:

ls_result = [unicode(x) for x in soup_bloc.findAll(text = True)]
str_result = unicode(soup_bloc.text)
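On Python 3 the same idea would presumably use str() instead of unicode(); a minimal sketch, assuming soup_bloc is the same object as above:

# Copy the text out as plain str so no NavigableString (and therefore
# no reference back to the tree) is kept around.
ls_result = [str(x) for x in soup_bloc.find_all(text=True)]
str_result = str(soup_bloc.text)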

Upvotes: 1
