Stanislav Pavlovič

Reputation: 259

BeautifulSoup memory leak

I am experiencing an ugly memory leak. I create an object that wraps a BeautifulSoup tree, then process it via the object's own methods. I do this for ~2000 XML files. Performance degrades steadily, and after about half of the files the program stops with a MemoryError. I tried to solve it by calling soup.decompose() in __del__ and forcing gc.collect() after processing each file.

from bs4 import BeautifulSoup, SoupStrainer

class FloorSoup:
    def __init__(self, f_id):
        # Parse only the <beacons> and <hint> elements of the floor file.
        only_needed = SoupStrainer(["beacons", 'hint'])
        try:
            self.f_soup = BeautifulSoup(open("Data/xmls/floors/floors_" + f_id + ".xml", encoding='utf8'), "lxml", parse_only=only_needed)
        except FileNotFoundError:
            print("File: Data/xmls/floors/floors_" + f_id + ".xml not found")

    def __del__(self):
        # Attempt to tear the tree down when the wrapper is deleted.
        self.f_soup.decompose()

    def find_floor_npcs(self):
        found_npcs = set()
        for npc in self.f_soup.find_all(text="npc"):
            found_npcs.add(npc.parent.parent.values.string)
        return found_npcs

    def find_floor_hints(self):
        hint_ids = set()
        print("Finding hints in file")
        for hint in self.f_soup.find_all('hint'):
            hint_ids.add(hint.localization.string)
        return hint_ids

Relevant part of the code I am using to create the object and call the methods:

for q in questSoup.find_all('quest'):
    gc.collect()
    ql = find_q_line(q)
    floors = set()
    for f in set(q.find_all('location_id')):
        if f.string not in skip_loc:
            floor_soup = FloorSoup(f.string)
            join_dict(string_by_ql, ql, floor_soup.find_floor_npcs())
            join_dict(string_by_ql, ql, floor_soup.find_floor_hints())
            del floor_soup
        else:
            print("Skipping location " + f.string)

By disabling the find_floor_hints method, I was able to remove the memory leak almost entirely (or at least to the point where its effects are negligible). I therefore suspect that the problem lies in that particular method.
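
One way to check a suspicion like this is to measure how much memory stays allocated while a method's results are still being held. The sketch below uses only the standard library's tracemalloc; the helper name retained_bytes and the repeats parameter are made up for illustration and are not part of the code above.

```python
import gc
import tracemalloc


def retained_bytes(work, repeats=50):
    """Call work() `repeats` times, keep every result alive, and return
    the net growth in traced memory (in bytes) while they are held."""
    gc.collect()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    kept = [work() for _ in range(repeats)]  # results stay referenced here
    gc.collect()  # drop anything that is merely waiting for the collector
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del kept
    return after - before
```

Passing a callable that builds a FloorSoup and returns the result of find_floor_hints, and then one that returns the result of find_floor_npcs, should show which of the two keeps more memory alive per call. The absolute numbers depend on the interpreter and the input files, so only the comparison is meaningful.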

Any help would be greatly appreciated!

Upvotes: 2

Views: 1324

Answers (1)

Stanislav Pavlovič

Reputation: 259

Referencing this answer, I was able to remove the leak on the find_floor_hints method by using

hint_ids.add(str(hint.localization.contents))

It seems that hint.localization.string returned a NavigableString, which appears to keep some (read: an awful lot of) references to the parse tree alive even after the FloorSoup object is deleted. I am not sure whether this is a bug or a feature, but the workaround works.
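
A minimal sketch of what is going on, assuming a tiny inline document instead of the real floor files (the XML snippet and the hint_42 value are invented, and html.parser is used only so the example does not need lxml): a NavigableString still points back at its parent tag, while a plain str copy does not.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for one of the floor XML files.
xml = "<hint><localization>hint_42</localization></hint>"
soup = BeautifulSoup(xml, "html.parser")

leaky = soup.hint.localization.string  # NavigableString, linked to the tree
plain = str(leaky)                     # plain str copy with no tree links

print(isinstance(leaky, NavigableString))  # the value is a bs4 object
print(leaky.parent.name)                   # it can still reach its parent tag
print(type(plain) is str)                  # the copy is an ordinary string
print(hasattr(plain, "parent"))            # and carries no link to the tree
```

Storing the NavigableString in a long-lived set or dict therefore keeps a path to the rest of the parsed document. Note that contents is a list, so str(hint.localization.contents) stores the string form of that whole list; str(hint.localization.string) would likely be the more direct way to get just the text, assuming the element has a single text child.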

Upvotes: 2
