Reputation: 259
I am running into an ugly memory leak. I create an object that wraps a BeautifulSoup tree and then process it through the object's own methods. I do this for ~2000 XML files. After about half of them have been processed, the program dies with a MemoryError, and performance degrades steadily up to that point. I tried to fix it by calling soup.decompose() in __del__ and by forcing gc.collect() after processing each file.
class FloorSoup:
    def __init__(self, f_id):
        only_needed = SoupStrainer(["beacons", 'hint'])
        try:
            self.f_soup = BeautifulSoup(open("Data/xmls/floors/floors_" + f_id + ".xml", encoding='utf8'), "lxml", parse_only = only_needed)
        except (FileNotFoundError):
            print("File: Data/xmls/floors/floors_" + f_id + ".xml not found")

    def __del__(self):
        self.f_soup.decompose()

    def find_floor_npcs(self):
        found_npcs = set()
        for npc in self.f_soup.find_all(text="npc"):
            found_npcs.add(npc.parent.parent.values.string)
        return found_npcs

    def find_floor_hints(self):
        hint_ids = set()
        print("Finding hints in file")
        for hint in self.f_soup.find_all('hint'):
            hint_ids.add(hint.localization.string)
        return hint_ids
Relevant part of the code I am using to create the object and call the methods:
for q in questSoup.find_all('quest'):
    gc.collect()
    ql = find_q_line(q)
    floors = set()
    for f in set(q.find_all('location_id')):
        if f.string not in skip_loc:
            floor_soup = FloorSoup(f.string)
            join_dict(string_by_ql, ql, floor_soup.find_floor_npcs())
            join_dict(string_by_ql, ql, floor_soup.find_floor_hints())
            del floor_soup
        else:
            print("Skipping location " + f.string)
When I disable the find_floor_hints method, the memory leak disappears almost entirely (or at least shrinks to the point where its effects are negligible). So I suspect that the problem lies in that particular method.
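For anyone who wants to narrow down a leak like this without disabling code by hand, here is a minimal sketch using the standard-library tracemalloc module. The names measure_growth, leaky_step and clean_step are hypothetical stand-ins, not part of my program; in practice the step would be something like "build a FloorSoup and call one method on it".

```python
import gc
import tracemalloc


def measure_growth(step, rounds=50):
    """Return the net number of bytes still allocated after calling step() `rounds` times."""
    gc.collect()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(rounds):
        step()
    gc.collect()  # drop anything that is only kept alive by reference cycles
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# Hypothetical stand-ins for "a method that leaks" and "a method that does not".
retained = []


def leaky_step():
    retained.append(bytearray(10_000))  # the buffer stays reachable after the call


def clean_step():
    bytearray(10_000)  # the buffer is released as soon as the call returns


leak = measure_growth(leaky_step)
no_leak = measure_growth(clean_step)
print("leaky step retained:", leak, "bytes")
print("clean step retained:", no_leak, "bytes")
```

Running each suspect method through the same measurement should show which one keeps memory reachable between iterations.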
Any help would be greatly appreciated!
Upvotes: 2
Views: 1324
Reputation: 259
Referencing this answer, I was able to remove the leak on the find_floor_hints method by using
hint_ids.add(str(hint.localization.contents))
It seems that the former returned a NavigableString, which keeps a reference to the parse tree it came from, so some (read: an awful lot of) memory stays reachable even after the FloorSoup object is deleted. I am not exactly sure whether it is a bug or a feature, but converting to a plain str works.
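The difference between the two kinds of string can be seen in isolation. This is a minimal sketch, not my actual code: the XML snippet is made up, and it uses the built-in "html.parser" instead of "lxml" only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up document that mimics the <hint><localization> structure.
xml = "<floor><hint><localization>hint_001</localization></hint></floor>"
soup = BeautifulSoup(xml, "html.parser")

raw = soup.hint.localization.string  # what the original method stored in the set
plain = str(raw)                     # a plain Python string with the same text

is_navigable = isinstance(raw, NavigableString)
keeps_tree = raw.parent is soup.hint.localization  # still linked into the parse tree
plain_is_str = type(plain) is str                  # no link back to the tree

# str() on .contents stringifies the whole list, brackets included.
as_list_text = str(soup.hint.localization.contents)

print(is_navigable, keeps_tree, plain_is_str)
print(as_list_text)
```

Note that str(hint.localization.contents) stores the text of the whole list (with brackets), which was fine for my purposes. If only the text itself is needed, str(hint.localization.string) should break the link to the tree in the same way.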
Upvotes: 2