The program is quite simple: it recursively descends into directories and extracts an element from each file. There are about 1,000 directories, each with roughly 200 files of about 0.5 MB. After running for a while, the script consumes about 2.5 GB of memory, which is completely unacceptable; it is not the only process on the machine, so it can't eat up everything. I cannot understand why it doesn't release the memory. An explicit del doesn't help. Are there any techniques to consider?
from lxml import etree
import os

res = set()
for root, dirs, files in os.walk(basedir):
    for i in files:
        tree = etree.parse(os.path.join(root, i), parser)  # parser is created elsewhere
        for i in tree.xpath("//a[@class='ctitle']/@href"):
            res.add(i)
        del tree
Upvotes: 4
Views: 1059
You're keeping references to elements of the tree: each string that xpath() returns here is an _ElementUnicodeResult, lxml's "smart string", and it keeps a reference to its parent element. That reference alone prevents the whole tree from being garbage collected.
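You can see the back-reference directly; here is a minimal sketch (the one-element document is made up purely for illustration):

from lxml import etree

# Hypothetical tiny document, just to show the smart-string behaviour.
root = etree.fromstring('<div><a class="ctitle" href="/x">title</a></div>')
href = root.xpath("//a[@class='ctitle']/@href")[0]

print(type(href))            # <class 'lxml.etree._ElementUnicodeResult'>
print(href.getparent().tag)  # 'a' -- the string still points back into the tree

As long as such a result sits in your res set, the element (and through it the whole parsed document) stays reachable.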
Try converting each result to a plain string and storing that:
from lxml import etree
import os

titles = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        for title in tree.xpath("//a[@class='ctitle']/@href"):
            titles.add(str(title))  # str() drops the reference back into the tree
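Alternatively, lxml can hand back ordinary strings directly: compiling the expression with etree.XPath(..., smart_strings=False) disables the parent back-reference at the source. A sketch of the same loop using that option (basedir and parser are assumed to be defined as in your question):

from lxml import etree
import os

# Compile the expression once, with smart strings disabled; matches come back
# as plain str objects that hold no link to the parsed tree.
find_hrefs = etree.XPath("//a[@class='ctitle']/@href", smart_strings=False)

titles = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        titles.update(find_hrefs(tree))

This also saves a str() call per match, and compiling the XPath once rather than re-evaluating the string expression per file is slightly faster across thousands of documents.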
Upvotes: 4