aikipooh

Reputation: 235

Python/lxml is eating too much memory

The program is quite simple: it recursively descends into directories and extracts one element from each file. There are about 1,000 directories, each containing roughly 200 files of about 0.5 MB. After running for a while the script consumes about 2.5 GB of memory, which is completely unacceptable, since it is not the only process on the machine. I cannot understand why it doesn't release the memory; an explicit del doesn't help. Are there any techniques I should consider?


from lxml import etree
import os

# basedir and parser are defined earlier in the full script
res = set()
for root, dirs, files in os.walk(basedir):
    for fname in files:
        tree = etree.parse(os.path.join(root, fname), parser)
        # collect the href attribute of every matching anchor
        for href in tree.xpath("//a[@class='ctitle']/@href"):
            res.add(href)
        del tree  # explicit del, but the memory is not released

Upvotes: 4

Views: 1059

Answers (1)

Open AI - Opting Out

Reputation: 24133

You're keeping references to the results of the XPath query, which are _ElementUnicodeResult objects. These are lxml's "smart strings": each one keeps a reference to the element it came from, and through it to the whole tree. Storing them in your set therefore prevents every parsed tree from being garbage collected, even after del tree.
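
You can see the back-reference for yourself with a minimal sketch (the inline HTML snippet here is made up for illustration):

from lxml import etree

root = etree.fromstring('<div><a class="ctitle" href="/x">x</a></div>')
href = root.xpath("//a[@class='ctitle']/@href")[0]
print(type(href))            # <class 'lxml.etree._ElementUnicodeResult'>
print(href.getparent().tag)  # 'a' -- the result still points into the tree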

Try converting each result to a plain string and storing that instead:

from lxml import etree
import os

# basedir and parser as in the question
titles = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        for title in tree.xpath("//a[@class='ctitle']/@href"):
            # str() copies the value into a plain string with no
            # reference back into the tree, so the tree can be freed
            titles.add(str(title))
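
Alternatively, the xpath() method accepts a smart_strings=False keyword that makes lxml return plain strings in the first place, so nothing holds on to the tree. A minimal sketch, again with a made-up snippet:

from lxml import etree

root = etree.fromstring('<div><a class="ctitle" href="/x">x</a></div>')
hrefs = root.xpath("//a[@class='ctitle']/@href", smart_strings=False)
print(type(hrefs[0]))  # <class 'str'> -- no back-reference into the tree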

Upvotes: 4
