Matt Darling

Reputation: 61

Python - Generator function resets between calls?

I'm parsing a language dictionary, stored as an XML file, with ElementTree's iterparse function. I'm filtering the parsed events with a generator function, and I'm misunderstanding the order of execution somewhere, because I end up with a duplicate entry. Here's some setup code (this is actually happening inside a function, but the other details don't matter):

import xml.etree.cElementTree as ET
dictionary = iter(ET.iterparse("../dictionaries/language_name.xml", 
                  events=("start", "end"))) 
#We can discard the original iterable, I think

Filtering

Then I have a function that receives the iterator and filters it (ignore the global variable, it's just for debugging the problem):

def get_entries(iterparsed):
    global yielded
    root = next(iterparsed)[1]  # iterparse yields (event, element) pairs
    yield root

    for event, elem in iterparsed:
        if event == "end" and elem.tag == "entry":
            yielded += 1
            print("Num yielded:", yielded)
            print("Yielding", ET.tostring(elem, encoding="utf-8"))
            yield elem

Processing

Then I use it like this (again, temporary global for debugging):

root = next(get_entries(dictionary))
for elem in get_entries(dictionary):
    global received
    received += 1
    print("Num received:", received)
    print("I got", ET.tostring(elem, encoding="utf-8"))
    raw_input("Continue? ") 
    #I only yield the first item once, but receive it twice? :(
    process_entry(elem) #Defined elsewhere, adds a <sgmtd> node to each entry
    root.clear() #Clears the processed children of root node

Output

If I run through everything, yielded = 9050 while received = 9051. And the problematic output:

Num received: 1
I got <entry><form>aː</form><ortho>a:</ortho><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

Continue? 
Num yielded: 1
Yielding <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

Num received: 2
I got <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

Continue?
Num yielded: 2
Yielding <entry><form>aːčáx</form><ortho>a:cháj</ortho><pos>n</pos><sense><def><en>axe</en><es>hacha</es></def></sense></entry>

Num received: 3
I got <entry><form>aːčáx</form><ortho>a:cháj</ortho><pos>n</pos><sense><def><en>axe</en><es>hacha</es></def></sense></entry>

Continue?

The question

Now, I've checked, and elem isn't defined before the loop starts. And no, there aren't two identical elements at the start of the file. After that first "I got" line, everything seems to work the way I would expect: things are yielded, then received (e.g. a:cháj axe is yielded first, then received).

Even more oddly, that first element is processed before being yielded - without being cleared at the end of the for loop. The first time it's "received", it has no <sgmtd> node. When it's "yielded" for the first time, it already has a <sgmtd> node, indicating that it's been processed. Then it's received again, and (despite a line saying if not elem.find("sgmtd"): elem.insert(2, segmented_form)) a second <sgmtd> node is added and written out to a file. So then my output file winds up with:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon>
<entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
<entry><form>aː</form><ortho>a:</ortho><sgmtd /><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

So what am I misunderstanding here? How is it that an item is "received" from the generator function without any of the code prior to the yield statement being executed?

It turns out that changing the if not elem.find("sgmtd") line to if elem.find("sgmtd") is None stops the duplicate item from being processed. An Element with no children, such as an empty <sgmtd /> node, evaluates as false, so not elem.find("sgmtd") was true even though the node was already there. But I'd still like to know why the duplicate showed up in the first place!
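A minimal sketch of that truthiness pitfall, using an invented entry rather than the real dictionary data (the helper name needs_sgmtd is hypothetical). Element.find returns None only when nothing matches. A matching element with no children has length 0, and an Element's truth value has historically followed its length (newer Python versions deprecate truth-testing an Element), so the sketch checks len() and "is None" explicitly instead of relying on bool():

```python
import xml.etree.ElementTree as ET

# Invented entry that already contains an empty <sgmtd /> node.
entry = ET.fromstring("<entry><form>a</form><sgmtd /></entry>")

found = entry.find("sgmtd")    # an Element: the node exists
missing = entry.find("nope")   # None: no such child


def needs_sgmtd(elem):
    """Hypothetical helper: True only when elem has no <sgmtd> child."""
    return elem.find("sgmtd") is None


print(found is None)        # whether the lookup failed
print(len(found))           # number of children of the <sgmtd /> node
print(missing is None)
print(needs_sgmtd(entry))
```

Comparing against None answers "does the node exist?", which is the question the original check was meant to ask.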

Upvotes: 1

Views: 249

Answers (2)

Matt Darling

Reputation: 61

Both @Chad Miller and @Jochen Ritzel pointed out that I wasn't counting the root element that I was yielding. That was intentional - what I thought would happen is that my generator function would never reset, in the same way that generator objects don't. So when I started to loop with for elem in get_entries(dictionary), I figured the root element would already be consumed.

However, if I add a print statement before yielding the root element, it gets printed twice. The duplication of data I was seeing was caused by elem.insert(2, segmented_form) being called on the root, where segmented_form involves using elem.find (thus searching its children) and grabbing the first element of the tree.
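To illustrate how a single entry can turn up twice, here is a small sketch with an in-memory document (the XML is invented for the example, not taken from the real dictionary). With events=("start", "end"), iterparse reports each element once when it opens and once when it closes, and both events carry the same Element object, so anything inserted into it after the "start" event is still there at the "end" event:

```python
import io
import xml.etree.ElementTree as ET

# Invented stand-in for the dictionary file.
xml_bytes = b"<lexicon><entry><form>a</form></entry></lexicon>"

events = list(ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")))
summary = [(event, elem.tag) for event, elem in events]

# Pick out the two events that refer to <entry>.
entry_start = next(e for ev, e in events if ev == "start" and e.tag == "entry")
entry_end = next(e for ev, e in events if ev == "end" and e.tag == "entry")

print(summary)
print(entry_start is entry_end)  # same object at both events
```

The second event in the stream is the "start" of the first entry, which is what a fresh next(iterparsed) picks up once the root's "start" event has been consumed.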

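So: the reason I was seeing the duplicates is that calling a generator function a second time doesn't resume the earlier generator. Each call to get_entries(dictionary) creates a new generator object that starts from the top of the function body, whereas a single generator object keeps its place between next() calls. Lesson learned!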
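A stripped-down sketch of the difference, with a plain list iterator standing in for the iterparse stream (the list contents and the ("root", ...) / ("entry", ...) tags are made up for illustration). Calling the generator function twice runs its setup code twice against the shared iterator, while creating the generator once and reusing it runs the setup only once:

```python
def get_entries(source):
    # This line runs each time a *new* generator from this function starts.
    first = next(source)
    yield ("root", first)
    for item in source:
        yield ("entry", item)


# Two separate calls: each builds its own generator over the shared iterator.
shared = iter(["lexicon", "a", "b"])
root = next(get_entries(shared))    # first generator takes "lexicon"
looped = list(get_entries(shared))  # second generator takes "a" as its "root"

# One call, reused: the setup code runs once.
shared = iter(["lexicon", "a", "b"])
entries = get_entries(shared)
root_once = next(entries)
looped_once = list(entries)

print(looped)
print(looped_once)
```

In the question's code, the equivalent change would be to assign get_entries(dictionary) to a variable once, call next() on it for the root, and loop over that same variable.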

Upvotes: 2

Free Monica Cellio

Reputation: 2290

It looks like your counter in your filter is wrong.

Your counter sits inside the for loop and only increments when that loop yields something. Nothing increments it for the yield root at the start of your generator, so that first item is yielded without being counted.

Upvotes: 1
