Aaron DeVore

Reputation: 73

Multiple tag names in lxml's iterparse?

Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.

Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.

Upvotes: 2

Views: 3417

Answers (2)

Andrei

Reputation: 486

I know I'm late to the game, but maybe someone else needs help with the same issue. The tag argument of iterparse accepts a sequence of tag names, so this code will generate events for both Tag1 and Tag2 tags:

etree.iterparse(io.BytesIO(xml), events=('end',), tag=('Tag1', 'Tag2'))
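
For anyone who wants to try this end to end, here is a minimal sketch built around that one-liner. The sample document, the helper name iter_tags, and the call to elem.clear() are my own illustrative assumptions, not part of the answer above; it assumes an lxml version whose iterparse accepts a sequence for tag.

```python
import io
from lxml import etree

# Illustrative sample input (an assumption, not from the question).
xml = b"<root><Tag1>a</Tag1><Other>skip</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"

def iter_tags(source, tags):
    """Yield (tag, text) for each element whose tag is in `tags`.

    `iter_tags` is a hypothetical helper name; the filtering itself is
    done by lxml through the `tag` argument of iterparse.
    """
    for _event, elem in etree.iterparse(source, events=("end",), tag=tags):
        yield elem.tag, elem.text
        elem.clear()  # drop the element's content once it has been handled

found = list(iter_tags(io.BytesIO(xml), ("Tag1", "Tag2")))
print(found)
```

Elements such as Other are still parsed, they just never produce an event, so the loop body only sees the tags you asked for.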

Upvotes: 10

Peter Gibson

Reputation: 19564

I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:

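from lxml.etree import iterparse

# file_like_object and exit_condition are placeholders for your own input and stop test
for event, elem in iterparse(file_like_object):  # default events are 'end' only
    if elem.tag in ('td', 'div'):
        # reached the end of an interesting tag
        print('found:', elem.tag)
        # possibly quit early to prevent further parsing
        if exit_condition:
            break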

iterparse generates events on the fly during parsing, so you're only reading as much data as is required. However, there's no way you can skip reading elements during parsing, as you wouldn't know how far to skip. In the above, we just ignore tags that we're not interested in.
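
To make the "only reading as much data as is required" point concrete, here is a rough sketch that counts how many bytes the parser actually requests before the loop breaks. Note that it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in so that it runs without lxml; lxml.etree.iterparse accepts the same kind of loop, though its read chunk size may differ. CountingReader and the sample document are hypothetical, made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

class CountingReader:
    """Hypothetical file-like wrapper that records how many bytes were read."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.bytes_read += len(chunk)
        return chunk

# A large document: one <td> near the start, followed by many filler rows.
doc = b"<table><td>first</td>" + b"<tr>filler</tr>" * 100_000 + b"</table>"

source = CountingReader(doc)
first_td = None
for event, elem in ET.iterparse(source):
    if elem.tag in ("td", "div"):
        first_td = elem.text
        break  # stop parsing; the rest of the input is never requested

print(first_td, source.bytes_read, len(doc))
```

The parser pulls the input in chunks, so breaking out early leaves most of the document unread, which is what matters when each read is expensive.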

As you may already know: don't use XML parsers for HTML. Edit: it turns out that lxml does support HTML parsing, but you should check the docs to see to what extent.

Upvotes: 4
