root

Reputation: 80406

How to loop through Beautiful Soup elements to get attribute values

I need to iterate over Beautiful Soup elements and get their attribute values. Given this XML document:

<?xml version="1.0" encoding="UTF-8"?>

<Document>
    <Page x1="71" y1="120" x2="527" y2="765" type="page" chunkCount="25"
        pageNumber="1" wordCount="172">
        <Chunk x1="206" y1="120" x2="388" y2="144" type="unclassified">
            <Word x1="206" y1="120" x2="214" y2="144" font="Times-Roman" style="font-size:22pt">K</Word>
            <Word x1="226" y1="120" x2="234" y2="144" font="Times-Roman" style="font-size:22pt">O</Word>
        </Chunk>
     </Page>
</Document>

I would like to get the x1 values of the "Word" elements (206, 226). Help much appreciated!

EDIT: I have tried:

for i in soup.page.chunk:
    i.word['x1']

which raises an error:

File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'word'

whereas:

soup.page.chunk.word['x1']

works correctly, and:

for i in soup.page.chunk:
    i.findNext(text=True)

gets the text from the element.

Upvotes: 1


Answers (1)

gorlum0

Reputation: 1455

This seems to work, although it is not that elegant:

for word in soup.page.chunk.find_all('word'):
    print(word['x1'])

Nested find_all calls should also work, but a CSS-style select (soupselect, or the selectors from lxml) is probably a better choice.
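As a rough sketch of the CSS-style approach: this uses the select method that ships with bs4 rather than soupselect or lxml, which is my substitution and not something tested against the asker's BeautifulSoup 3 setup. The trimmed XML string and the "html.parser" choice (which lowercases tag names) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the document from the question (assumed for illustration).
XML = """<Document>
    <Page x1="71" y1="120" x2="527" y2="765" type="page">
        <Chunk x1="206" y1="120" x2="388" y2="144" type="unclassified">
            <Word x1="206" y1="120" x2="214" y2="144">K</Word>
            <Word x1="226" y1="120" x2="234" y2="144">O</Word>
        </Chunk>
    </Page>
</Document>"""

# html.parser lowercases tag names, so the selector uses lowercase names.
soup = BeautifulSoup(XML, "html.parser")

# "chunk > word" matches word tags that are direct children of a chunk tag.
x1_values = [word["x1"] for word in soup.select("chunk > word")]
print(x1_values)
```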

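If I'm not mistaken, soup.page.chunk is a single node (a soup tag). Iterating over it directly walks all of its children, including the whitespace text nodes between the Word elements, which is why your loop hits a NavigableString. To iterate over only the Word tags, call find_all.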
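To make the difference concrete, here is a minimal sketch, assuming bs4 with the built-in "html.parser" (which lowercases tag names) and a trimmed copy of the XML from the question. It compares what plain iteration over the chunk yields with what find_all returns; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the document from the question (assumed for illustration).
XML = """<Document>
    <Page x1="71" y1="120" x2="527" y2="765" type="page">
        <Chunk x1="206" y1="120" x2="388" y2="144" type="unclassified">
            <Word x1="206" y1="120" x2="214" y2="144">K</Word>
            <Word x1="226" y1="120" x2="234" y2="144">O</Word>
        </Chunk>
    </Page>
</Document>"""

soup = BeautifulSoup(XML, "html.parser")
chunk = soup.page.chunk

# Iterating a tag walks its direct children, and the whitespace between
# the Word elements shows up as NavigableString children.
child_types = [type(child).__name__ for child in chunk]

# find_all returns only the matching tags, so attribute access is safe.
x1_values = [word["x1"] for word in chunk.find_all("word")]

print(child_types)
print(x1_values)
```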

Update: a different approach is to call find_all('word') on the whole document and then filter on a condition such as word.parent.name == 'smth'.
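A small sketch of that filtering idea, again assuming bs4 with "html.parser". The Note element is hypothetical and was added only so that there is a Word outside a Chunk to filter out.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the Note element is not in the question's XML.
XML = """<Document>
    <Chunk type="unclassified">
        <Word x1="206">K</Word>
        <Word x1="226">O</Word>
    </Chunk>
    <Note>
        <Word x1="300">x</Word>
    </Note>
</Document>"""

soup = BeautifulSoup(XML, "html.parser")

# Collect every word tag, then keep only those whose parent is a chunk.
chunk_x1 = [
    word["x1"]
    for word in soup.find_all("word")
    if word.parent.name == "chunk"
]
print(chunk_x1)
```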

Note: in BeautifulSoup 3 (not bs4), the method is findAll rather than find_all.

Upvotes: 3
