owwoow14

Reputation: 1754

Extracting attributes that are in XML tags with BeautifulSoup4

Sample input:

<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>

I am experimenting with BeautifulSoup to extract information from an XML file into a CSV file. My desired output is:

code1,code2,code3,txt
textA,textB,textC,This is my text.

I have been playing with the sample code below, which I found here. It works for extracting the text of txt, but not for code1, code2, and code3 in the subj tag.

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                subj.code2.text,
                                subj.code3.text,
                                subj.txt.txt))

However, I cannot get it to also recognize the attributes in subj that I want to extract. Any suggestions?

Upvotes: 1

Views: 45

Answers (1)

alecxe

Reputation: 474161

code1, code2 and code3 are not child elements with text; they are attributes of the subj element, so subj.code1 finds nothing.

In order to access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """
   ...: <subj code1="textA" code2="textB" code3="textC">
   ...:     <txt count="1">
   ...:         <txt id="123">
   ...:             This is my text.
   ...:         </txt>
   ...:     </txt>
   ...: </subj>
   ...: """

In [3]: soup = BeautifulSoup(data, "xml")
In [4]: for subj in soup('subj'):
    ...:     print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)])  
['textA', 'textB', 'textC', 'This is my text.']
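
Putting the pieces together, here is a minimal sketch of the whole XML-to-CSV step. The helper names make_soup and write_rows and the in-memory buffer are my own choices for illustration, not part of the question's code. The fallback to "html.parser" is an assumption made only so the sketch still runs when lxml (required by the "xml" parser) is not installed.

```python
import csv
import io

from bs4 import BeautifulSoup, FeatureNotFound

SAMPLE = """
<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>
"""


def make_soup(data):
    """Parse with the "xml" parser, falling back if lxml is unavailable."""
    try:
        return BeautifulSoup(data, "xml")
    except FeatureNotFound:
        # Assumption: html.parser is good enough for this simple sample.
        return BeautifulSoup(data, "html.parser")


def write_rows(data, fhandle):
    """Write one CSV row per <subj> element: three attributes plus its text."""
    writer = csv.writer(fhandle)
    writer.writerow(("code1", "code2", "code3", "txt"))
    for subj in make_soup(data).find_all("subj"):
        writer.writerow((subj["code1"],
                         subj["code2"],
                         subj["code3"],
                         subj.get_text(strip=True)))


# An in-memory buffer stands in for open('sample.csv', 'w', newline='').
buffer = io.StringIO()
write_rows(SAMPLE, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that there is a single loop over the subj elements. The inner "for x in subj" loop from the question would write one row per child node of subj, so it is dropped here.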

You can also use .get() to provide a default value if an attribute is missing:

subj.get('code1', 'Default value for code1')
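
To illustrate the difference, here is a small sketch comparing the two access styles on an element that lacks code2. The one-line document and the "n/a" default are made up for the example, and "html.parser" is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# A made-up element that has code1 but no code2.
soup = BeautifulSoup('<subj code1="textA"><txt>hi</txt></subj>', "html.parser")
subj = soup.find("subj")

present = subj.get("code1", "n/a")  # attribute exists, so its value is returned
missing = subj.get("code2", "n/a")  # attribute is absent, so the default is returned

# Dictionary-style access raises KeyError for an absent attribute.
try:
    subj["code2"]
    raised = False
except KeyError:
    raised = True

print(present, missing, raised)
```

Using .get() is convenient when some subj elements in the file may omit an attribute and you would rather write a placeholder to the CSV than stop on an exception.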

Upvotes: 1
