Extracting attributes that are in XML tags with BeautifulSoup4

Question

Sample input:

    <subj code1="textA" code2="textB" code3="textC">
        <txt>This is my text.</txt>
    </subj>

I am experimenting with BeautifulSoup to extract information from an XML file into a CSV file. My desired output is:

code1,code2,code3,txt
textA,textB,textC,This is my text.

I have been playing with this sample code, which I found online. It works for extracting txt, but not for code1, code2 and code3 in the subj tag.

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                subj.code2.text,
                                subj.code3.text,
                                subj.txt.txt))

However, I cannot get it to also recognize the attributes in subj that I want to extract. Any suggestions?

alecxe · Accepted Answer

code1, code2 and code3 are not texts, they are attributes.

In order to access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))
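Combined with the CSV code from the question, this could look like the sketch below. The wrapper element names (`root`, `section`) and the helper name `subjects_to_csv` are made up for illustration, and an in-memory buffer stands in for the output file. Note that the "xml" parser needs lxml to be installed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample: the wrapper element names (root, section) are made up.
SAMPLE_XML = """
<root>
    <section>
        <subj code1="textA" code2="textB" code3="textC">
            <txt>This is my text.</txt>
        </subj>
    </section>
</root>
"""


def subjects_to_csv(xml_text, out):
    """Write one CSV row per <subj> element to the file-like object `out`."""
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" parser requires lxml
    writer = csv.writer(out)
    writer.writerow(("code1", "code2", "code3", "txt"))
    for subj in soup.find_all("subj"):
        writer.writerow((
            subj["code1"],               # attributes: dictionary-style access
            subj["code2"],
            subj["code3"],
            subj.get_text(strip=True),   # text of the element and its children
        ))


buffer = io.StringIO()
subjects_to_csv(SAMPLE_XML, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a real file handle, for example `open('sample.csv', 'w', newline='')`; the csv module documentation recommends `newline=''` for files handed to `csv.writer`. The inner `for x in subj` loop from the question is dropped here, since it would write the same row once per child node.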

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """
   ...: <subj code1="textA" code2="textB" code3="textC">
   ...:     <txt>This is my text.</txt>
   ...: </subj>
   ...: """

In [3]: soup = BeautifulSoup(data, "xml")

In [4]: for subj in soup('subj'):
   ...:     print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)])
   ...:
['textA', 'textB', 'textC', 'This is my text.']
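For contrast, here is a small sketch of why `subj.code1.text` from the question fails. Dot access on a tag searches for a child element with that name, while square brackets read an attribute of the tag itself. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

xml = '<subj code1="textA" code2="textB" code3="textC"><txt>This is my text.</txt></subj>'
subj = BeautifulSoup(xml, "xml").subj

# Dot access looks for a child element with that name.
child_lookup = subj.code1   # there is no <code1> child, so this is None
text_child = subj.txt       # <txt> is a real child element

# Dictionary-style access reads an attribute of the tag itself.
attr_value = subj["code1"]

print(child_lookup)
print(text_child.text)
print(attr_value)
print(subj.attrs)           # all attributes of the tag as a dict
```

Since `subj.code1` is `None`, calling `.text` on it raises an `AttributeError`, which matches the behaviour described in the question.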

You can also use .get() to provide a default value if an attribute is missing:

subj.get('code1', 'Default value for code1')
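As a rough illustration of the difference, the sketch below uses a made-up input in which the second `subj` has no `code3` attribute. The wrapper element `items` and the empty-string default are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second <subj> has no code3 attribute.
xml = """
<items>
    <subj code1="textA" code2="textB" code3="textC"><txt>First</txt></subj>
    <subj code1="textD" code2="textE"><txt>Second</txt></subj>
</items>
"""
soup = BeautifulSoup(xml, "xml")

rows = []
for subj in soup.find_all("subj"):
    rows.append([
        subj.get("code1", ""),
        subj.get("code2", ""),
        subj.get("code3", ""),   # falls back to "" when the attribute is absent
        subj.get_text(strip=True),
    ])

# Square-bracket access on the second element raises KeyError instead.
try:
    soup.find_all("subj")[1]["code3"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(rows)
print(missing_raises)
```

Using `.get()` keeps every CSV row the same width even when some elements lack an attribute.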
