Reputation: 1754
Sample input:
<subj code1="textA" code2="textB" code3="textC">
<txt count="1">
<txt id="123">
This is my text.
</txt>
</txt>
</subj>
I am experimenting with BeautifulSoup to extract information from an XML file into a CSV file. My desired output is:
code1,code2,code3,txt
textA,textB,textC,This is my text.
I have been playing with this sample code, which I found here:
It works for extracting txt, but not for code1, code2 and code3 in the tag subj.
if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                 subj.code2.text,
                                 subj.code3.text,
                                 subj.txt.txt))
However, I cannot get it to also recognize the attributes in subj that I want to extract.
Any suggestions?
Upvotes: 1
Views: 45
Reputation: 474161
code1, code2 and code3 are not text nodes; they are attributes.
In order to access them, treat an element as a dictionary:
writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))
Demo:
In [1]: from bs4 import BeautifulSoup
In [2]: data = """
...: <subj code1="textA" code2="textB" code3="textC">
...: <txt count="1">
...: <txt id="123">
...: This is my text.
...: </txt>
...: </txt>
...: </subj>
...: """
In [3]: soup = BeautifulSoup(data, "xml")
In [4]: for subj in soup('subj'):
...:     print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)])
['textA', 'textB', 'textC', 'This is my text.']
You can also use .get() to provide a default value if an attribute is missing:
subj.get('code1', 'Default value for code1')
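Putting the pieces together, here is a minimal sketch of the full XML-to-CSV step. The helper names subj_rows and to_csv are hypothetical, not part of BeautifulSoup, and the sketch makes two assumptions: it writes to an in-memory buffer instead of sample.csv so it can be run as-is, and it uses the built-in "html.parser" so that lxml is not required (pass "xml" as in the demo if lxml is installed). Missing attributes fall back to an empty string via .get().

```python
import csv
import io

from bs4 import BeautifulSoup

SAMPLE = """
<subj code1="textA" code2="textB" code3="textC">
<txt count="1">
<txt id="123">
This is my text.
</txt>
</txt>
</subj>
"""

# Attribute names to pull from each <subj> element.
FIELDS = ("code1", "code2", "code3")


def subj_rows(markup, parser="html.parser"):
    """Yield one (code1, code2, code3, text) tuple per <subj> element."""
    soup = BeautifulSoup(markup, parser)
    for subj in soup.find_all("subj"):
        # .get() returns the fallback ("" here) when an attribute is absent.
        codes = tuple(subj.get(name, "") for name in FIELDS)
        yield codes + (subj.get_text(strip=True),)


def to_csv(markup):
    """Return the CSV text: a header row followed by one row per <subj>."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS + ("txt",))
    writer.writerows(subj_rows(markup))
    return buffer.getvalue()


print(to_csv(SAMPLE))
```

To write to disk instead, open the file with open('sample.csv', 'w', newline='') and pass that handle to csv.writer in place of the buffer.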
Upvotes: 1