Reputation: 235
I am trying to parse an XML file using BeautifulSoup. Consider a sample input XML file as follows:
<DOC>
<DOCNO>1</DOCNO>
....
</DOC>
<DOC>
<DOCNO>2</DOCNO>
....
</DOC>
...
This file contains 130 <DOC> tags. However, when I tried to parse it using BeautifulSoup's findAll function, it retrieved a random number of tags (usually between 15 and 25) but never 130. The code I used was as follows:
from bs4 import BeautifulSoup
z = open("filename").read()
soup = BeautifulSoup(z, "lxml")
print len(soup.findAll('doc'))
#more code involving manipulation of results
Can anybody tell me what I am doing wrong? Thanks in advance!
Upvotes: 1
Views: 228
Reputation: 1121346
You are telling BeautifulSoup to use the HTML parser provided by lxml. If you have an XML document, you should stick to the XML parser option:
soup = BeautifulSoup(z, 'xml')
otherwise the parser will attempt to 'repair' the XML to fit HTML rules. XML parsing in BeautifulSoup is also handled by the lxml library.
Note that XML is case sensitive, so you'll need to search for the DOC element now.
For XML documents it may be that the ElementTree API offered by lxml is more productive; it supports XPath queries for example, while BeautifulSoup does not.
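As a rough sketch of what the ElementTree API looks like: the snippet below imports lxml when it is installed and otherwise falls back to the standard library's xml.etree.ElementTree, which exposes the same core calls. The <ROOT> wrapper and the inline SAMPLE string are assumptions made for illustration, not part of your file, and the path used here stays within the subset both libraries understand (lxml's own xpath() method accepts full XPath).

```python
# Prefer lxml when it is installed; the standard library offers the same core API.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical well-formed sample: a single <ROOT> element wraps the <DOC> records.
SAMPLE = """<ROOT>
<DOC><DOCNO>1</DOCNO></DOC>
<DOC><DOCNO>2</DOCNO></DOC>
</ROOT>"""

root = etree.fromstring(SAMPLE)

# Path expression: every DOCNO that is a direct child of a DOC, at any depth.
doc_numbers = [el.text for el in root.findall(".//DOC/DOCNO")]

# Tag names are case sensitive, so the lowercase name matches nothing.
lowercase_matches = root.findall(".//doc")

print(doc_numbers)             # ['1', '2']
print(len(lowercase_matches))  # 0
```

The lowercase lookup coming back empty is the same case-sensitivity effect mentioned above for the DOC element.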
However, from your sample it looks like there is no single top-level element; it is as if your file consists of a whole series of XML documents. That makes your input invalid as XML, since a well-formed document has exactly one root element, and a parser may parse only the first element as the top-level document and ignore the rest.
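One possible workaround, sketched here under assumptions, is to wrap the file contents in a synthetic root element before parsing. The helper name parse_fragments and the wrapper tag ROOT are made up for this example, and the approach assumes the file has no XML declaration or DOCTYPE at the top and is otherwise well-formed. It uses the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def parse_fragments(text, wrapper="ROOT"):
    """Parse a string that holds several sibling top-level elements.

    Hypothetical helper: wraps the text in one synthetic root element so
    the result is a well-formed document. Assumes the text carries no
    XML declaration or DOCTYPE.
    """
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, text))


# Stand-in for the contents of the real file.
fragments = "<DOC><DOCNO>1</DOCNO></DOC>\n<DOC><DOCNO>2</DOCNO></DOC>\n"

# Without the wrapper, a strict parser rejects the second top-level element.
try:
    ET.fromstring(fragments)
    strict_parse_failed = False
except ET.ParseError:
    strict_parse_failed = True

root = parse_fragments(fragments)
docs = root.findall("DOC")

print(strict_parse_failed)                   # True
print(len(docs))                             # 2
print([d.findtext("DOCNO") for d in docs])   # ['1', '2']
```

With the real file you would pass the result of reading it (the z variable in the question) to parse_fragments instead of the inline string.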
Upvotes: 2