Python BeautifulSoup giving different results

I am trying to parse an XML file using BeautifulSoup. Consider a sample input XML file as follows:

<DOC>
<DOCNO>1</DOCNO>
....
</DOC>
<DOC>
<DOCNO>2</DOCNO>
....
</DOC>
...

This file contains 130 <DOC> tags. However, when I try to parse it using BeautifulSoup's findAll function, it retrieves a seemingly random number of tags (usually between 15 and 25), but never 130. The code I used is as follows:

from bs4 import BeautifulSoup

with open("filename") as f:
    z = f.read()
soup = BeautifulSoup(z, "lxml")
print(len(soup.findAll('doc')))
# more code involving manipulation of results

Can anybody tell me what I am doing wrong? Thanks in advance!

Upvotes: 1

Views: 228

Answers (1)

Martijn Pieters

Reputation: 1121346

You are telling BeautifulSoup to use the HTML parser provided by lxml. If you have an XML document, you should stick to the XML parser option:

soup = BeautifulSoup(z, 'xml')

otherwise the parser will attempt to 'repair' the XML to fit HTML rules. XML parsing in BeautifulSoup is also handled by the lxml library.

Note that XML is case sensitive, so you'll now need to search for the DOC element (uppercase) rather than doc.
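
To illustrate the case sensitivity, here is a minimal sketch. The sample string and its ROOT wrapper element are made up for the demonstration (they are not from your file), and it assumes bs4 and lxml are installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the ROOT wrapper is an assumption that makes it well-formed XML.
sample = (
    "<ROOT>"
    "<DOC><DOCNO>1</DOCNO></DOC>"
    "<DOC><DOCNO>2</DOCNO></DOC>"
    "</ROOT>"
)

soup = BeautifulSoup(sample, "xml")  # XML parser, backed by lxml

upper = soup.find_all("DOC")  # tag names keep their original case in XML mode
lower = soup.find_all("doc")  # a lowercase search does not match uppercase tags

print(len(upper), len(lower))
```

With the XML parser, only the uppercase search should find the elements.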

For XML documents it may be that the ElementTree API offered by lxml is more productive; it supports XPath queries for example, while BeautifulSoup does not.
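
As a rough sketch of what that looks like with lxml.etree (again using a made-up sample with a hypothetical ROOT wrapper, not your actual file):

```python
from lxml import etree

# Hypothetical sample; ROOT is an assumed single top-level element.
sample = (
    b"<ROOT>"
    b"<DOC><DOCNO>1</DOCNO></DOC>"
    b"<DOC><DOCNO>2</DOCNO></DOC>"
    b"</ROOT>"
)

root = etree.fromstring(sample)

docs = root.xpath("//DOC")                  # all DOC elements
docnos = root.xpath("//DOC/DOCNO/text()")   # text content of every DOCNO
second = root.xpath("//DOC[DOCNO='2']")     # DOC elements whose DOCNO is '2'

print(len(docs), docnos, len(second))
```

The predicate form in the last query is the kind of selection that takes more manual filtering with BeautifulSoup.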

However, from your sample it looks like there is no single top-level element; it is as if your document consists of a whole series of XML documents instead. This means your input is not well-formed XML, and a parser may parse only the first element as the document and ignore the rest.
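
One possible workaround, if the file really is a plain sequence of <DOC> fragments, is to wrap the text in a synthetic root element before parsing. This is only a sketch: parse_fragments and the ROOT tag name are hypothetical, and it assumes the file has no XML declaration or DOCTYPE at the top and that each fragment is itself well-formed. It uses the standard library's ElementTree so it runs without extra packages:

```python
import xml.etree.ElementTree as ET


def parse_fragments(text, root_tag="ROOT"):
    """Wrap a series of sibling XML fragments in one root element and parse it.

    Hypothetical helper: assumes `text` has no XML declaration or DOCTYPE.
    """
    wrapped = "<{0}>{1}</{0}>".format(root_tag, text)
    return ET.fromstring(wrapped)


# Made-up stand-in for the contents of the input file.
raw = """<DOC>
<DOCNO>1</DOCNO>
</DOC>
<DOC>
<DOCNO>2</DOCNO>
</DOC>"""

root = parse_fragments(raw)
docs = root.findall("DOC")                       # direct children named DOC
numbers = [d.findtext("DOCNO") for d in docs]    # text of each DOCNO

print(len(docs), numbers)
```

The same wrapped string could be passed to BeautifulSoup(wrapped, 'xml') instead, if you prefer to stay with BeautifulSoup.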

Upvotes: 2
