Zhen Sun

Reputation: 867

Parse XML with many roots using BeautifulSoup

I am trying to parse a large XML file downloaded from Google using BS4. However, the file contains many root elements, so the XML parser only parses the first block.

I load the file using the following command:

xml = BeautifulSoup(open("test.xml"), "xml")

The test.xml file looks like the sample below. It has many roots:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-24.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>

.......

The HTML parser can read the full file. However, a typical file of this kind contains over 10k roots, and reading it with the HTML parser is slow and uses all of my memory. Is there a way around this problem?

Any help is appreciated.

Upvotes: 1

Views: 970

Answers (1)

Guy Gavriely

Reputation: 11396

A valid XML file has exactly one root element. Either add a single enclosing root to the file, or tell BeautifulSoup to parse it as HTML (an HTML parser is the default when no parser is named). For example:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(open("test.xml"), "xml")
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<us-patent-grant lang="EN">
1
</us-patent-grant>
>>> BeautifulSoup(open("test.xml"))
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<html><body><p>]&gt;
<us-patent-grant lang="EN">
1
</us-patent-grant>
<us-patent-grant lang="EN">
2
</us-patent-grant>
</p></body></html>
>>> 
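
Since the question is about speed and memory, a third option may be worth trying: split the file into its individual documents and hand each one to the "xml" parser separately, so only one grant is held in memory at a time. The following is a minimal sketch, not a tested solution for the real Google files. It assumes every document begins with an XML declaration at the start of a line, as in the sample in the question, and that lxml is installed for the "xml" parser. The names iter_documents and parse_grants are made up for illustration.

```python
def iter_documents(lines):
    """Yield each concatenated XML document as one bytes object.

    Assumption: a new document starts on any line that begins with
    an XML declaration (b"<?xml"), as in the sample file.
    """
    chunk = []
    for line in lines:
        if line.startswith(b"<?xml") and chunk:
            text = b"".join(chunk)
            if text.strip():  # skip blank lines before the first document
                yield text
            chunk = []
        chunk.append(line)
    text = b"".join(chunk)
    if text.strip():
        yield text


def parse_grants(path):
    """Yield one BeautifulSoup tree per document in the file at `path`."""
    # Imported here so the splitter above needs only the standard library.
    from bs4 import BeautifulSoup

    # Binary mode lets the parser honour each document's own
    # encoding declaration.
    with open(path, "rb") as handle:
        for document in iter_documents(handle):
            yield BeautifulSoup(document, "xml")
```

Looping with `for soup in parse_grants("test.xml"):` would then process one us-patent-grant at a time. Because both functions are generators, the trees are built lazily and can be discarded after each iteration.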

Upvotes: 1
