Reputation: 867
I am trying to parse a large XML file downloaded from Google using BS4. However, the file contains many root elements, so the XML parser only parses the first block.
I load the file using the following command:
xml = BeautifulSoup(open("test.xml"), "xml")
The test.xml file looks like this; it has many roots:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-24.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>
.......
The HTML parser can read the full file. However, a typical file of this kind contains over 10k roots, and reading it with the HTML parser is slow and uses up all my memory. Is there a way around this problem?
Any help is appreciated.
Upvotes: 1
Views: 970
Reputation: 11396
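A valid XML file has only one root. Either add a single enclosing root to the file (the repeated XML declarations and DOCTYPE lines would also have to be removed, since they are only allowed at the start of a document), or tell the parser to parse it as "html" (this is the default). For example: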
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(open("test.xml"), "xml")
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<us-patent-grant lang="EN">
1
</us-patent-grant>
>>> BeautifulSoup(open("test.xml"))
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<html><body><p>]>
<us-patent-grant lang="EN">
1
</us-patent-grant>
<us-patent-grant lang="EN">
2
</us-patent-grant>
</p></body></html>
>>>
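If the HTML parser is too slow or memory-hungry for a file with thousands of roots, another option is to split the file into its individual documents and parse them one at a time. The following is only a minimal sketch: the helper name iter_xml_documents is made up for illustration, and it assumes every document begins with an <?xml ...?> declaration at the start of a line, as in the sample shown in the question.

```python
import io
from typing import Iterator, List, TextIO


def iter_xml_documents(stream: TextIO) -> Iterator[str]:
    """Yield each concatenated XML document in `stream` as its own string.

    Assumption: every document starts with an `<?xml ...?>` declaration
    at the beginning of a line.
    """
    buffer: List[str] = []
    for line in stream:
        # A new declaration means the previous document is complete.
        if line.lstrip().startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    # Emit whatever is left after the last declaration.
    if buffer:
        yield "".join(buffer)


# Tiny stand-in for test.xml with two roots (hypothetical content).
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant lang="EN">1</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant lang="EN">2</us-patent-grant>\n'
)
documents = list(iter_xml_documents(sample))
print(len(documents))  # 2
```

With the real file, you would pass an open file object instead of the StringIO sample and hand each yielded string to BeautifulSoup(doc, "xml") inside the loop. Because the helper is a generator, only one document needs to be held in memory at a time, provided you do not collect the results in a list as the demo does.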
Upvotes: 1