Reputation: 3294
The following code is not working as expected:
import requests
from bs4 import BeautifulSoup
url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=xml&db=pmc'
response = requests.get(url)
soup = BeautifulSoup(response.text,'lxml')
links = soup.find_all('link')
The links result produces a bunch of <link/> tags (only some shown):
<link/>
<name>pmc_sra</name>
<menu>SRA</menu>
<description>Links to SRA</description>
<dbto>sra</dbto>
<link/>
<name>pmc_structure</name>
<menu>Structure Links</menu>
<description>Published 3D structures</description>
<dbto>structure</dbto>
<link/>
Printing response.text shows (only partial print shown):
<Link>
<Name>pmc_sra</Name>
<Menu>SRA</Menu>
<Description>Links to SRA</Description>
<DbTo>sra</DbTo>
</Link>
<Link>
<Name>pmc_structure</Name>
<Menu>Structure Links</Menu>
<Description>Published 3D structures</Description>
<DbTo>structure</DbTo>
</Link>
<Link>
Importantly, each Link tag contains other tags, whereas BeautifulSoup suggests the link tags stand alone.
If I try lxml directly, I get the correct Link tags:
from lxml import etree
# root = etree.fromstring(response.text)  # raises an error; see note below
root = etree.fromstring(response.text.encode('utf-8'), parser=etree.XMLParser(encoding='utf-8'))
for link in root.iter("Link"):
    etree.dump(link)
produces:
<Link>
<Name>pmc_sra</Name>
<Menu>SRA</Menu>
<Description>Links to SRA</Description>
<DbTo>sra</DbTo>
</Link>
<Link>
<Name>pmc_structure</Name>
<Menu>Structure Links</Menu>
<Description>Published 3D structures</Description>
<DbTo>structure</DbTo>
</Link>
Note that I was getting an error with the simpler etree.fromstring call. Perhaps the problem with BeautifulSoup is an encoding problem?
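For reference, a minimal reproduction of that error, assuming (as seems likely for an EUtils response) that response.text begins with an XML declaration carrying an encoding attribute:

```python
from lxml import etree

doc = '<?xml version="1.0" encoding="UTF-8"?><root><a>x</a></root>'

# lxml refuses a str that carries its own encoding declaration, since the
# declared encoding could contradict the already-decoded text.
try:
    etree.fromstring(doc)
except ValueError as e:
    print(e)

# Encoding to bytes first lets lxml apply the declaration itself.
root = etree.fromstring(doc.encode('utf-8'))
print(root.find('a').text)
```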
I'm using what I believe are the newest BeautifulSoup (4.8.2) and lxml (4.5.0) on Python 3.7.6.
Upvotes: 1
Views: 1264
Reputation: 18281
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
I believe the problem is with using "lxml" versus "lxml-xml". If I am correct, Beautiful Soup is parsing your XML as HTML and therefore mangling the data: HTML tag names are case-insensitive (so <Link> is lowercased), and <link> is a void element in HTML, so the parser closes it immediately and its children end up as siblings.
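A small sketch of the difference on a trimmed-down sample of the EUtils XML (the snippet below is an assumption, modeled on the output shown in the question):

```python
from bs4 import BeautifulSoup

xml = """<eLinkResult>
  <Link>
    <Name>pmc_sra</Name>
    <DbTo>sra</DbTo>
  </Link>
</eLinkResult>"""

# 'lxml' selects lxml's HTML parser: tag names are lowercased and <link>
# is treated as a void element, so its children are pushed outside it.
html_soup = BeautifulSoup(xml, 'lxml')
print(html_soup.find('link'))  # empty, self-closed

# 'lxml-xml' (equivalently features='xml') selects lxml's XML parser,
# which is case-sensitive and keeps <Link> as a container.
xml_soup = BeautifulSoup(xml, 'lxml-xml')
link = xml_soup.find('Link')
print(link.Name.text, link.DbTo.text)
```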
Upvotes: 2