beautifulsoup not parsing xml tag correctly but lxml is

Question

The following code is not working as expected:

import requests
from bs4 import BeautifulSoup
url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=xml&db=pmc'

response = requests.get(url)

soup = BeautifulSoup(response.text,'lxml')
links =  soup.find_all('link')

The links result produces a bunch of standalone link tags (only some shown):


pmc_sra
SRA
Links to SRA
sra

pmc_structure
Structure Links
Published 3D structures
structure

Printing response.text shows (only partial print shown):


    pmc_sra
    SRA
    Links to SRA
    sra


    pmc_structure
    Structure Links
    Published 3D structures
    structure

Importantly, each Link tag in the response contains other tags, whereas BeautifulSoup reports the link tags as standing alone, with no children.

If I try lxml directly, I get the correct link tags:

from lxml import etree
#root = etree.fromstring(response.text)
root = etree.fromstring(response.text.encode('utf-8'),parser=etree.XMLParser(encoding='utf-8'))

for link in root.iter("Link"):
    etree.dump(link)

produces:


    pmc_sra
    SRA
    Links to SRA
    sra



    pmc_structure
    Structure Links
    Published 3D structures
    structure

Note, I was getting an error with the simpler etree.fromstring call. Perhaps the problem with BeautifulSoup is an encoding problem?

I'm using what I think is the newest BeautifulSoup (4.8.2) and LXML (4.5.0) in Python 3.7.6

David · Accepted Answer

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

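I believe the problem is using "lxml" instead of "lxml-xml" as the parser. If I am correct, Beautiful Soup is parsing your XML as HTML and mangling the data as a result: in HTML, link is a void element that cannot contain children, so the nested tags do not end up inside it.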
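A minimal sketch of the difference, using a small inline sample instead of the live request. The element names in the sample (Name, Menu, Description, DbTo) and the wrapper elements are assumptions about the response layout, since the post shows only the text values; adjust them to match the real document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the response in the question.
# The child tag names are assumed, not taken from the post.
sample = """<eInfoResult><DbInfo><LinkList>
<Link><Name>pmc_sra</Name><Menu>SRA</Menu><Description>Links to SRA</Description><DbTo>sra</DbTo></Link>
<Link><Name>pmc_structure</Name><Menu>Structure Links</Menu><Description>Published 3D structures</Description><DbTo>structure</DbTo></Link>
</LinkList></DbInfo></eInfoResult>"""

# XML parser: tag names keep their case, and Link keeps its children.
xml_soup = BeautifulSoup(sample, "lxml-xml")
xml_links = xml_soup.find_all("Link")
link_names = [link.find("Name").get_text() for link in xml_links]
print(link_names)

# HTML parser, for comparison: tag names are lowercased and link is
# treated as an HTML element, which is what the question ran into.
html_soup = BeautifulSoup(sample, "lxml")
html_links = html_soup.find_all("link")
print(html_links)
```

With the real request, the equivalent change would be BeautifulSoup(response.content, 'lxml-xml') followed by soup.find_all('Link'). Note the capital L: the XML parser is case-sensitive, unlike the HTML one.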
