Kevin

Reputation: 421

lxml is not found within Beautiful Soup

I am trying to use beautifulsoup4 to parse a series of webpages written in XHTML. I am assuming that for best results I should pair it with an XML parser, and to my knowledge the only one supported by Beautiful Soup is lxml.

However, when I try to run the following as per the Beautiful Soup documentation:

import requests
from bs4 import BeautifulSoup

r = requests.get('hereiswhereiputmyurl')
soup = BeautifulSoup(r.content, 'xml')

it results in the following error:

FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?

It's driving me crazy. I have found records of two other users who posted the same problem:

Here How to re-install lxml?

and Here bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I used this post (linked directly below) to reinstall and update lxml and also updated Beautiful Soup, but I am still getting the error: Installing lxml, libxml2, libxslt on Windows 8.1

Beautiful Soup is otherwise working, because when I ran the following it presented me with its usual wall of markup:

soup = BeautifulSoup(r.content, 'html.parser')

Here are my specs: Windows 8.1, Python 3.5.2. I use the Spyder IDE in Anaconda 3 to run my code (which, admittedly, I do not know much about).

I'm sure it's a mistake a beginner would make, because as I stated before I have very little programming experience.

How can I resolve this issue? Or, if it is a known bug, would you recommend that I just use lxml by itself to scrape the data?

Upvotes: 9

Views: 17976

Answers (3)

blahblacksheep

Reputation: 1

Just import lxml and then use it as the parser. In 2021, if you install lxml using pip, for some reason PyCharm still needs it installed again for every new program you write.
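For what it's worth, a minimal sketch of what that looks like, assuming lxml is installed in the interpreter your IDE is actually using (the URL is just a placeholder):

import lxml  # importing it directly confirms the package is visible to this interpreter
import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com')
soup = BeautifulSoup(r.content, 'lxml')  # lxml is now available as the parser
print(soup.title)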

Upvotes: 0

Eeshaan

Reputation: 1635

This is a pretty old post, but I had this problem today and found the solution. You need to have lxml installed. Open the terminal and type

pip3 install lxml

Now restart the dev environment (VS Code, Jupyter notebook or whatever) and it should work.
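If it still fails after that, a quick way to confirm the install actually landed in the environment you are running is something like this (the sample markup is made up):

from bs4 import BeautifulSoup

# raises bs4.FeatureNotFound if lxml is still missing from this environment
soup = BeautifulSoup('<root><item>ok</item></root>', 'xml')
print(soup.item.text)  # ok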

Upvotes: 6

Kaan E.

Reputation: 515

I think the problem is r.content. Normally it gives the raw content of the response, which is not necessarily an HTML page; it can be JSON, etc.
Try feeding r.text to the soup instead.

soup = BeautifulSoup(r.text, 'lxml')

Better:

r.encoding = 'utf-8'
page = r.text
soup = BeautifulSoup(page, 'lxml')

If you are going to parse XML, you can use 'lxml-xml' as the parser.
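Putting that together for an XHTML page, a rough sketch (the URL is a placeholder and the <a> tags are only an example of something to pull out):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com/page.xhtml')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'lxml-xml')  # 'lxml-xml' is the lxml-backed XML parser

# for instance, list every link in the document
for link in soup.find_all('a'):
    print(link.get('href'))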

Upvotes: 1
