Steven Werner

Reputation: 167

Parsing HTML with requests and BeautifulSoup

I'm not sure if I'm approaching this correctly. I'm using requests to make a GET:

con = s.get(url)

When I call con.content, the whole page is there. But when I pass con into BS:

soup = BeautifulSoup(con.content)
print(soup.a)

I get None. There are lots of tags in there, none of them behind any JS, that are present when I call con.content, but when I try to parse with BS most of the page is not there.

Upvotes: 1

Views: 5292

Answers (3)

Md. Mohsin

Reputation: 1832

Change the parser to html5lib

pip install html5lib

And then,

soup = BeautifulSoup(con.content, 'html5lib')
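
As a minimal end-to-end sketch (the URL here is a stand-in, since the question doesn't show the real one):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # stand-in for the URL from the question
s = requests.Session()
con = s.get(url)

# html5lib builds the tree the way browsers do, so it tolerates
# malformed markup that can trip up the default parser
soup = BeautifulSoup(con.content, 'html5lib')
print(soup.a)  # should now print the first <a> tag instead of None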

Upvotes: 2

mnjeremiah

Reputation: 281

Without being able to see the HTML you're getting, I just tried this on the Hacker News site and it returns all the a tags as expected.

import requests
from bs4 import BeautifulSoup

s = requests.session()

con = s.get('https://news.ycombinator.com/')

soup = BeautifulSoup(con.text)

links = soup.find_all('a')

for link in links:
    print(link)

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77454

The a tags are probably not at the top level.

soup.find_all('a')

is probably what you wanted.

In general, I have found lxml to be more reliable, more consistent in its API, and faster. Yes, even more reliable: I have repeatedly had documents where BeautifulSoup failed to parse them, but lxml in its robust mode, lxml.html.soupparser, still worked well. And there is the lxml.etree API, which is really easy to use.
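
To make that concrete, here is a minimal sketch of both approaches, reusing the Hacker News URL from the earlier answer purely as an example:

import requests
import lxml.html
from lxml.html import soupparser

con = requests.get('https://news.ycombinator.com/')  # example URL only

# Fast path: lxml's own HTML parser, queried through the etree-style API
root = lxml.html.fromstring(con.content)
print(root.xpath('//a/@href')[:5])  # first five link targets

# Robust fallback: the soup-based parser copes with badly broken markup
root = soupparser.fromstring(con.text)
print([a.get('href') for a in root.findall('.//a')][:5])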

Upvotes: 1
