Reputation: 105

Errors with Beautiful Soup output

I'm trying to scrape data from a GameSpot webpage using Beautiful Soup. However, the result is very different from what I see in the page source viewer. First off, a lot of errors are produced. For instance, we have

import bs4
import requests

r = requests.get(link)
soup = bs4.BeautifulSoup(r.text, "html.parser")

And yet soup.title gives

<title>404: Not Found - GameSpot</title>.

The data I actually want to scrape does not even appear. Is it because the webpage also contains JavaScript? If so, how can I get around this?

Upvotes: 0

Answers (2)

Nuno André

Reputation: 5387

You're only sending an HTTP request to the server. You need to process the JavaScript to get the content.

A headless browser with JavaScript support, like Ghost, would be a good choice.

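from bs4 import BeautifulSoup
from ghost import Ghost

ghost = Ghost()
ghost.open(link)
# evaluate() runs the script in the loaded page; the result is unpacked together with the resources
page, resources = ghost.evaluate('document.documentElement.innerHTML;')
soup = BeautifulSoup(page, 'html.parser')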

ghost.evaluate('document.documentElement.innerHTML') returns the dynamically generated content, not the static markup you'd see by viewing the page source.
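Before reaching for a headless browser, it may be worth confirming what the plain request actually returned, since the title in the question reads "404: Not Found". The following is only a minimal sketch: the helper name status_and_title is made up for illustration, and it assumes the requests and bs4 packages are installed.

```python
import requests
from bs4 import BeautifulSoup


def status_and_title(response):
    """Return (status_code, title text or None) for a requests.Response.

    Hypothetical helper, not part of requests or bs4.
    """
    soup = BeautifulSoup(response.text, "html.parser")
    # soup.title is None when the document has no <title> element
    title = soup.title.string if soup.title else None
    return response.status_code, title


def fetch_soup(url):
    """Fetch url and parse it, raising requests.HTTPError on a 4xx/5xx reply."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

If status_and_title(requests.get(link)) reports a 404 status, the missing data is probably an HTTP-level problem (wrong URL, or the server rejecting the request) rather than a JavaScript one.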

Upvotes: 1

zverianskii

Reputation: 481

Your connection error is: socket.error: [Errno 54] Connection reset by peer. The first time you connect to http://www.gamespot.com you must capture the cookie it sets and send it back in the headers of your requests for the other pages.
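One way to carry the cookie across requests is a requests.Session, which stores cookies from each response and sends them on later requests. This is only a sketch under that assumption: the function name fetch_with_cookies and its parameters are hypothetical, and whether GameSpot actually requires the cookie is taken from the answer above, not verified.

```python
import requests


def fetch_with_cookies(home_url, page_urls, session=None):
    """Visit home_url first, then fetch each page with the same session.

    Hypothetical helper: any cookies set by the first response are kept in
    session.cookies and sent automatically with the later requests.
    """
    if session is None:
        session = requests.Session()
    # First visit: its only purpose is to let the server set its cookies
    session.get(home_url, timeout=10)
    # Later requests reuse the session, so the stored cookies go along
    return [session.get(url, timeout=10) for url in page_urls]
```

For example, fetch_with_cookies("http://www.gamespot.com", [link]) returns a list with one response, whose text can then be passed to Beautiful Soup as in the question.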

Upvotes: 0
