Reputation: 105
I'm trying to scrape data from a GameSpot page using BeautifulSoup. However, the result is very different from what I see in the browser's page source viewer. First off, a lot of errors are produced. For instance, we have
r = requests.get(link)
soup = bs4.BeautifulSoup(r.text)
And yet soup.title gives
<title>404: Not Found - GameSpot</title>
The data I actually want to scrape does not even appear. Is it because the page also contains JavaScript? If so, how can I get around this?
Upvotes: 0
Views: 313
Reputation: 5387
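You're only sending an HTTP request to the server and reading the raw response; nothing in that process runs JavaScript. You need something that executes the page's JavaScript to get that content.
A headless browser with JavaScript support, such as Ghost, would be a good choice.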
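from bs4 import BeautifulSoup
from ghost import Ghost

ghost = Ghost()
ghost.open(link)  # link is the URL of the page to scrape
# Run JavaScript inside the loaded page and get the rendered DOM back as a string.
page, resources = ghost.evaluate('document.documentElement.innerHTML;')
soup = BeautifulSoup(page)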
ghost.evaluate('document.documentElement.innerHTML') returns the dynamically generated content, that is, the DOM after the scripts have run, not the static markup you'd see by viewing the page source.
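To illustrate the difference between static and generated content, here is a minimal sketch. The HTML string is a made-up example (not GameSpot's real markup) in which a script fills an empty element. BeautifulSoup only parses the markup and never runs the script, so the element stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <div> is empty in the source and is filled in by a script.
STATIC_HTML = """
<html><body>
<div id="scores"></div>
<script>document.getElementById("scores").textContent = "9/10";</script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
scores = soup.find(id="scores")

# BeautifulSoup parses markup only; the script is never executed,
# so the text a browser would show inside the div is missing here.
print(repr(scores.get_text()))
```

A browser (or a headless one) would show the text inside that element, because it executes the script before you read the DOM.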
Upvotes: 1
Reputation: 481
Your connection error is: socket.error: [Errno 54] Connection reset by peer. The first time you connect to http://www.gamespot.com, you must capture the cookie the server sets and send it back in the headers of your requests for other pages.
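A rough sketch of that idea, using only the standard library (urllib with a cookie jar) rather than requests. With requests, a requests.Session object keeps cookies across calls in the same way. The helper names make_cookie_opener and fetch_after_home and the User-Agent value are assumptions made for illustration. Whether GameSpot actually requires a cookie or a particular user agent is not something this sketch verifies.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Build an opener that stores cookies from responses and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Assumption: some sites reject Python's default user agent; this is an example value.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar


def fetch_after_home(opener, home_url, page_url, timeout=10):
    """Visit home_url first to pick up cookies, then fetch page_url with them."""
    with opener.open(home_url, timeout=timeout):
        pass  # the body is not needed here, only the Set-Cookie headers
    with opener.open(page_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example usage (not run here, since it needs network access):
#   opener, jar = make_cookie_opener()
#   html = fetch_after_home(opener, "http://www.gamespot.com", link)
#   soup = bs4.BeautifulSoup(html, "html.parser")
```

The same opener should be reused for every later request so that the stored cookies keep being sent.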
Upvotes: 0