Nikola

Reputation: 890

HTML DOM basic scraping

I am trying to get a specific element from the HTML DOM, one that appears when you inspect the page in the browser, but my code only sees the raw HTML source, before any JavaScript has been executed. Any ideas? The only thing I do differently from other examples is the User-Agent header line, which avoids a 403 error.

import urllib2
from bs4 import BeautifulSoup as BS

# Send a browser-like User-Agent to avoid a 403 error
request = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0"})

html = urllib2.urlopen(request).read()

soup = BS(html, 'html.parser')

print soup.find('div', {'class' : 'video'})

Upvotes: 0

Views: 60

Answers (1)

pierlauro

Reputation: 176

this is looking into the pure HTML code that doesn't have the javascript executed

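BeautifulSoup does not execute JavaScript; it only parses the markup it is given. urllib2 returns the raw page exactly as the server sent it, so no script has run and any element that a script would create is absent from what you parse.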
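To illustrate, here is a minimal sketch in Python 3 using the standard library's html.parser, the same parser that was passed to BeautifulSoup in the question. The page content in RAW_HTML and the DivClassCollector class are hypothetical, made up for this example: the div with class "video" would only exist after a browser ran the script, so a static parser never sees it.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div class="video"> is created by the script,
# so it is not present in the markup the server sends.
RAW_HTML = """
<html><body>
<div id="player"></div>
<script>
  var d = document.createElement('div');
  d.className = 'video';
  document.getElementById('player').appendChild(d);
</script>
</body></html>
"""


class DivClassCollector(HTMLParser):
    """Collects the class attribute of every <div> start tag in the markup."""

    def __init__(self):
        super().__init__()
        self.div_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # attrs is a list of (name, value) pairs; class may be missing.
            self.div_classes.append(dict(attrs).get("class"))


collector = DivClassCollector()
collector.feed(RAW_HTML)

# The script body is treated as plain text and never run, so only the
# static <div id="player"> is seen and no div has class "video".
found_video = "video" in collector.div_classes
print(collector.div_classes)
print(found_video)
```

The only div found is the static one with no class attribute, which matches what soup.find('div', {'class' : 'video'}) reports when it returns None.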

The only thing I do differently from the others is that line to avoid 403 error

urllib2's default User-Agent string is "Python-urllib/<python_version>" (for example, "Python-urllib/2.7"). The website you're trying to scrape is probably filtering that user agent; by sending a browser-like one such as "Mozilla/5.0", you get the page back as if you were visiting it from a browser.
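You can check both values without making any network request. The sketch below assumes Python 3, where urllib2's functionality lives in urllib.request; the URL is a placeholder, not the site from the question.

```python
import urllib.request

# Placeholder URL, used only to build a Request object; nothing is fetched.
url = "http://example.com/"

# A freshly built opener carries the library's default User-Agent header.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua)

# Headers passed to Request are sent instead of the default of the same name.
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
custom_ua = request.get_header("User-agent")
print(custom_ua)
```

The first value starts with "Python-urllib/", which is what a server-side filter would match on; the second is the browser-like string that gets sent in its place.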

Upvotes: 1
