Muhammad Nadeem

Reputation: 21

Not able to download complete source code of a web page

I am trying to scrape this web page from Python, using the requests library and urllib, but I am not able to download the complete HTML source. When I inspect elements in my web browser, I see the complete HTML, which I believe can be used for scraping. When I fetch the same URL from Python, the HTML tags that hold the data are missing, so I cannot scrape data from them. Here is my sample code:

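import urllib.request  # "import urllib" alone does not reliably expose urllib.request

from bs4 import BeautifulSoup as BS

url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent = 'my-user-agent'

# Send a custom User-Agent header and read the raw HTML the server returns
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()

# Parse the static HTML (JavaScript on the page is not executed here)
soup = BS(html, 'html.parser')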

Can anybody please help me out? Thanks.

Upvotes: 1

Views: 732

Answers (3)

Muhammad Nadeem

Reputation: 21

Thanks to you both. @blakebjorn, I tried your method, but it opened a new Chrome instance and displayed the result there. What I want is to get the source code inside my script and scrape data from it. Here is the code:

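from selenium import webdriver

# Path to the locally downloaded chromedriver executable
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")

# page_source holds the rendered HTML as a string, ready to pass to a parser
html = driver.page_source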

Upvotes: 0

blakebjorn

Reputation: 163

The page is likely built by JavaScript: the site sends the same source you are pulling with urllib, and the browser then executes the JavaScript, which modifies that source to render the page you are seeing.
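One way to check this diagnosis is to see whether text you can read in the browser appears anywhere in the markup of the raw response, outside of script blocks. The following is only a minimal sketch: `visible_text` and the sample HTML string are made up for illustration (they are not taken from the Udemy page), and it uses only the standard library so it runs without extra installs.

```python
from html.parser import HTMLParser


class _VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style block
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see in the static markup (hypothetical helper)."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical stand-in for what a server might send for a JavaScript-built page:
# the container is empty and the data only exists inside a script.
static_html = """
<html><body>
  <h1>Courses</h1>
  <div id="course-list"></div>
  <script>render({"title": "Financial Modeling 101"});</script>
</body></html>
"""

text = visible_text(static_html)
print(text)                              # Courses
print("Financial Modeling 101" in text)  # False
```

If a string you see in the browser is missing from the visible text of the downloaded HTML, the content is most likely being added after the page loads.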

You will need to use something like Selenium, which opens the page in a browser, renders the JavaScript, and then returns the resulting source, e.g.:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")

# The rendered HTML as a string; alternatively:
# driver.execute_script("return document.body.innerHTML;")
html = driver.page_source
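If the goal is to work with the rendered source inside your own script rather than look at a browser window, Chrome can be run headless and the resulting string handed to a parser. This is a rough sketch, assuming a recent Selenium release and a working Chrome/chromedriver setup. `fetch_rendered_html` and `extract_headings` are hypothetical helper names, and collecting `h1` to `h3` headings is only a placeholder for whatever elements actually hold the data. The parsing step uses the standard library's html.parser to stay dependency-free; BeautifulSoup can be given the same string.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3")


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down


class _HeadingParser(HTMLParser):
    """Collects the text of h1-h3 elements (placeholder for the real target)."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


def extract_headings(html):
    """Return the stripped text of every non-empty heading in html."""
    parser = _HeadingParser()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headings if h.strip()]


# Usage (launches a browser, so it is left commented out):
# html = fetch_rendered_html("https://www.udemy.com/topic/financial-analysis/?lang=en")
# print(extract_headings(html))
```

Note that `page_source` reflects the page at the moment it is read. If the data is loaded some time after the initial page load, an explicit wait (for example Selenium's `WebDriverWait`) may be needed before reading it.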

Upvotes: 1

Chirag

Reputation: 17

I recommend using the standard-library module urllib2, which lets you fetch web resources comfortably. Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request. Example:

import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()

For parsing the HTML, have a look at BeautifulSoup.

Upvotes: 0
