Reputation: 80
When I open the URL I want to scrape in my browser, the HTML shows everything. But when I scrape it, I only get back a portion of the HTML, and it doesn't even match what I see. The website does show a loading screen when it opens in my browser, but I'm not sure whether that's the issue. Maybe they blocked people from scraping it? HTML I get back:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title></title>
<base href="/app"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="favicon.ico" rel="icon" type="image/x-icon"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="styles.css" rel="stylesheet"/></head>
<body class="cl">
<app-root>
<div class="loader-wrapper">
<div class="loader"></div>
</div>
</app-root>
<script src="runtime.js" type="text/javascript"></script><script src="polyfills.js" type="text/javascript"></script><script src="scripts.js" type="text/javascript"></script><script src="main.js" type="text/javascript"></script></body>
<script src="https://www.google.com/recaptcha/api.js"></script>
<noscript>
<meta content="0; URL=assets/javascript-warning.html" http-equiv="refresh"/>
</noscript>
</html>
Code I use:
from twill.commands import *
import time
import requests
from bs4 import BeautifulSoup

go('url')
time.sleep(4)
showforms()
try:
    fv("1", "username", "username")
    fv("1", "password", "*********")
    submit('0')
except:
    pass
time.sleep(2.5)

url = "url_after_login"
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
#name_box = soup.find('h1', attrs={'class': 'trend-and-value'})
Upvotes: 0
Views: 1127
Reputation: 4462
It seems that the page content is generated dynamically by JavaScript, so a plain `requests.get` only returns the app shell you posted, not the rendered data. You can combine Selenium with Beautiful Soup to parse such a page. The advantage of Selenium is that it drives a real browser, which executes the JavaScript and can reproduce user behavior: clicking buttons or links, entering text into input fields, and so on.
Here is a short example:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# maximum wait of 30 seconds
DELAY = 30
# target URL
url = '<<WEBSITE_URL>>'
# define options for the Selenium driver
chrome_options = webdriver.ChromeOptions()
# this one makes the browser "invisible";
# comment it out to watch the actions performed by Selenium
chrome_options.add_argument('--headless')
# create the Selenium web driver
# (Selenium 4 takes the driver path via a Service object)
driver = webdriver.Chrome(service=Service("<PATH_TO_CHROME_DRIVER>"), options=chrome_options)
# open web page
driver.get(url)
# wait up to 30 seconds for the h1 element to appear
h1_element = WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.trend-and-value')))
# parse the rendered page content using bs4
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
# close the browser when done
driver.quit()
An alternative solution is to analyze how the JavaScript-rendered page gets its data. Usually such pages retrieve data from backend endpoints in JSON format, and your scraper can call those endpoints directly instead of rendering the page at all.
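For example, once you have spotted the data request in the browser's developer tools (Network tab), you can call it directly with requests. This is only a sketch: the endpoint URL and login payload below are hypothetical placeholders, not the real site's API.

```python
import requests

# Hypothetical endpoint discovered in the browser's Network tab --
# this URL is a placeholder, not the real site's API.
API_URL = "https://example.com/api/trend-and-value"

def fetch_json(url, session=None):
    """Fetch a JSON endpoint, reusing one Session so login cookies persist."""
    session = session or requests.Session()
    response = session.get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return response.json()

# Usage (requires network access and a real endpoint):
# session = requests.Session()
# session.post("https://example.com/api/login",          # hypothetical login endpoint
#              json={"username": "...", "password": "..."})
# data = fetch_json(API_URL, session=session)
# print(data)
```

The Session matters here: if the API requires login, authenticate through the same Session first, so its cookies are sent with every later request. This is also the bug in the question's code, where twill logs in but the follow-up `requests.get` runs in a separate, unauthenticated session.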
Upvotes: 1