Reputation: 465
I am working on a web scraping tool in Python (specifically in a Jupyter notebook) that scrapes a few real estate pages and saves data such as price, address, etc.
It works fine for one of the pages I picked, but when I try to scrape this page: sreality.cz (sorry, the page is in Czech, but the actual content is not that important here) using requests.get(), I get this result:
<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:title" content="Sreality.cz • reality a nemovitosti z celé ČR">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
<meta property="og:description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:url" content="https://www.sreality.cz/">
<meta ng-if="metaStatus.value" name="szn:status" content="{{metaStatus.value}}">
<meta http-equiv="imagetoolbar" content="no">
<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
<meta name="msapplication-TileColor" content="#2b5797">
<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
<meta name="msapplication-config" content="/img/icons/browserconfig.xml" />
<link rel="alternate" type="application/rss+xml" ng-href="{{ rss.url }}" ng-if="rss.url">
<link ng-repeat="lang in metaSeo.languages" rel="alternate" hreflang="{{lang.code}}" ng-href="{{lang.url}}">
<link rel="stylesheet" href="/css/all.css?2e96626">
<!-- Begin Inspectlet Embed Code -->
<script type="text/javascript" id="inspectletjs">
window.__insp = window.__insp || [];
__insp.push(['wid', 821249485]);
__insp.push(["virtualPage"]);
(function() {
function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); };
setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp();
})();
</script>
<!-- End Inspectlet Embed Code -->
<!--[if lte IE 8]>
<script>
document.createElement('popover');
document.createElement('mortgage');
document.createElement('vendor');
document.createElement('hp-signpost');
document.createElement('category-switcher');
document.createElement('feedback');
document.createElement('bottom');
document.createElement('panorama');
document.createElement('panorama-prev');
document.createElement('sphere-viewer');
document.createElement('sphere-viewer-prev');
document.createElement('save-filter');
</script>
<![endif]-->
<!-- Statistiky -->
<script src="https://h.imedia.cz/js/dot-small.js" type="text/javascript"></script>
<script type="text/javascript">
(function() {
try {
// Při přesměrování na hashbang URL (IE8-9) ztrácíme referrer,
// který je potřeba pro správné počítání statistik.
if (window.sessionStorage) { // někdo může mít DOM storage zakázaný
var l = document.createElement('a');
l.href = document.referrer;
var referrerHostname = l.hostname;
if (window.location.hostname != referrerHostname) {
window.sessionStorage.setItem('referrer', l.href);
}
}
// Starý android (< 4.0) v kombinaci s angularem špatně pracuje s hashem v URL.
// Považuje ho za součást query případně path.
// Na takových zařízech se budeme tvářit, že žádný hash nebyl.
if (parseInt((/android (\d+)/.exec(window.navigator.userAgent.toLowerCase()) || [])[1], 10) < 4) {
var hrefWithoutHashbang = window.location.href.replace('/#!', '');
var hashIndex = hrefWithoutHashbang.indexOf('#');
if (hashIndex != -1) {
window.location.replace(hrefWithoutHashbang.substring(0, hashIndex));
}
}
} catch (e) {}
})();
</script>
<!-- API mapy.cz -->
<script type="text/javascript" src="https://api4.mapy.cz/loader.js"></script>
<script type="text/javascript">Loader.load(null, {poi: true, pano: true})</script>
<!-- Login reklama -->
<script src="https://i.imedia.cz/js/im3.js" type="text/javascript"></script>
<script src="https://1.im.cz/software/promo/promo-sbrowser.js"></script>
<!-- Rozkopírování SID cookie -->
<script src="https://h.imedia.cz/js/sid.js"></script>
<!-- Login -->
<script src="https://login.szn.cz/js/api/login.js"></script>
<script>
login.cfg({
serviceId: "sreality"
});
</script>
<!-- KONFIGURACE -->
<script src="/js/conf/config.js?2e96626"></script>
<script src="/js/advert.js"></script>
<script src="/js/all.js?2e96626"></script>
<script type="text/javascript">
if (window.DOT) {
var dotCfg = {
service: 'sreality'
};
if (window.SrealityABTest && window.SrealityABTest.getVariant()) {
dotCfg.abtest = window.SrealityABTest.getVariant();
}
DOT.cfg(dotCfg);
}
</script>
<noscript>
<meta http-equiv="refresh" content="0;url=?_escaped_fragment_="/>
</noscript>
<meta name="fragment" content="!" ng-if="metaSeo.showMetaFragment" />
</head>
<!--[if IE 8]> <body class="ie8"> <![endif]-->
<!--[if IE 9]> <body class="notie8 ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<body class="notie8 notie9 lang-{{html.lang}}">
<!--<![endif]-->
<div loading-line></div>
<div page-layout>
<div ng-view></div>
</div>
</body>
</html>
However, this differs from the HTML I see when I inspect the page in Chrome's developer tools. Part of that code is below (the whole thing doesn't fit here, and uploadtext isn't working for some reason):
<!DOCTYPE html>
<html lang="cs" ng-app="sreality" ng-controller="MainCtrl" class="ng-scope"><head><style type="text/css">@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide{display:none !important;}ng\:form{display:block;}.ng-animate-block-transitions{transition:0s all!important;-webkit-transition:0s all!important;}.ng-hide-add-active,.ng-hide-remove{display:block!important;}</style>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="Byty na prodej Brno-město, posledních 30 dní • Sreality.cz" class="ng-binding">Byty na prodej Brno-město, posledních 30 dní • Sreality.cz</title>
<meta name="description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
<meta property="og:title" content="Byty na prodej Brno-město, posledních 30 dní">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
<meta property="og:description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
<meta property="og:url" content="https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic">
<!-- ngIf: metaStatus.value --><meta ng-if="metaStatus.value" name="szn:status" content="200" class="ng-scope"><!-- end ngIf: metaStatus.value -->
<meta http-equiv="imagetoolbar" content="no">
<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
<meta name="msapplication-TileColor" content="#2b5797">
<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
<meta name="msapplication-config" content="/img/icons/browserconfig.xml">
<!-- ngIf: rss.url --><link rel="alternate" type="application/rss+xml" ng-href="/api/cs/v2/estates/rss?category_main_cb=1&locality_district_id=72&suggested_regionId=-1&suggested_districtId=-1&estate_age=31&locality_region_id=14&category_type_cb=1" ng-if="rss.url" class="ng-scope" href="/api/cs/v2/estates/rss?category_main_cb=1&locality_district_id=72&suggested_regionId=-1&suggested_districtId=-1&estate_age=31&locality_region_id=14&category_type_cb=1"><!-- end ngIf: rss.url -->
I can see from the first HTML document, the one that requests.get() downloads, that the page runs some scripts, which probably explains why the HTML is different.
I already tried urllib, but the resulting HTML was the same.
Is there a way to download the HTML I see in Chrome's developer tools so I can scrape it?
Upvotes: 4
Views: 1331
Reputation: 22440
If you are ultimately after the data on that page, you can get it easily using selenium in combination with BeautifulSoup. The code below prints the link to each apartment listed on the page.
from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser run the page's JavaScript, then grab the rendered HTML.
driver = webdriver.Chrome()
driver.get("https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Build an absolute URL from the title link inside each .text-wrap element.
for item in soup.select(".text-wrap"):
    link = "https://www.sreality.cz" + item.select_one(".title").get("href")
    print(link)
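If you want to confirm that a page really needs a browser before reaching for selenium, a rough check is to look for unrendered AngularJS markup in whatever requests.get() returned. The following is only a sketch under my own assumptions: the helper name `looks_unrendered`, the regular expressions, and the two sample strings (including the `/detail/1` link) are made up for illustration and are not part of any library or of the site itself.

```python
import re

# Matches AngularJS-style template placeholders such as {{ html.lang }}.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.$]+\s*\}\}")

# Matches an ng-view container that has not been filled in yet.
EMPTY_VIEW = re.compile(r"<div\s+ng-view\s*>\s*</div>")


def looks_unrendered(html: str) -> bool:
    """Return True if the HTML still looks like an unrendered AngularJS shell.

    Heuristic only (an assumption, not an official check): the markup
    contains raw {{ ... }} placeholders or an empty <div ng-view></div>.
    """
    has_placeholders = PLACEHOLDER.search(html) is not None
    has_empty_view = EMPTY_VIEW.search(html) is not None
    return has_placeholders or has_empty_view


# Hypothetical samples modelled on the two HTML dumps in the question.
raw = '<html lang="{{ html.lang }}"><body><div ng-view></div></body></html>'
rendered = (
    '<html lang="cs"><body><div ng-view>'
    '<a class="title" href="/detail/1">Byt</a>'
    "</div></body></html>"
)

print(looks_unrendered(raw))
print(looks_unrendered(rendered))
```

If the check returns True for the text you got from requests.get(), the content is most likely filled in by JavaScript, and a browser-driven approach like the selenium one above is a reasonable fallback.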
Upvotes: 1