Reputation:
I am scraping a website that has a pagination in it. I was testing the loop and print the output in it from beautifulsoup. When the results are printed, I noticed that the result is not a complete html text. It only includes the first part of the html. Here are my code
from bs4 import BeautifulSoup
import requests
import time
total_pages = 2295
for i in range(1,total_pages,1):
pageNumber = str(i)
url = requests.get("https://www.propertyguru.com.sg/property-for-sale/"+pageNumber+"?order=desc&property_type=N&property_type_code%5B0%5D=CONDO&property_type_code%5B1%5D=APT&property_type_code%5B2%5D=WALK&property_type_code%5B3%5D=CLUS&property_type_code%5B4%5D=EXCON&sort=date").text
soup = BeautifulSoup(url,'html.parser')
print(soup.prettify())
When i print soup.prettify()
the result is this
<!DOCTYPE doctype html>
<!--[if gt IE 9]><!-->
<html class="no-js is-new-brand" lang="en">
<!--<![endif]-->
<head>
<title>
</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="app-id=482524585" name="apple-itunes-app">
<meta content="app-id=com.allproperty.android.consumer.sg" name="google-play-app">
<meta content="9iVXbwdOPHOH_byBFBScAHm5x-kvcPzBS_fJBFPBwbo" name="google-site-verification">
<meta content="46acd457be6effa0" name="y_key"/>
<meta content="893837EF69C47405FBAFAB120889A598" name="msvalidate.01"/>
<link href="/images/is-new-brand-favicon.ico" rel="SHORTCUT ICON"/>
<link href="/search.xml" rel="search" title="PropertyGuru Search" type="application/opensearchdescription+xml"/>
<link href="https://cdn.pgimgs.com/1574318624/sf2-search/bundles/guruweblayout/img/is-new-brand-touch-logo.png" rel="apple-touch-icon"/>
<link href="https://cdn.pgimgs.com/1574318624/sf2-search/bundles/guruweblayout/img/is-new-brand-touch-logo.png" rel="android-touch-icon"/>
<script>
// check for browsers without complete flex support ( < IE 10)
window.onload = function(e){
if(Function('/*@cc_on return document.documentMode<=10@*/')()) {
window.location = '/ie-notsupported';
}
};
</script>
<link href="//cdn1.pgimgs.com/1574318624/sg-static/cssprod/propertyguru/layout.css" rel="stylesheet" type="text/css"/>
<link href="//cdn1.pgimgs.com/1574318624/sg-static/cssprod/propertyguru/sg.css" rel="stylesheet" type="text/css"/>
<link href="//cdn1.pgimgs.com/1574318624/sg-static/cssprod/propertyguru/new_styles.css" rel="stylesheet" type="text/css"/>
<script src="//cdn1.pgimgs.com/1574318624/sg-static/jsprod/lib/modernizr-custom.min.js" type="text/javascript">
</script>
<script src="//cdn1.pgimgs.com/1574318624/sg-static/jsprod/jquery-1.12.3.min.js" type="text/javascript">
</script>
<script type="text/javascript">
var guruApp = {"environment":null,"widgetSearch":null,"widgetPoll":null,"widgetGoogleAnalytics":{"dimensions":{"dimension3":"Production","dimension4":"en","dimension13":"SG","dimension14":"web"},"googleAnalyticsObject":null,"config":{"trackingId":"UA-2417512-2","cookieDomain":"propertyguru.com.sg","siteSpeedSampleRate":10}},"userSession":{"user":{"id":null,"username":null,"roles":null,"shortlist":0,"beta":false}},"isResponsive":"false","identityEndpoint":"https:\/\/identity.propertyguru.com\/identity","defaultCurrency":"SGD","googleMaps":{"key":"AIzaSyBlCo7kpcBszvIZoH709avg1rmUjjiop0k"},"googleApis":{"key":"367223124563-is5hdjeal1rr7og4i8ii7t8imihr1dg1.apps.googleusercontent.com"}};
</script>
<link href="https://fonts.googleapis.com/css?family=Roboto:400,500" rel="stylesheet" type="text/css"/>
<link href="https://fonts.googleapis.com/css?family=Nunito:600" rel="stylesheet" type="text/css"/>
<!--[if gt IE 8]><!-->
<link href="https://cdn.pgimgs.com/1574318624/sf2-search/css/legacy_css.css" rel="stylesheet" type="text/css">
<link href="//cdn1.pgimgs.com/1574318624/sg-static/cssprod/rich/fixes.css" rel="stylesheet" type="text/css">
<!--<![endif]-->
<script type="text/javascript">
<!--
var GMAP_KEY = "AIzaSyCUbmYAT3lyhBvao9Yg-WsKtRbMxO-VvVQ";
var REGION = "SG";
var images = [];
var freetextUrl = '//api.propertyguru.com/v1/autocomplete?limit=10&locale=en&format=csv_legacy®ion=sg&objectType=HDB_ESTATE,DISTRICT,PROPERTY,STREET,MRT_STATION,SCHOOL';
//-->
</script>
<!-- GOOGLE AD MANAGER -->
<div class="clearboth">
</div>
<!-- Begin comScore Tag -->
<script>
var _comscore = _comscore || [];
_comscore.push({ c1: "2", c2: "13151479" });
(function() {
var s = document.createElement("script"), el = document.getElementsByTagName("script")[0]; s.async = true;
s.src = (document.location.protocol == "https:" ? "https://sb" : "http://b") + ".scorecardresearch.com/beacon.js";
el.parentNode.insertBefore(s, el);
})();
</script>
<noscript>
<img src="https://sb.scorecardresearch.com/p?c1=2&c2=13151479&cv=2.0&cj=1"/>
</noscript>
<!-- End comScore Tag -->
<!-- GOOGLE ANALYTICS CODE -->
<script src="https://cdn.pgimgs.com/1574318624/sf2-search/bundles/guruweblayout/js/desktop/logger.js" type="text/javascript">
</script>
<script src="https://cdn.pgimgs.com/1574318624/sf2-search/bundles/guruweblayout/js/fingerprint2.min.js" type="text/javascript">
</script>
<script src="https://cdn.pgimgs.com/1574318624/sf2-search/bundles/guruwidget/js/desktop/jquery.widgetGoogleAnalytics.js" type="text/javascript">
</script>
<!-- Google Analytics -->
<script type="text/javascript">
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
</script>
<script type="text/javascript">
if (typeof guruApp != 'undefined' && typeof guruApp.widgetGoogleAnalytics != 'undefined' && guruApp.widgetGoogleAnalytics.googleAnalyticsObject != null) {
guruApp.widgetGoogleAnalytics.googleAnalyticsObject.init();
}
</script>
<script src="https://cdn.pgimgs.com/1574318624/sf2-search/bundles/guruweblayout/js/desktop/jquery.eventDispatcher.js" type="text/javascript">
</script>
<script type="text/javascript">
$(document).ready(function () {
var $body = $('body'),
track = function(category, action, label, value, noninteraction, dimensions) {
label = cleanText(label);
guruApp.widgetGoogleAnalytics.googleAnalyticsObject.trackEvent(category, action, label, value, noninteraction, dimensions);
},
cleanText = function(str) {
return str.replace(/^https?:\/\/[^\/]+/, '').replace(/^\s+/, '').replace(/\s+$/, '').replace(/\s+/, ' ');
};
$body.find('.dropdown .dropdown-menu li.mainnav-areainsider').click(function () {
$body.trigger('ga.mainnav.areainsider.click');
});
});
</script>
<!-- ELOQUA TRACKING CODE -->
<script type="text/javascript">
var _elqQ = _elqQ || [];
_elqQ.push(['elqSetSiteId', '659351510']);
_elqQ.push(['elqTrackPageView']);
(function () {
function async_load() {
var s = document.createElement('script'); s.type = 'text/javascript'; s.async = true;
s.src = '//img03.en25.com/i/elqCfg.min.js';
var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(s, x);
}
if (window.addEventListener) window.addEventListener('DOMContentLoaded', async_load, false);
else if (window.attachEvent) window.attachEvent('onload', async_load);
})();
</script>
<script defer="" src="/pg186791.js" type="text/javascript">
</script>
<style type="text/css">
#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#weeawqsxdstyxxvz{display:none!important}
</style>
</link>
</link>
</meta>
</meta>
</meta>
</head>
<body class="web_filter_recaptcha SG-web_filter_recaptcha layout-web lang-en app-sg legacy is-new-brand" id="web_filter_recaptcha">
<div id="wrapper-outer">
<div id="wrapper">
<div id="wrapper-inner">
<div class="alert alert-warning" id="gdpr-alert" role="alert" style="margin-bottom: 0; display:none;">
To comply with GDPR we will not store any personally identifiable information from you. Therefore we will serve sub-optimal experience where some features such as Login/Signup are disabled. However, you will be able to search and see all the properties, see agent contact details and contact them offline on your own.
</div>
<header class="navbar navbar-default" id="navbar-main">
<div class="header-bg">
<div class="container">
<nav class="header-nav clearfix" role="navigation">
<div class="navbar-header">
<button class="navbar-toggle" type="button">
<span class="sr-only">
Toggle navigation
</span>
<i class="pgicon pg
<!DOCTYPE doctype html>
<!--[if gt IE 9]><!-->
<html class="no-js is-new-brand" lang="en">
<!--<![endif]-->
<head>.....AND SO ON AND SO FOURTH
It only prints some contents but not the whole html contents.
Upvotes: 0
Views: 1170
Reputation: 555
Beautiful-soup library extracts only the view-source of an web page.
Beautiful-soup library is working fine..
from bs4 import BeautifulSoup
import requests
import time
total_pages = 2295
for i in range(1,total_pages,1):
pageNumber = str(i)
url = requests.get("https://www.propertyguru.com.sg/property-for-sale/"+pageNumber+"?order=desc&property_type=N&property_type_code%5B0%5D=CONDO&property_type_code%5B1%5D=APT&property_type_code%5B2%5D=WALK&property_type_code%5B3%5D=CLUS&property_type_code%5B4%5D=EXCON&sort=date").text
soup = BeautifulSoup(url,'html.parser')
print(soup.prettify())
Upvotes: 0
Reputation: 566
You are using requests library, so it does not loads the javascripts. This website is using API to populate the data which use javascript.
You should try using selenium. Selenium will load the whole page with javascript. Then read the page_source and use beautifulsoup.
Upvotes: 1