Reputation: 1465
I want to scrape the following web page:
https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019
As you can see, the page shows lots of data, yet when I view the page source, the following HTML is all there is for the data of interest. Where is all the data coming from? How can something be displayed that isn't in the HTML?
<div class="Head_W">
<div tabindex="0" tabindex="0" class="Sub_Title">Auctions Waiting</div>
<div class="Fadebar"></div>
<div tabindex="0" class="PageFrame" area="W">
<span class="PageLeft"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
<span tabindex="0" class="PageText">page <input id="curPWA" type="text" curPG="" /> of <span id="maxWA"></span> </span>
<span class="PageRight"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
</div>
<div id="Area_W" class="Auct_Area" ref="Y" arid="W">
<div tabindex="0" class="Loading"></div>
</div>
<div class="Fadebar"></div>
<div tabindex="0" class="PageFrame" area="W">
<span class="PageLeft"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
<span tabindex="0"class="PageText">page <input id="curPWB" type="text" curPG=""/> of <span id="maxWB"></span> </span>
<span class="PageRight"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
</div>
</div>
Upvotes: 0
Views: 1102
Reputation: 12612
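The website https://charlotte.realforeclose.com loads its data with AJAX: the initial HTML contains only placeholders, and JavaScript fills them in with data fetched by separate requests after the page loads. You need to do a bit of reverse engineering to find out how it works.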
Open Chrome, press F12 to open Developer Tools or choose the option from the menu.
Open the Network tab, choose the XHR filter, paste the URL https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019 into the browser address bar and press Enter. Watch the XHRs logged in the Network tab while the page loads. Inspect the XHRs with the largest response sizes first.
Click a request in the list to see its details: the URL, headers and parameters of the request, and the response content.
Since the request method is GET, you can just paste the URLs into the address bar to retrieve the content. The URLs for me are:
https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1&tx=1563171184890&bypassPage=1&test=1&_=1563171184890
https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=C&PageDir=0&doR=1&tx=1563171185129&bypassPage=0&test=1&_=1563171185129
After playing with them a bit, you can easily find that the parameter AREA=W is for the "Auctions Waiting" section, and AREA=C is for the "Auctions Closed or Canceled" section. The parameters tx, bypassPage, test and _ do not seem to be necessary at all.
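To illustrate, here is a minimal Python sketch that removes those four parameters from a captured URL using only the standard library. The helper name strip_optional is hypothetical, and the claim that these parameters are optional rests only on the observation above, not on any documentation of the site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that appeared unnecessary in manual testing (an observation,
# not documented behavior of the site).
OPTIONAL = {"tx", "bypassPage", "test", "_"}

def strip_optional(url):
    """Return the URL with the apparently optional query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in OPTIONAL]
    return urlunsplit(parts._replace(query=urlencode(kept)))

captured = (
    "https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION"
    "&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1"
    "&tx=1563171184890&bypassPage=1&test=1&_=1563171184890"
)
print(strip_optional(captured))
```

The remaining parameters keep their original order, so the result is easy to compare with the URL shown in the Network tab.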
Open the first page with PageDir=0 and doR=1. After that, navigate to the next page with PageDir=1 and doR=0, and to the previous page with PageDir=-1 and doR=0.
The first page https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1
And the next page https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=1&doR=0
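The paging rules can be sketched as a small URL builder in Python. The function name page_url and its direction argument are hypothetical; the parameter values are taken from the URLs observed above.

```python
from urllib.parse import urlencode

BASE = "https://charlotte.realforeclose.com/index.cfm"

def page_url(area, direction):
    """Build the XHR URL for one section of the auction page.

    area:      "W" (Auctions Waiting) or "C" (Auctions Closed or Canceled)
    direction: 0 = first page, 1 = next page, -1 = previous page
    """
    if direction not in (-1, 0, 1):
        raise ValueError("direction must be -1, 0 or 1")
    params = {
        "zaction": "AUCTION",
        "Zmethod": "UPDATE",
        "FNC": "LOAD",
        "AREA": area,
        "PageDir": direction,
        # doR=1 was observed only on the initial load, doR=0 when paging.
        "doR": 1 if direction == 0 else 0,
    }
    return BASE + "?" + urlencode(params)

print(page_url("W", 0))  # first page of "Auctions Waiting"
print(page_url("W", 1))  # next page
```

Note that PageDir is relative (next or previous) rather than an absolute page number, so the server presumably tracks the current page per session. That is an inference from the URLs, not something confirmed here.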
Finally, you just need to reproduce those XHRs from your application and parse the responses. Depending on the HTTP client you use, you may also need to send the necessary headers and handle cookies.
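As a rough sketch of that last step, the following Python code uses only the standard library to create an HTTP client that keeps cookies between requests and sends browser-like headers. The names make_opener and fetch are hypothetical, and the header values are assumptions: copy the real ones from the request details in the Network tab if the site turns out to require them.

```python
import http.cookiejar
import urllib.request

def make_opener(user_agent="Mozilla/5.0"):
    """Create an opener that stores cookies and sends browser-like headers."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Assumed headers; whether the site checks them has not been verified.
    opener.addheaders = [
        ("User-Agent", user_agent),
        ("X-Requested-With", "XMLHttpRequest"),
    ]
    return opener, jar

def fetch(opener, url, timeout=30):
    """GET the URL through the opener and return the decoded response body."""
    with opener.open(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A typical sequence would be to create one opener, fetch the PREVIEW page first so that any session cookies are set, and then fetch the UPDATE URLs with the same opener. How to parse the body depends on the response format you see in the Network tab.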
Upvotes: 1