Reputation: 553
TL;DR: My attempts to overwrite the hidden fields (__EVENTTARGET and __EVENTARGUMENT) that the server needs in order to return a new page of geocaches failed, so the server returns an empty page.
PS: My original post was closed by vote, so I am reposting here after the massive edit I made to the first post.
I am trying to scrape web pages listing caches on a well-known geocaching site, using Scrapy 1.5.0.
Because you need an account to run this code, I created a new temporary, free account on the website for testing: dumbuser, with password stackoverflow.
A) The part of the process that works:
Login page: https://www.geocaching.com/account/login
Search used: France, Haute-Normandie.
This first search works without problem, and I have no difficulty parsing the first geocaches.
B) The problem part of the process: requesting the next pages
The problem appears when I try to simulate a click to go to the next page of geocaches, for example going from page 1 to page 2.
The website uses ASP.NET with state synchronised between client and server, so the scrape has to visit page 1, then page 2, then page 3, and so on, in order to maintain the __VIEWSTATE variable (a hidden input) generated by the server between each form submission.
Each page-number link calls a JavaScript function, javascript:__doPostBack(...), which injects content into already existing hidden fields before submitting the entire form.
Here is the __doPostBack function:
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>
Example:
When you click on the page 2 link, the JavaScript that runs is javascript:__doPostBack('ctl00$ContentBody$pgrTop$lbGoToPage_2',''). The form is then submitted with:
__EVENTTARGET = ctl00$ContentBody$pgrTop$lbGoToPage_2
__EVENTARGUMENT = ''
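The mapping from a pager link to the two hidden fields can be sketched in a few lines of Python. This is only an illustration: the helper name parse_postback and the regular expression are assumptions of mine, not part of the site or of Scrapy, and the pattern only handles the simple single-quoted form shown in the example.

```python
import re

# Matches hrefs such as javascript:__doPostBack('target','argument').
# Assumption: both arguments are single-quoted and contain no escaped quotes.
_POSTBACK_RE = re.compile(
    r"__doPostBack\(\s*'(?P<target>[^']*)'\s*,\s*'(?P<argument>[^']*)'\s*\)"
)


def parse_postback(href):
    """Return the hidden-field values __doPostBack would set, or None."""
    match = _POSTBACK_RE.search(href)
    if match is None:
        return None
    return {
        "__EVENTTARGET": match.group("target"),
        "__EVENTARGUMENT": match.group("argument"),
    }


href = "javascript:__doPostBack('ctl00$ContentBody$pgrTop$lbGoToPage_2','')"
fields = parse_postback(href)
```

The resulting dictionary has the same shape as the formdata argument that FormRequest.from_response accepts, so the values read from a link can be passed on instead of being hard-coded.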
C) First attempt to imitate this behavior:
To scrape several pages (limited here to the first five), I yield one FormRequest.from_response request per page, each of which simply overwrites the __EVENTTARGET and __EVENTARGUMENT fields manually:
def parse_pages(self, response):
    self.parse_cachesList(response)

    # Extract the number of pages
    links = response.xpath('//td[@class="PageBuilderWidget"]/span/b[3]')
    print(links.extract_first())

    # Try to extract the first pages, for example (range(1, 5) covers pages 1 to 4)
    for page in range(1, 5):
        yield scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='aspnetForm']",
            formdata={
                '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_' + str(page),
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
            },
            dont_click=True,
            callback=self.parse_cachesList,
            dont_filter=True,
        )
D) Consequence:
The page returned by the server is empty, so something is wrong in my strategy.
When I look at the HTML returned by the server after the form post, __EVENTTARGET never seems to have been overwritten by Scrapy:
<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>
<input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/>
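The hidden inputs in the returned HTML show what the server rendered, which is not necessarily what was posted. A minimal sketch of how the fields that were actually sent could be inspected, assuming the POST body is a standard urlencoded form; the helper name posted_fields and the sample body are illustrative only:

```python
from urllib.parse import parse_qsl


def posted_fields(body):
    """Decode a urlencoded POST body (bytes or str) into a dict of fields.

    Blank values are kept so that fields such as __EVENTARGUMENT='' remain
    visible. If a field name is repeated, only its last value is kept.
    """
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return dict(parse_qsl(body, keep_blank_values=True))


# Illustrative body, not captured from the real site.
sample_body = (
    b"__EVENTTARGET=ctl00%24ContentBody%24pgrTop%24lbGoToPage_2"
    b"&__EVENTARGUMENT=&ctl00%24ContentBody%24chkAll=Check+All"
)
sent = posted_fields(sample_body)
```

Inside a Scrapy callback, the body of the request that produced the response is available as response.request.body, so posted_fields(response.request.body) would list every field submitted with the form, including the ones copied automatically from the page.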
E) Question:
Could you help me understand why Scrapy doesn't replace/overwrite the __EVENTTARGET value here? Where is the problem in my strategy for simulating a user who clicks through each new page?
Complete code is downloadable here: code
UPDATE 1:
Using Fiddler, I finally found that the problem is linked to one input: ctl00$ContentBody$chkAll=Check All
This input is automatically copied by the scrapy.FormRequest.from_response method. If I remove this field from the POST request, it works. So how can I remove this field? I tried setting it to an empty string, without success:
result = scrapy.FormRequest.from_response(
    response,
    formname="aspnetForm",
    formxpath="//form[@id='aspnetForm']",
    formdata={
        'ctl00$ContentBody$chkAll': '',
        '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_2',
    },
    dont_click=True,
    callback=self.parse_cachesList,
    dont_filter=True,
    meta={'proxy': 'http://localhost:8888'}
)
Upvotes: 2
Views: 1499
Reputation: 553
Solved with a lot of patience, and the Fiddler tool to debug and resend the POST request to the server!
As Update 1 in my original question says, the problem comes from the input ctl00$ContentBody$chkAll in the form.
The way to remove an input from the POST form sent by FormRequest is simple; I found it in the commit here. Set the field to None in the formdata dictionary.
result = scrapy.FormRequest.from_response(
    response,
    formname="aspnetForm",
    formxpath="//form[@id='aspnetForm']",
    formdata={
        'ctl00$ContentBody$chkAll': None,
        '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_2',
    },
    dont_click=True,
    callback=self.parse_cachesList,
    dont_filter=True
)
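To reuse the same fix for any page number, the formdata dictionary can be built by a small helper. This is a sketch only: postback_formdata and its drop parameter are hypothetical names, and the field names are the ones quoted in the question.

```python
def postback_formdata(page, drop=("ctl00$ContentBody$chkAll",)):
    """Build the formdata dict for a postback to the given page number.

    Fields listed in `drop` are set to None, which asks
    FormRequest.from_response to leave them out of the POST body.
    """
    formdata = {
        "__EVENTTARGET": "ctl00$ContentBody$pgrTop$lbGoToPage_%d" % page,
        "__EVENTARGUMENT": "",
    }
    for name in drop:
        formdata[name] = None
    return formdata


formdata = postback_formdata(3)
```

The returned dictionary could then be passed as formdata=postback_formdata(page) in the call above. Because the site keeps its state in __VIEWSTATE, it is probably safer to build each request from the response of the previous page rather than issuing all of them from the first response.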
Upvotes: 3