reyman64

Reputation: 553

Generating correct hidden-input form values in Scrapy for the ASP __doPostBack() function

tl;dr: My attempts to overwrite the hidden fields (the __EVENTTARGET attributes) that the server needs in order to return a new page of geocaches failed, so the server returns an empty page.

PS: My original post was closed by vote, so I am reposting here after heavily editing the first post.


I am trying to scrape web pages that list caches on a well-known geocaching site, using Scrapy 1.5.0.

Because you need an account to run this code, I created a temporary free account on the website for testing: dumbuser, with password stackoverflow


A) The working part of the process:

The first search works without problems, and I have no difficulty parsing the first geocaches.

B) The problem part of the process: requesting the next pages

The problem appears when I try to simulate a click to go to the next page of geocaches, for example from page 1 to page 2.

[Image: pager with numbered page links]

The website uses ASP.NET with state synchronised between client and server, so during the scrape we need to visit page 1, then page 2, then page 3, and so on, in order to carry along the __VIEWSTATE variable (a hidden input) that the server generates on each form submission.

Each page number (see the image) is a link that calls the JavaScript function javascript:__doPostBack(...), which writes values into already existing hidden fields before submitting the entire form.

As you can see in the __doPostBack function:

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>

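Example: when you click on the page 2 link, the JavaScript that runs is javascript:__doPostBack('ctl00$ContentBody$pgrTop$lbGoToPage_2',''). The form is then submitted with __EVENTTARGET set to 'ctl00$ContentBody$pgrTop$lbGoToPage_2' and __EVENTARGUMENT set to an empty string.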
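To make that mapping concrete, here is a minimal sketch (not the site's code) that turns a pager link's href into the two hidden fields __doPostBack would fill in. The helper name postback_fields and the regular expression are my own assumptions, and it only handles the simple single-quoted, two-argument form shown above.

```python
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def postback_fields(href):
    """Return the hidden form fields a __doPostBack(...) link would set.

    Hypothetical helper: returns None when href is not a postback link.
    """
    match = _POSTBACK_RE.search(href)
    if match is None:
        return None
    target, argument = match.groups()
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


href = "javascript:__doPostBack('ctl00$ContentBody$pgrTop$lbGoToPage_2','')"
fields = postback_fields(href)
print(fields)
```

The resulting dictionary has the same shape as the formdata argument used further down, so it could be passed along with the other overrides.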

C) First attempt to imitate this behavior:

To scrape several pages (limited here to the first five), I yield five FormRequest.from_response requests that simply overwrite the __EVENTTARGET and __EVENTARGUMENT attributes manually:

def parse_pages(self, response):

    self.parse_cachesList(response)

    # Extract the number of pages
    links = response.xpath('//td[@class="PageBuilderWidget"]/span/b[3]')
    print(links.extract_first())

    # Try to request pages 1 to 5, for example
    for page in range(1, 6):
        yield scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='aspnetForm']",
            formdata={
                '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_' + str(page),
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
            },
            dont_click=True,
            callback=self.parse_cachesList,
            dont_filter=True,
        )

D) Consequence:

The page returned by the server is empty, so something is wrong with my strategy.

When I look at the HTML returned by the server after the form POST, __EVENTTARGET is never overwritten by Scrapy:

<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>
<input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/>
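One caveat about this check: as far as I know, ASP.NET renders __EVENTTARGET and __EVENTARGUMENT with empty values on every page it serves, so the response HTML may not show what was actually posted. A more direct check is to decode the body of the request that was sent. The sketch below uses only the standard library; decode_form_body and the sample body are made up for illustration, and in Scrapy the sent body should be available as response.request.body.

```python
from urllib.parse import parse_qs


def decode_form_body(body):
    """Decode an application/x-www-form-urlencoded body into {name: [values]}."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    # keep_blank_values=True so fields sent with an empty value stay visible.
    return parse_qs(body, keep_blank_values=True)


# Made-up sample body, for illustration only.
sample = (
    b"__EVENTTARGET=ctl00%24ContentBody%24pgrTop%24lbGoToPage_2"
    b"&__EVENTARGUMENT=&__LASTFOCUS="
)
sent = decode_form_body(sample)
print(sent["__EVENTTARGET"])
```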

E) Question:

Could you help me understand why Scrapy doesn't replace/overwrite the __EVENTTARGET value here? Where is the flaw in my strategy for simulating a user who clicks through each new page?

The complete code is downloadable here: code


UPDATE 1:

Using Fiddler, I finally found that the problem is linked to one input: ctl00$ContentBody$chkAll=Check All. This input is copied automatically by the scrapy.FormRequest.from_response method. If I remove this attribute from the POST request, it works. So how can I remove this field? I tried setting it to an empty string, without success:

result = scrapy.FormRequest.from_response(
    response,
    formname="aspnetForm",
    formxpath="//form[@id='aspnetForm']",
    formdata={
        'ctl00$ContentBody$chkAll': '',
        '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_2',
    },
    dont_click=True,
    callback=self.parse_cachesList,
    dont_filter=True,
    meta={'proxy': 'http://localhost:8888'},
)
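A likely reason the empty string does not help: in form encoding, an empty value is still a value, so the field name is still sent. The sketch below shows this with the standard library only. It does not go through Scrapy, so it illustrates form encoding in general rather than what Scrapy does internally.

```python
from urllib.parse import urlencode

# An empty string still produces "name=" in the encoded body,
# so the server still sees the field.
body = urlencode({
    "ctl00$ContentBody$chkAll": "",
    "__EVENTTARGET": "ctl00$ContentBody$pgrTop$lbGoToPage_2",
})
print(body)
```

The chkAll name is still present in the printed body, just with nothing after the "=" sign.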

Upvotes: 2

Views: 1499

Answers (1)

reyman64

Reputation: 553

Solved with a lot of patience, and with the Fiddler tool to debug and resend the POST request to the server!

As update 1 in my original question says, the problem comes from the input ctl00$ContentBody$chkAll in the form.

Removing an input from the POST form sent by FormRequest is simple; I found how in the commit here. Set the attribute to None in the formdata dictionary.

    result = scrapy.FormRequest.from_response(
        response,
        formname="aspnetForm",
        formxpath="//form[@id='aspnetForm']",
        formdata={
            'ctl00$ContentBody$chkAll': None,
            '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_2',
        },
        dont_click=True,
        callback=self.parse_cachesList,
        dont_filter=True,
    )
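To show what the None value changes, here is a small pure-Python model of the override behavior. It is an illustration only, not Scrapy's implementation; merge_formdata, page_overrides and the sample form fields are hypothetical names and data of my own.

```python
def merge_formdata(form_fields, overrides):
    """Illustrative model (not Scrapy's code) of overriding form fields.

    Fields from the HTML form are kept unless overridden; an override
    of None drops the field from the POST entirely.
    """
    merged = dict(form_fields)
    for name, value in overrides.items():
        if value is None:
            merged.pop(name, None)
        else:
            merged[name] = value
    return merged


def page_overrides(page):
    """Hypothetical helper: formdata overrides for jumping to a pager page."""
    return {
        "ctl00$ContentBody$chkAll": None,  # remove the field from the POST
        "__EVENTTARGET": "ctl00$ContentBody$pgrTop$lbGoToPage_%d" % page,
        "__EVENTARGUMENT": "",
    }


# Made-up fields standing in for what the page's <form> contains.
form_fields = {
    "__VIEWSTATE": "dummy-viewstate",
    "__EVENTTARGET": "",
    "__EVENTARGUMENT": "",
    "ctl00$ContentBody$chkAll": "Check All",
}
posted = merge_formdata(form_fields, page_overrides(2))
print(sorted(posted))
```

With Scrapy, page_overrides(n) would be passed as formdata= to FormRequest.from_response, called on the response of the previous page so that the current __VIEWSTATE is carried along.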

Upvotes: 3
