Hamza Shafique

Reputation: 11

Scraping content of ASP.NET based website (https://www.proxymonitor.org/) using Scrapy

I'm trying to scrape the 2022 results from proxymonitor.org, which is ASP.NET based. I've extracted all the hidden variables on the website and am sending them in a FormRequest, but I'm still receiving an empty table from the server. Any idea what I'm missing?

Here is my code:

    import scrapy
    from scrapy.http import FormRequest
    
    
    class ProxyMonitorSpiderSpider(scrapy.Spider):
        name = 'proxy_monitor_spider'
    
        allowed_domains = ['proxymonitor.org']
        start_urls = [
            'https://www.proxymonitor.org'
        ]
    
        def parse(self, response):
    
            formdata = {
                # response.css('input#__EVENTTARGET::attr(value)').extract_first(),
                '__EVENTTARGET': '',
                '__EVENTARGUMENT': '',
                '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
                '__PREVIOUSPAGE': response.css('input#__PREVIOUSPAGE::attr(value)').extract_first(),
                '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                # 'ctl00_ContentPlaceHolder1_ctrlQuickSearch1_cmbYr1': '2022',
    
    
                # 'DXScript': response.css('input#__DXScript::attr(value)').extract_first(),
    
                # '__CALLBACKID': 'ctl00$ContentPlaceHolder1$ctrlQuickSearch1$CountCallback',
                # '__CALLBACKPARAM': 'c0:',
    
                'ctl00_ContentPlaceHolder1_ctrlQuickSearch1_cmbYr1_VI': '2022',
                'ctl00_ContentPlaceHolder1_ctrlQuickSearch1_cmbYr2_VI': '2022',
    
                'ctl00$ContentPlaceHolder1$ctrlQuickSearch1$cmbYr1': '2022',
                'ctl00$ContentPlaceHolder1$ctrlQuickSearch1$cmbYr1$DDD$L': '2022',
    
    
                'ctl00$ContentPlaceHolder1$ctrlQuickSearch1$cmbYr2': '2022',
                'ctl00$ContentPlaceHolder1$ctrlQuickSearch1$cmbYr2$DDD$L': '2022',
            }
            # print('*** form data',formdata)
            req = scrapy.FormRequest.from_response(
                response, url='https://www.proxymonitor.org/Results.aspx', formdata=formdata, callback=self.parse2)
    
            yield req
    
        def parse2(self, response):
            print('*** status:', response.status)
            with open('response2.html', 'w') as html_file:
                html_file.write(response.text)
    
            for row in response.xpath('//*[@class="dxgvTable_Office2010Silver"]//tbody//tr[position() = 2]'):
                yield {
                    'resolution_name': row.xpath('td[2]//text()').extract_first(),
                    'agm_date': row.xpath('td[3]//text()').extract_first(),
                    'company': row.xpath('td[4]//text()').extract_first(),
                    'lead_filer': row.xpath('td[5]//text()').extract_first(),
                    'status': row.xpath('td[6]//text()').extract_first(),
                }

Upvotes: 1

Views: 83

Answers (1)

Albert D. Kallal

Reputation: 49089

But what about all the many possible variables in the web page's code-behind, written in C# or whatever? There can be a boatload of server-side values and variables. Unless you start from the login or start page, there is a HUGE chance that some values and setup variables in the code-behind (written in C# or VB.NET) are not being set correctly by your requests. You only control the client-side values; you have ZERO means to set the server-side values, and they ONLY exist server side. That includes both code variables and Session() values. ViewState is client side, but it is not always enough.

The code-behind is full .NET code in C# or VB.NET. It can have a boatload of variables and values that have to be set up 100% correctly.

If you can JUST jump to that URL with your browser, click on buttons etc., and the page functions correctly, then you probably can do this. However, if there are several previous pages you have to navigate through to GET TO this current page, then you have to load those previous pages first, click the correct buttons, and navigate to the current page.

In many of my pages, I have a good number of code behind class objects, and that's how I drive the code behind. And in many cases, if a value is missing, I jump back to the previous page.

So, I might have a page where you select a project to work on. When you select that project, I create a class with about 10 values that I need for the next page.

As a result, you can't "skip" previous pages. And my code often checks for incorrect or missing values.

So, I might have this:

    If IsPostBack = False Then

        If Session("PKID") Is Nothing Then
            ' JUMP to Issues list
            Response.Redirect("Issues")
        End If
        If Session("PKID") = 0 Then
            Response.Redirect("Issues")
        End If

    End If

In the above, the database PK row id is NEVER exposed client side. As a result, you can never pull data, and NEVER make this page work, UNLESS you were on the previous page and clicked on a row to edit, so that the Session() value gets set and I then jump to the next web page. You can't touch or control Session. And the above is a simple example. Often there is a set of 10+ variables in code that is the result of the previous 2-3 pages of navigation.
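To make the session point concrete: the only piece of Session the client ever holds is the session cookie (by default ASP.NET names it `ASP.NET_SessionId`); the values themselves live on the server. Here is a minimal sketch, using only the Python standard library, of a client that at least keeps that cookie across requests. `make_client` and `fetch` are names I made up for illustration, not part of any library.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_client():
    """Build a fetch function whose requests all share one cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    def fetch(url, data=None):
        # data=None sends a GET; a dict is form-encoded and sent as a POST.
        body = urllib.parse.urlencode(data).encode() if data is not None else None
        with opener.open(url, body, timeout=30) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, "replace")

    # The jar is returned too, so you can inspect which cookies the server set.
    return fetch, jar
```

Scrapy does the same by default through its cookies middleware, so the cookie is usually not the missing piece. The missing piece is the server-side state that the earlier pages set up under that cookie.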

And, while view state is client side, it is Base64-encoded and, by default, signed (and it can be encrypted as well), so changing those values can also mess up the page and the code-behind.

This makes web scraping of ASP.NET applications VERY difficult, since in most cases you can't JUST hit one page, but have to click and navigate through the previous pages to GET to the current page, and that is the ONLY way to ensure that the many code-behind values and variables are correctly set up. That setup occurs 100% server side; those values and that code are never exposed client side.

If you can jump directly to that one given URL in a web browser, and the page works 100% correctly, then you can achieve your goal. However, if the page you are on requires previous navigation to get to it, then you have to reproduce those steps to GET to that given page, and after that, automating it, entering values into controls, and clicking on buttons can and should work.
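As a rough sketch of what reproducing those steps looks like: load the first page, then for every step post back ALL of the fields that page handed you plus the values a user would have changed, and feed each response into the next step. This is standard-library Python only; `postback_fields`, `replay_navigation` and the `fetch` callable are hypothetical names, and `fetch` is assumed to keep cookies between calls.

```python
from html.parser import HTMLParser

# Buttons are skipped: a browser only submits the one that was clicked.
_BUTTON_TYPES = ("submit", "button", "image", "reset")


class _InputCollector(HTMLParser):
    """Collects name/value pairs of <input> elements in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        kind = (attr.get("type") or "text").lower()
        # Simplification: unchecked checkboxes/radios are not filtered out.
        if name and kind not in _BUTTON_TYPES:
            self.fields[name] = attr.get("value") or ""


def postback_fields(html, overrides=None):
    """Return the fields to echo back on the next postback."""
    collector = _InputCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    fields.update(overrides or {})  # values the user would have changed
    return fields


def replay_navigation(fetch, start_url, steps):
    """Replay a user's path page by page and return the last page's HTML.

    fetch(url, data) is assumed to GET when data is None and POST otherwise.
    steps is a list of (url, overrides) pairs, one per postback.
    """
    html = fetch(start_url, None)
    for url, overrides in steps:
        html = fetch(url, postback_fields(html, overrides))
    return html
```

In Scrapy, `FormRequest.from_response` already pre-fills the form's fields from the response, so the equivalent there is one callback per page, each yielding a `from_response` request for the next step.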

So, no, you have not extracted all the hidden variables, since the code-behind variables and objects are never exposed client side, and the earlier pages will often set up variables and values in the code-behind before you click any button or attempt any other operation on this web page.

If it were not for the .NET Framework and the full .NET code running behind this web page, this would be trivial. But the power and ease of writing nice .NET code is also what makes scraping some ASP.NET pages very difficult, since you are attempting to interact with server-side .NET code, and you don't have access to that code, nor can you change the values used in the code routines behind.

Just keep the above "previous navigation" of pages in mind.

So if you can just jump and hit that web page (in a browser), and it works, then you still have a good chance here, but you can't skip ANY steps that a regular user would take to get to that page and to the point in which they hit a button or some such for the page to return data.

Upvotes: 1
