MITHU

Reputation: 164

Can't scrape a different result for each search; I get the same single result multiple times

I've written a script to parse the linked name that appears after filling in two input boxes on a webpage (first name and last name) with values taken from a CSV file. The CSV file contains thousands of names, and I'm trying to scrape the linked name for each of them.

The problem is that the spider always scrapes the linked name for the last entry in the file.

How can I scrape the individual linked name associated with each search?

Here are a few of the first and last names, in file order, for your consideration [taken from the csv file]:

ANTONIO AMADOR      ACOSTA 
JOHN ROBERT         ADAIR 
ROBERT CURTIS       ADAMEK 
CY RITCHIE          ADAMS 

I've tried like this:

import csv
import scrapy
from scrapy.crawler import CrawlerProcess

class AmsrvsSpider(scrapy.Spider):
    name = "amsrvsSpiderscript"
    lead_url = "https://amsrvs.registry.faa.gov/airmeninquiry/Main.aspx"

    def start_requests(self):
        with open("document.csv","r") as f:
            reader = csv.DictReader(f)
            itemlist = [item for item in reader]

        for item in itemlist:
            yield scrapy.Request(self.lead_url,meta={"fname":item['FIRST NAME'],"lname":item['LAST NAME']},dont_filter=True, callback=self.parse)

    def parse(self,response):
        fname = response.meta.get("fname")
        lname = response.meta.get("lname")
        payload = {item.css('::attr(name)').get(default=''):item.css('::attr(value)').get(default='') for item in response.css("input[name]")}
        payload['ctl00$content$ctl01$txtbxFirstName'] = fname
        payload['ctl00$content$ctl01$txtbxLastName'] = lname
        payload.pop('ctl00$content$ctl01$btnClear')
        yield scrapy.FormRequest(self.lead_url,formdata=payload,dont_filter=True,callback=self.parse_content)

    def parse_content(self,response):
        name = response.css("a[id$='lnkbtnAirmenName']::text").get()
        print(name)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'DOWNLOAD_TIMEOUT' : 5,
        'LOG_LEVEL':'ERROR'
    })
    c.crawl(AmsrvsSpider)
    c.start()

This is what the result looks like on that site:

(image)

Current output:

CY RITCHIE  ADAMS 
CY RITCHIE  ADAMS 
CY RITCHIE  ADAMS 
CY RITCHIE  ADAMS 

Expected output:

ANTONIO AMADOR  ACOSTA 
JOHN ROBERT     ADAIR 
ROBERT CURTIS   ADAMEK 
CY RITCHIE      ADAMS 

Upvotes: 4

Views: 400

Answers (4)

Ahmed Buksh

Reputation: 161

Your URL is being filtered after the first request, which is why the other searches are not scraped. You can set dont_filter=True on the request so that duplicate URLs are not dropped. You can refer to this for further clarification.

Upvotes: 0

MITHU

Reputation: 164

This is how I got it to work. It turns out that cookies play a vital role here, so they have to be handled correctly to get the desired output: each form request is given its own cookie session through the cookiejar meta key. I also set COOKIES_ENABLED = True in settings.py.

Working script:

import csv
import scrapy

def get_fields():
    # Read every row of the CSV into a list of dicts keyed by the header row.
    with open("brian.csv","r") as f:
        reader = csv.DictReader(f)
        itemlist = [item for item in reader]
    return itemlist

class AmsrvsSpider(scrapy.Spider):
    name = "amsrvs"
    lead_url = 'https://amsrvs.registry.faa.gov/airmeninquiry/Main.aspx'
    start_urls = ['https://amsrvs.registry.faa.gov/airmeninquiry/Main.aspx']

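    def parse(self,response):
        # Collect every named form input (including the hidden ASP.NET state fields) as the base payload.
        payload = {item.css('::attr(name)').get(default=''):item.css('::attr(value)').get(default='') for item in response.css("input[name]")}
        # Drop the Clear button's field so it is not submitted with the search.
        payload.pop('ctl00$content$ctl01$btnClear')

        for i,item in enumerate(get_fields()):
            payload['ctl00$content$ctl01$txtbxFirstName'] = item['FIRST NAME']
            payload['ctl00$content$ctl01$txtbxLastName'] = item['LAST NAME']
            # A distinct cookiejar key per request keeps each search in its own cookie session.
            yield scrapy.FormRequest(self.lead_url,formdata=payload,meta={'cookiejar': i},dont_filter=True,callback=self.parse_result)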

    def parse_result(self,response):
        item_content = response.css("[id$='lnkbtnAirmenName']::text").get()
        print(item_content)
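
The per-search cookie session idea can be looked at in isolation. The following is only a sketch, not part of the spider above: build_search_jobs, FIRST_KEY and LAST_KEY are hypothetical names introduced for illustration, and the function only prepares the (formdata, meta) pairs that would be handed to scrapy.FormRequest. It assumes the rows are dicts with 'FIRST NAME' and 'LAST NAME' keys, as in the script.

```python
# Form field names copied from the script above.
FIRST_KEY = "ctl00$content$ctl01$txtbxFirstName"
LAST_KEY = "ctl00$content$ctl01$txtbxLastName"


def build_search_jobs(base_payload, rows):
    """Yield one (formdata, meta) pair per CSV row.

    Hypothetical helper: each pair gets its own copy of the payload and
    its own 'cookiejar' key, so no two searches share a cookie session.
    """
    for index, row in enumerate(rows):
        payload = dict(base_payload)  # copy, so the base payload is never mutated
        payload[FIRST_KEY] = row["FIRST NAME"]
        payload[LAST_KEY] = row["LAST NAME"]
        yield payload, {"cookiejar": index}


# Made-up base payload and rows, for illustration only.
base = {"__VIEWSTATE": "state"}
rows = [
    {"FIRST NAME": "JOHN ROBERT", "LAST NAME": "ADAIR"},
    {"FIRST NAME": "CY RITCHIE", "LAST NAME": "ADAMS"},
]
jobs = list(build_search_jobs(base, rows))
```

Inside a spider, each pair would presumably be used as scrapy.FormRequest(url, formdata=payload, meta=meta, ...), which is what the loop in parse() does inline.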

Upvotes: 3

Guy

Reputation: 50949

I'm not sure what exactly went wrong, but you can simplify your code by skipping parse() and posting the form directly from start_requests():

def start_requests(self):
    with open("document.csv", "r") as f:
        reader = csv.DictReader(f)
        itemlist = [item for item in reader]

    for item in itemlist:
        payload = {'__LASTFOCUS': '',
                   '__VIEWSTATE': '',
                   '__VIEWSTATEGENERATOR': 'B59A47DA',
                   '__EVENTTARGET': '',
                   '__EVENTARGUMENT': '',
                   '__VIEWSTATEENCRYPTED': '',
                   '__EVENTVALIDATION': '',
                   'typAirmenInquiry': '3487',
                   'ctl00$content$ctl01$txtbxLastName': item["LAST NAME"],
                   'ctl00$content$ctl01$txtbxCertNo': '',
                   'ctl00$content$ctl01$txtbxFirstName': item["FIRST NAME"],
                   'ctl00$content$ctl01$txtbxSearchBirthYear': '',
                   'ctl00$content$ctl01$txtbxCity': '',
                   'ctl00$content$ctl01$btnSearch': 'Search',
                   'hor': 'horizontal',
                   'vert': 'vertical'}

        yield scrapy.FormRequest(self.lead_url, formdata=payload, dont_filter=True, callback=self.parse_content)

def parse_content(self, response):
    name = response.css("a[id$='lnkbtnAirmenName']::text").get()
    print(name)

Output:

ANTONIO AMADOR  ACOSTA  
ROBERT CURTIS  ADAMEK  
CY RITCHIE  ADAMS  
JOHN ROBERT  ADAIR  

Upvotes: 0

Calimocho

Reputation: 378

First of all, I believe the issue comes from how you read the CSV with DictReader and convert it to a list: with no header row in the file, the first row of names is used as the field names.

Printing itemlist gives:

[OrderedDict([('ANTONIO AMADOR', 'JOHN ROBERT'), (' ACOSTA ', ' ADAIR ')]), OrderedDict([('ANTONIO AMADOR', 'ROBERT CURTIS'), (' ACOSTA ', ' ADAMEK ')]), OrderedDict([('ANTONIO AMADOR', 'CY RITCHIE'), (' ACOSTA ', ' ADAMS ')])]

I don't think this is what you intended. To remedy this, I have used csv.reader(f), which reads each row as an [fname, lname] list.
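
The difference between the two readers can be reproduced without touching the site. This is a minimal sketch, assuming a headerless file; SAMPLE, read_with_dictreader and read_with_reader are made-up names for illustration.

```python
import csv
import io

# Hypothetical headerless sample mirroring the names in the question.
SAMPLE = "ANTONIO AMADOR,ACOSTA\nJOHN ROBERT,ADAIR\nROBERT CURTIS,ADAMEK\n"


def read_with_dictreader(text):
    """DictReader uses the first row as field names when none are given."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_with_reader(text):
    """csv.reader returns every row, including the first, as a list of cells."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


dict_rows = read_with_dictreader(SAMPLE)
plain_rows = read_with_reader(SAMPLE)

# The first name pair is swallowed as a header by DictReader.
print(len(dict_rows), len(plain_rows))
```

With DictReader the first person never becomes a row, and lookups such as item['FIRST NAME'] find no matching key, which is why the plain reader is used below.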

I have also made some other changes. As far as I can tell, there is no need to request the form page again and again, so it is requested only once, and a form submission is then made for each name in the list.

Finally, I also changed the whitespace, just to help with my own readability.

import csv
import scrapy
from scrapy.crawler import CrawlerProcess


class AmsrvsSpider(scrapy.Spider):
    name = "amsrvsSpiderscript"
    lead_url = "https://amsrvs.registry.faa.gov/airmeninquiry/Main.aspx"
    start_urls = ["https://amsrvs.registry.faa.gov/airmeninquiry/Main.aspx"]

    def parse(self, response):
        with open("document.csv", "r") as f:
            split_names = list(csv.reader(f))
        print(split_names)

        default_pload = {item.css('::attr(name)').get(default=''):
                         item.css('::attr(value)').get(default='')
                         for item in response.css("input[name]")
                         }
        default_pload.pop('ctl00$content$ctl01$btnClear')

        for fname, lname in split_names:
            payload = dict(default_pload)
            payload['ctl00$content$ctl01$txtbxFirstName'] = fname
            payload['ctl00$content$ctl01$txtbxLastName'] = lname
            print(fname, lname)
            yield scrapy.FormRequest(self.lead_url,
                                     formdata=payload,
                                     dont_filter=True,
                                     callback=self.parse_content
                                     )

    def parse_content(self, response):
        name = response.css("a[id$='lnkbtnAirmenName']::text").get()
        print(name)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'DOWNLOAD_TIMEOUT': 5,
        'LOG_LEVEL': 'ERROR'
    })
    c.crawl(AmsrvsSpider)
    c.start()

This resolves the issue of not generating a request for each name, and it retrieves the linked name you required.

Note that the document.csv I used for testing is as follows:

ANTONIO AMADOR,ACOSTA 
JOHN ROBERT,ADAIR 
ROBERT CURTIS,ADAMEK 
CY RITCHIE,ADAMS 
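
If you would rather keep the item['FIRST NAME'] / item['LAST NAME'] lookups from the question, one alternative for a headerless file like this is to pass the field names to DictReader explicitly. This is a sketch under that assumption; read_names and FIELDNAMES are hypothetical names, and stripping the values is my own addition because the rows above carry trailing spaces.

```python
import csv
import io

# Column names taken from the question's code; the file itself has no header row.
FIELDNAMES = ["FIRST NAME", "LAST NAME"]


def read_names(handle):
    """Read a headerless two-column CSV into dicts with stripped values."""
    reader = csv.DictReader(handle, fieldnames=FIELDNAMES)
    return [
        {key: (row.get(key) or "").strip() for key in FIELDNAMES}
        for row in reader
    ]


# In-memory stand-in for document.csv, trailing spaces included.
sample = io.StringIO("ANTONIO AMADOR,ACOSTA \nJOHN ROBERT,ADAIR \n")
names = read_names(sample)
```

With a real file, the same function would be called as read_names(open("document.csv", newline="")) inside a with block.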

Upvotes: 0
