Umair Ayub

Reputation: 21201

Cannot implement recursion with Python Scrapy

Please pardon my limited knowledge of Scrapy. I have been doing data scraping for the past 3 years or so using PHP and Python with BeautifulSoup, but I am new to Scrapy.

I have Python 2.7 and the latest Scrapy.

I have a requirement where I need to scrape http://www.dos.ny.gov/corps/bus_entity_search.html, which shows paginated results.

My requirement is that if a search returns more than 500 results (for example, "AME" returns more than 500 results), then the code should search for "AMEA" to "AMEZ"; and if "AMEA" still returns more than 500 results, it should search "AMEAA", and so on, recursively.
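To make the requirement concrete, here is a rough plain-Python sketch of just that expansion logic, independent of Scrapy. count_results() here is only a placeholder for "run the search and see how many entities come back", so this is not runnable against the site as-is:

ALPHABET = [chr(i) for i in range(65, 65 + 26)]  # 'A' .. 'Z'

def expand(prefix, count_results, limit=500):
    """Yield every refined prefix whose search fits within the limit.

    count_results(prefix) is a placeholder for "run the search and
    return how many entities it matched".
    """
    if count_results(prefix) > limit:
        # Too broad: refine by appending each letter and recurse.
        for letter in ALPHABET:
            for narrower in expand(prefix + letter, count_results, limit):
                yield narrower
    else:
        # Narrow enough: this prefix's results can be scraped in full.
        yield prefix

That is the behaviour I am trying to reproduce in the spider below.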

But it is giving me unexpected results. Here is the crawler code.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request

import urllib
from appext20.items import Appext20Item
from scrapy.selector import HtmlXPathSelector

class Appext20Spider(CrawlSpider):
    name = "appext20"

    allowed_domains = ["appext20.dos.ny.gov"]

    # p_entity_name means Keyword to search
    payload = {"p_entity_name": '', "p_name_type": 'A', 'p_search_type':'BEGINS'}

    url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'

    search_characters = ["A","B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z"," "]

    construction_keywords = ['Carpenters','Carpentry','Plastering','Roofers','Roofing','plumbing','remodelling','remodeling','Tiling','Painting','Rendering','Electrical','Plumber','contracting ','contractor','construction','Waterproofing','Landscaping','Bricklaying','Cabinet Maker','Flooring','carpenters','electricians','restoration','drywall','renovation','renovating ','remodels ','framing','Masonry','builders','Woodwork','Cabinetry','Millwork','Electric','plastering','painters','painting','HVAC','Labouring','Fencing','Concreting','Glass','AC','Heating','glazier ','air duct','tiles','deck','Guttering','Concrete','Demolition','Debris','Dumpster','Cabinet','Junk','stucco','general contract','home improvement','home repair','home build','homes','building maintenance','masons','siding','kitchens','paving','landscapers','landscapes','design & build','design build','design and build']

    search_keywords = ['']


    def start_requests(self):
        # create keywords combo
        for char in self.search_characters:
          for char2 in self.search_characters:
            for char3 in self.search_characters:
              self.search_keywords.extend([char+char2+char3])

        # now start requests
        for keyword in self.search_keywords:
            self.payload['p_entity_name'] = keyword
            print ('this is keyword '+ keyword)
            # parse_data() is my callback func
            yield FormRequest(self.url, formdata= self.payload, callback=self.parse_data)


    def parse_data(self, response):
        ads_on_page = Selector(response).xpath("//td[@headers='c1']")

        # get that message to see how many results this keyword returned.
        # if it returns more than 500, then page shows "More than 500 entities were found. Only the first 500 entities will be displayed."
        try:
            results = Selector(response).xpath("//center/p/text()").extract()[0]
        except Exception,e:
            results = ''

        all_links = []
        for tr in ads_on_page:
            temp_dict = {}
            temp_dict['title'] = tr.xpath('a/text()').extract()
            temp_dict['link'] = tr.xpath('a/@href').extract()
            temp_dict['p_entity_name'] = self.payload['p_entity_name']
            temp_dict['test'] = results
            yield temp_dict

        # check if has next page 
        try:
            next_page = Selector(response).xpath("//a[text()='Next Page']/@href").extract()
            next_page = 'https://appext20.dos.ny.gov/corp_public/' + next_page[0]

            next_page_text = Selector(response).xpath("//a[text()='Next Page']/@href/text()").extract()

            # if it has more than 1 page, then do recursive calls to search
            # I.E: "AME" returns more than 500 results, then code should search for "AMEA" to "AMEZ" 
            # and for "AMEA" if it still returns more than 500 results then search "AMEAA" and so on recursively
            if next_page_text == 2:
                if "More than 500 entities were found" in results:
                    # search through "A" to "Z"
                    for char3 in self.search_characters:
                        self.payload['p_entity_name'] = self.payload['p_entity_name'] + char3
                        print ('THIS is keyword '+ self.payload['p_entity_name'])
                        yield FormRequest(self.url, formdata= self.payload, callback=self.parse_data)

            # scrape that next page.
            yield Request(url=next_page, callback=self.parse_data)
        except Exception,e:
            # no next page.
            return

Here is a full copy of my project.

I am running my code using the scrapy crawl appext20 -t csv -o app.csv --loglevel=INFO command.

Upvotes: 0

Views: 134

Answers (1)

jbndlr

Reputation: 5210

Well, without having taken a deeper look at Scrapy itself, I had a look at the recursion part.

First, you may want to simplify your keyword generation.

import itertools
import random

URL = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
ALPHABET = [chr(i) for i in range(65, 65+26)]

def keyword_initial_set(n=2):
    '''Generates a list of all n-length combinations of the entire alphabet
    E.g. n=2: ['AA', 'AB', 'AC', ..., 'ZY', 'ZZ']
    E.g. n=5: ['AAAAA', 'AAAAB', 'AAAAC', ..., 'ZZZZY', 'ZZZZZ']
    '''
    cartesian = list(itertools.product(*[ALPHABET for i in range(n)]))
    return map((lambda x: ''.join(x)), cartesian)

def keyword_generator(base):
    '''Generates keywords for an additional level for the given keyword base
    E.g. base='BEZ': ['BEZA', 'BEZB', 'BEZC', ..., 'BEZZ']
    '''
    for c in ALPHABET:
        yield base + c

With these little helpers, it is a lot easier to generate your keyword combinatorics and to generate subsequent keywords for a recursive descent (see their docstrings).
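For instance, a quick sanity check of the two helpers (wrapped in list() so it behaves the same on Python 3, where map() and generators are lazy; on Python 2.7 the list() call is harmless):

seeds = list(keyword_initial_set(n=2))
print(len(seeds))                           # 676 two-letter prefixes (26 * 26)
print(seeds[:3])                            # ['AA', 'AB', 'AC']
print(list(keyword_generator('BEZ'))[:3])   # ['BEZA', 'BEZB', 'BEZC']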

Then, for your recursion, it is handy -- as you did in your own code -- to have two separate functions: One for the HTTP request, the other for handling the responses.

def keyword_request(kw):
    '''Issues an online search using a keyword
    WARNING: MONKEY-PATCHED CODE INCLUDED
    '''
    payload = {
        'p_entity_name': kw,
        'p_name_type': 'A',
        'p_search_type': 'BEGINS'
    }
    print('R {}'.format(kw))
    FormRequest(URL, formdata=payload, callback=keyword_parse)

def keyword_parse(response):
    '''Parses the response to check how many results were found and performs a recursive descent if necessary
    WARNING: MONKEY-PATCHED CODE INCLUDED
    '''
    try:
        n_res = Selector(response).xpath('//center/p/text()').extract()[0]
    except Exception: # Please put specific exception type here. Don't be so generic!
        n_res = ''

    if n_res.startswith('More than 500'):
        print('Recursive descent.')
        for kw in keyword_generator(response['p_entity_name']): # Hacked. If not feasible, get the current kw from somewhere else
            keyword_request(kw)
    else:
        # Parse paginated results here.
        pass

With these functions, your main method (or call to the crawler wherever it is issued) becomes:

if __name__ == '__main__':
    kwords = keyword_initial_set(n=2)
    for kw in kwords:
        keyword_request(kw)

What happens here?

The keyword_initial_set generates a list of all n-length combinations of the entire alphabet. This serves as a starting point: Each of these keywords is requested from the website search and the results are parsed.

In case the website yields more than 500 results, a recursive descent is performed. The current keyword is extended by all letters A-Z and for each new keyword (of length n+1) a new request is issued and parsed upon completion.
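For reference, here is a minimal, untested sketch of how that descent could be wired into an actual Scrapy spider. The spider name and method names are made up; the key point is carrying the current keyword along in Request.meta instead of mutating one shared payload dict between yields, as the original spider does:

from scrapy import Spider
from scrapy.http import FormRequest
from scrapy.selector import Selector

class KeywordDescentSpider(Spider):
    name = "keyword_descent"  # hypothetical name, not from the original project
    allowed_domains = ["appext20.dos.ny.gov"]
    url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
    ALPHABET = [chr(i) for i in range(65, 65 + 26)]

    def start_requests(self):
        # Seed with every two-letter prefix.
        for a in self.ALPHABET:
            for b in self.ALPHABET:
                yield self.keyword_request(a + b)

    def keyword_request(self, kw):
        payload = {'p_entity_name': kw,
                   'p_name_type': 'A',
                   'p_search_type': 'BEGINS'}
        # meta carries the keyword through the request/response cycle.
        return FormRequest(self.url, formdata=payload,
                           callback=self.parse_results,
                           meta={'keyword': kw})

    def parse_results(self, response):
        kw = response.meta['keyword']
        try:
            summary = Selector(response).xpath('//center/p/text()').extract()[0]
        except IndexError:
            summary = ''

        if 'More than 500 entities were found' in summary:
            # Too broad: descend one level by appending each letter.
            for c in self.ALPHABET:
                yield self.keyword_request(kw + c)
        else:
            # Narrow enough: scrape the result rows (pagination omitted here).
            for td in Selector(response).xpath("//td[@headers='c1']"):
                yield {
                    'title': td.xpath('a/text()').extract(),
                    'link': td.xpath('a/@href').extract(),
                    'p_entity_name': kw,
                }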

Hope this helps.

Monkey Patches

For my local and offline testing, I monkeypatched the original scrapy classes with these ones:

class FormRequest(object):
    '''Monkey-patch for original implementation
    '''
    def __init__(self, url, formdata, callback):
        self.url = url
        self.formdata = formdata
        self.callback = callback

        self.callback(formdata)

class Selector(object):
    '''Monkey-patch for original implementation
    '''
    def __init__(self, response):
        self.response = response

    def xpath(self, xpattern):
        return self

    def extract(self):
        n_res = random.randint(0, 510)
        if n_res > 500:
            return ['More than 500 results found']
        else:
            return ['']

Thus, you may have to adapt the code at those spots where my patches do not match the original behavior. But you'll surely manage that.

Upvotes: 1
