zeusking123

Reputation: 219

How to follow multiple links in Scrapy?

[scrapy & python]

I want to follow and extract all the links located at the XPath //div[@class="work_area_content"]/a, and keep following the same XPath on every resulting page down to the deepest layer of each link. I've tried the code below; however, it only goes through the main layer and doesn't follow each link.

I suspect it has something to do with the links variable, which ends up containing no values. I'm not sure why the list is empty, though.

import re

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider

from ..items import DatabloggerScraperItem  # the project's item definition


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "jobs"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['1.1.1.1']

    # The URLs to start with
    start_urls = ['1.1.1.1/TestSuites']

    # Method for parsing items
    def parse(self, response):
        # Remove the string between the two placeholders with a regex
        test_str = response.text
        regex = r"(Back to)(.|\n)*?<br><br>"
        regex_response = re.sub(regex, "", test_str)
        regex_response2 = HtmlResponse(regex_response)  ##TODO: fix here!

        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(
            canonicalize=True,
            unique=True,
            restrict_xpaths=('//div[@class="work_area_content"]/a',),
        ).extract_links(regex_response2)
        print(links)  # debugging: this always prints an empty list

        # Go through all the found links: emit an item for each one,
        # then request the linked page with the same callback
        for link in links:
            item = DatabloggerScraperItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            yield item
            yield scrapy.Request(link.url, callback=self.parse, dont_filter=True)
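Side note: I marked the HtmlResponse line with a TODO because I suspect it is wrong. As far as I can tell from the docs, HtmlResponse expects a url as its first argument plus an explicit body and encoding, not the raw HTML itself, so that line should probably read something like:

regex_response2 = HtmlResponse(url=response.url, body=regex_response, encoding='utf-8')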

Upvotes: 0

Views: 839

Answers (1)

aleroot

Reputation: 72636

I think that you should use your link extractor inside a Rule with the follow=True parameter set; note that follow is an argument of the Rule, not of the link extractor itself.

Something like:

rules = (
    Rule(
        SgmlLinkExtractor(restrict_xpaths=('//div[@class="work_area_content"]/a',)),
        callback='parse_item',
        follow=True,
    ),
)

And since you are using a CrawlSpider, you should define such rules in the spider's rules attribute and avoid overriding parse() (CrawlSpider uses that method internally); take a look at this blog post for a complete example.
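Untested, but putting it all together the spider would look roughly like this. On recent Scrapy versions the stock LinkExtractor supersedes the deprecated SgmlLinkExtractor, and parse_item is just a placeholder callback name:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import DatabloggerScraperItem  # adjust to your project's items module


class DatabloggerSpider(CrawlSpider):
    name = "jobs"
    allowed_domains = ['1.1.1.1']
    start_urls = ['1.1.1.1/TestSuites']

    # Follow every link under the work_area_content div, recursively:
    # follow=True makes CrawlSpider re-apply the extractor to each
    # fetched page, and every page is also passed to parse_item.
    rules = (
        Rule(
            LinkExtractor(canonicalize=True, unique=True,
                          restrict_xpaths=('//div[@class="work_area_content"]/a',)),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # One item per visited page; the referring page is available in
        # the Referer header that Scrapy sets on followed requests
        item = DatabloggerScraperItem()
        item['url_from'] = response.request.headers.get('Referer')
        item['url_to'] = response.url
        yield item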

Upvotes: 1
