Reputation: 219
[scrapy & python]
I want to follow and extract all the links located at the XPath (//div[@class="work_area_content"]/a), and keep following the links found at that same XPath until I reach the deepest layer of each link. I've tried the code below, but it only goes through the main layer and doesn't follow each link. I suspect it has something to do with the links variable containing an empty list, though I'm not sure why the list is empty.
import re

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider

from myproject.items import DatabloggerScraperItem  # adjust to your project's items module


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "jobs"
    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['1.1.1.1']
    # The URLs to start with
    start_urls = ['1.1.1.1/TestSuites']

    # Method for parsing items
    def parse(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        test_str = response.text
        # Removes string between two placeholders with regex
        regex = r"(Back to)(.|\n)*?<br><br>"
        regex_response = re.sub(regex, "", test_str)
        regex_response2 = HtmlResponse(regex_response)  ##TODO: fix here!
        #print(regex_response2)
        links = LinkExtractor(canonicalize=True, unique=True,
                              restrict_xpaths='//div[@class="work_area_content"]/a').extract_links(regex_response2)
        print(type(links))
        # Now go through all the found links
        print(links)
        for link in links:
            item = DatabloggerScraperItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
            print(items)
            yield scrapy.Request(links, callback=self.parse, dont_filter=True)
        # Return all the found items
        return items
Upvotes: 0
Views: 839
Reputation: 72636
I think that you should use a SgmlLinkExtractor with the follow=True parameter set.
Something like:
links = SgmlLinkExtractor(follow=True, restrict_xpaths='//div[@class="work_area_content"]/a').extract_links(regex_response2)
And since you are using a CrawlSpider, you should define rules; take a look at this blog post for a complete example.
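For reference, a minimal sketch of what the rules-based approach might look like, reusing the XPath and item class from the question. It uses the non-deprecated LinkExtractor (SgmlLinkExtractor is deprecated in newer Scrapy releases); the items module path and the callback name parse_item are placeholders of my own, and an http:// scheme is assumed on the start URL:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import DatabloggerScraperItem  # hypothetical path; adjust to your project


class DatabloggerSpider(CrawlSpider):
    name = "jobs"
    allowed_domains = ['1.1.1.1']
    start_urls = ['http://1.1.1.1/TestSuites']

    # Follow every link found at the given XPath, at every depth,
    # and hand each downloaded page to parse_item.
    rules = (
        Rule(
            LinkExtractor(canonicalize=True, unique=True,
                          restrict_xpaths='//div[@class="work_area_content"]/a'),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # The page that linked here is available in the Referer header,
        # which Scrapy's default RefererMiddleware sets on followed requests.
        item = DatabloggerScraperItem()
        item['url_from'] = response.request.headers.get('Referer')
        item['url_to'] = response.url
        yield item

Note that CrawlSpider's rules only take effect if the spider does not override parse(), which is why the callback above is parse_item rather than parse.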
Upvotes: 1