hemanth30

Reputation: 21

Is there any way to handle href = '#' in Scrapy?

While trying to scrape all the content from a website called TimesJobs, I was unable to reach the next pages: the links in the pagination class all show href = '#', so I cannot follow them and cannot scrape the data from every page. Is there any way to get the actual URL behind these links? Thank you. The link I was trying to access was https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=python&txtLocation=bangalore

Upvotes: 1

Views: 84

Answers (2)

ThePyGuy

Reputation: 1035

It's worth noting that you can also play with the result size. I had luck getting 1000 results on one page with the URL below, which will probably help you a lot. I tried 3400 and it failed, so you'll have to experiment to find the limit. Either way, this should make the task much easier for you.

https://www.timesjobs.com/candidate/job-search.html?from=submit&actualTxtKeywords=python&searchBy=0&rdoOperator=OR&searchType=personalizedSearch&txtLocation=bangalore&luceneResultSize=1000&postWeek=60&txtKeywords=python&pDate=I&sequence=2&startPage=1

This does not solve the problem of navigating to '#', but it does solve the problem of scraping all the results. Also note that startPage always stays at 1; the site uses the sequence parameter to paginate.

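# Goes inside your spider class; requires `import scrapy` at the top of the file.
# The {} placeholder is filled with the page number (the `sequence` parameter).
start_urls = ['https://www.timesjobs.com/candidate/job-search.html?from=submit&actualTxtKeywords=python&searchBy=0&rdoOperator=OR&searchType=personalizedSearch&txtLocation=bangalore&luceneResultSize=1000&postWeek=60&txtKeywords=python&pDate=I&sequence={}&startPage=1']

def start_requests(self):
    # Request pages 1 to 3; widen the range to cover more pages.
    for i in range(1, 4):
        yield scrapy.Request(self.start_urls[0].format(i), callback=self.parse)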

Upvotes: 1

Ahmed Buksh

Reputation: 161

You need to debug a bit to see what happens when a pagination request is made. The site does not store hrefs for the next page because the URL is dynamic and generated at runtime. I tested it for page 7, and this is the link that was created:

https://www.timesjobs.com/candidate/job-search.html?from=submit&actualTxtKeywords=python&searchBy=0&rdoOperator=OR&searchType=personalizedSearch&txtLocation=bangalore&luceneResultSize=25&postWeek=60&txtKeywords=python&pDate=I&sequence=7&startPage=1
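One way to check which parameters actually change between pages is to decode the query strings of two captured URLs and compare them. The sketch below uses only the Python standard library and compares the two URLs quoted in this thread; `changed_params` is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import parse_qsl, urlsplit

# The two URLs quoted in this thread (1000 results on page 2, 25 results on page 7).
URL_A = (
    "https://www.timesjobs.com/candidate/job-search.html?from=submit"
    "&actualTxtKeywords=python&searchBy=0&rdoOperator=OR"
    "&searchType=personalizedSearch&txtLocation=bangalore"
    "&luceneResultSize=1000&postWeek=60&txtKeywords=python&pDate=I"
    "&sequence=2&startPage=1"
)
URL_B = (
    "https://www.timesjobs.com/candidate/job-search.html?from=submit"
    "&actualTxtKeywords=python&searchBy=0&rdoOperator=OR"
    "&searchType=personalizedSearch&txtLocation=bangalore"
    "&luceneResultSize=25&postWeek=60&txtKeywords=python&pDate=I"
    "&sequence=7&startPage=1"
)


def changed_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Only the page size and the page number differ between the two URLs.
    print(changed_params(URL_A, URL_B))
```

Only `luceneResultSize` and `sequence` differ, which suggests those are the two values worth varying when building requests.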

While on the main page, identify the total number of pages from the page source, then generate a list of these requests and send them. That way you will get the data from all the paginated pages too.
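As a rough illustration of the "generate the list of requests" step, here is a minimal sketch that builds one URL per page from a total result count. It assumes the parameters stay exactly as in the page 7 URL above and that `sequence` starts at 1; `page_urls` is a hypothetical helper, and how the total count appears in the page source is not shown in this thread, so it is passed in as a plain number.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.timesjobs.com/candidate/job-search.html"


def page_urls(total_results, page_size=25, keywords="python", location="bangalore"):
    """Build one search URL per page, varying only the `sequence` parameter."""
    pages = math.ceil(total_results / page_size)
    urls = []
    for sequence in range(1, pages + 1):
        # Parameter names and order are copied from the URL quoted above.
        params = {
            "from": "submit",
            "actualTxtKeywords": keywords,
            "searchBy": 0,
            "rdoOperator": "OR",
            "searchType": "personalizedSearch",
            "txtLocation": location,
            "luceneResultSize": page_size,
            "postWeek": 60,
            "txtKeywords": keywords,
            "pDate": "I",
            "sequence": sequence,
            "startPage": 1,  # stays at 1; `sequence` does the paging
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # 180 is an arbitrary example count, not a figure taken from the site.
    for url in page_urls(180):
        print(url)
```

Inside a Scrapy spider, each of these URLs could then be passed to `scrapy.Request(url, callback=self.parse)` from `start_requests`, or from the callback that reads the total count off the first page.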

Upvotes: 1
