GoingMyWay

Reputation: 17468

Scrapy, crawl data by onclick

I want to extract the title and the PDF link of each paper listed on this page: https://iclr.cc/Conferences/2019/Schedule?type=Poster

Here is my code:

class ICLRCrawler(Spider):
    name = "ICLRCrawler"
    allowed_domains = ["iclr.cc"]
    start_urls = ["https://iclr.cc/Conferences/2019/Schedule?type=Poster", ]

    def parse(self, response):
        papers = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]')
        titles = Selector(response).xpath('//*[@id="maincard_704"]/div[3]')
        links = Selector(response).xpath('//*[@id="maincard_704"]/div[6]/a[2]')
        for title, link in zip(titles, links):
            item = PapercrawlerItem()
            item['title'] = title.xpath('text()').extract()[0]
            item['pdf'] = link.xpath('/@href').extract()[0]
            item['sup'] = ''
            yield item 

However, this does not reliably return the title and link of each paper. How can I change the code to get this data for every paper?

Upvotes: 0

Views: 649

Answers (2)

Aditya Tiwari

Reputation: 1

You have to replace extract()[0] with get_attribute('href').

Upvotes: 0

gangabass

Reputation: 10666

You can use a much simpler approach:

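def parse(self, response):
    # Each poster sits in its own <div id="maincard_...">, so iterate over those cards
    for poster in response.xpath('//div[starts-with(@id, "maincard_")]'):
        item = PapercrawlerItem()
        # Relative XPaths (leading ".") search only inside the current card
        item["title"] = poster.xpath('.//div[@class="maincardBody"]/text()[1]').get()
        item["pdf"] = poster.xpath('.//a[@title="PDF"]/@href').get()

        yield item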
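
To check the per-card idea offline, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. The SAMPLE markup, the example.org URLs, and the extract_papers helper are assumptions made up for illustration, modeled on the XPath expressions above; the real page's markup may differ. ElementTree supports only a subset of XPath (no starts-with()), so the id prefix is checked in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the XPaths in the answer;
# the structure of the real page is an assumption here.
SAMPLE = """
<div id="content">
  <div id="maincard_1" class="maincard">
    <div class="maincardBody">First paper title</div>
    <div class="links">
      <a title="Abstract" href="/abs/1">Abstract</a>
      <a title="PDF" href="https://example.org/1.pdf">PDF</a>
    </div>
  </div>
  <div id="maincard_2" class="maincard">
    <div class="maincardBody">Second paper title</div>
    <div class="links">
      <a title="PDF" href="https://example.org/2.pdf">PDF</a>
    </div>
  </div>
  <div id="sidebar">Not a poster card</div>
</div>
"""


def extract_papers(markup):
    """Return one dict per poster card, looking up fields relative to that card."""
    root = ET.fromstring(markup.strip())
    papers = []
    for card in root.iter("div"):
        # Keep only the poster containers (ids such as "maincard_704").
        if not card.get("id", "").startswith("maincard_"):
            continue
        # Both lookups start at the current card, so a title can never be
        # paired with a link from a different paper.
        body = card.find(".//div[@class='maincardBody']")
        pdf = card.find(".//a[@title='PDF']")
        papers.append({
            "title": (body.text or "").strip() if body is not None else None,
            "pdf": pdf.get("href") if pdf is not None else None,
        })
    return papers


papers = extract_papers(SAMPLE)
for paper in papers:
    print(paper["title"], paper["pdf"])
```

Compared with the code in the question, nothing here depends on a hard-coded id such as maincard_704, and the title and link are never zipped from two separate lists, so a card without a PDF link gives None for that field and the remaining pairs stay aligned.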

Upvotes: 1
