Reputation: 17468
I want to extract the title and the PDF link of each paper on this page: https://iclr.cc/Conferences/2019/Schedule?type=Poster
Here is my code:
class ICLRCrawler(Spider):
    name = "ICLRCrawler"
    allowed_domains = ["iclr.cc"]
    start_urls = ["https://iclr.cc/Conferences/2019/Schedule?type=Poster", ]

    def parse(self, response):
        papers = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]')
        titles = Selector(response).xpath('//*[@id="maincard_704"]/div[3]')
        links = Selector(response).xpath('//*[@id="maincard_704"]/div[6]/a[2]')
        for title, link in zip(titles, links):
            item = PapercrawlerItem()
            item['title'] = title.xpath('text()').extract()[0]
            item['pdf'] = link.xpath('/@href').extract()[0]
            item['sup'] = ''
            yield item
However, it does not seem easy to get the title and link of each paper this way. How can I change the code to get the data?
Upvotes: 0
Views: 649
Reputation: 10666
You can use a much simpler approach:
def parse(self, response):
    for poster in response.xpath('//div[starts-with(@id, "maincard_")]'):
        item = PapercrawlerItem()
        item["title"] = poster.xpath('.//div[@class="maincardBody"]/text()[1]').get()
        item["pdf"] = poster.xpath('.//a[@title="PDF"]/@href').get()
        yield item
Upvotes: 1