Josh Korsik

Reputation: 47

With Scrapy, how can I get only part of an XPath result?

Here is part of my spider:

def parse(self, response):
    titles = HtmlXPathSelector(response).select('//li')
    for title in titles:
        item = EksidefeItem()
        item['favori'] = title.select("//*[@id='entry-list']/li/@data-favorite-count").extract()
        item['entry'] = ['<a href=https://eksisozluk.com%s'%a for a in title.select("//*[@class='entry-date permalink']/@href").extract()]
        item['yazari'] = title.select("//*[@id='entry-list']/li/@data-author").extract()
        item['basligi'] = title.select("//*[@id='topic']/h1/@data-title").extract()
        item['tarih'] = title.select("//*[@id='entry-list']/li/footer/div[2]/a[1]/text()").extract()

        return item

I am getting the date and time from item['tarih'], but the value is not just the date and time: it also contains other values. Here is an example of the parsed data:

26.01.2017 20:04 ~ 20:07

I want to use only the date part (the first 10 characters), like this:

26.01.2017

How can I do that?

Thanks

Upvotes: 1

Views: 165

Answers (2)

salmanwahed

Reputation: 9647

Consider using item loaders. You can extend the ItemLoader class and write your own custom item loader like this.

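from scrapy.loader import ItemLoader
# In newer Scrapy releases these processors live in itemloaders.processors.
from scrapy.loader.processors import TakeFirst, MapCompose

def tarih_modifier(value):
    # Keep only the date, e.g. "26.01.2017" from "26.01.2017 20:04 ~ 20:07".
    return value[:10]

class MyCustomLoader(ItemLoader):
    # Return a single value per field instead of a list.
    default_output_processor = TakeFirst()
    # Input processor applied to every value extracted for 'tarih'.
    tarih_in = MapCompose(tarih_modifier)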

You can also write this class in a separate module. Now in the parse method you can use this loader class.

def parse(self, response):
    l = MyCustomLoader(item=EksidefeItem(), response=response)
    # The field name must match the item field so that tarih_in is applied.
    l.add_xpath('tarih', "//*[@id='entry-list']/li/footer/div[2]/a[1]/text()")
    # add the rest
    return l.load_item()

Using a loader class makes it much more convenient to customize values.
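To make it concrete what the loader does with the tarih field, here is a rough plain-Python imitation of the two processors. The helpers map_compose_like and take_first_like are hypothetical stand-ins written for illustration, not Scrapy APIs; the real MapCompose and TakeFirst handle more cases than this sketch.

```python
def tarih_modifier(value):
    # Keep only the leading "DD.MM.YYYY" part of the string.
    return value[:10]


def map_compose_like(func, values):
    # Hypothetical stand-in for MapCompose: apply func to every extracted
    # value and drop results that are None.
    results = (func(v) for v in values)
    return [r for r in results if r is not None]


def take_first_like(values):
    # Hypothetical stand-in for TakeFirst: return the first value that is
    # neither None nor an empty string.
    for v in values:
        if v is not None and v != "":
            return v
    return None


# Sample strings shaped like the value shown in the question (made up here).
extracted = ["26.01.2017 20:04 ~ 20:07", "27.01.2017 09:15"]
cleaned = map_compose_like(tarih_modifier, extracted)
first = take_first_like(cleaned)
print(cleaned)
print(first)
```

Because the output processor keeps only the first value, a loader built on the whole response yields a single date. If you want one item per entry, you would probably create one loader per li element, for example by passing selector= instead of response=.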

Upvotes: 1

DBedrenko

Reputation: 5029

You could use string slicing to get just the date:

item['tarih'] = title.select("//*[@id='entry-list']/li/footer/div[2]/a[1]/text()").extract()
item['tarih'][0] = item['tarih'][0][:10]

But I would also add some validation (take a look at datetime.datetime.strptime()) to make sure you got a valid date.
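As a minimal sketch of that validation, the slice can be passed through strptime so that malformed values are rejected instead of stored. The function name extract_date and the "%d.%m.%Y" format are assumptions based on the sample value in the question.

```python
from datetime import datetime


def extract_date(raw, fmt="%d.%m.%Y"):
    """Return the leading date of raw as a datetime.date, or None if invalid."""
    # Take the first 10 characters, which should hold "DD.MM.YYYY".
    candidate = raw.strip()[:10]
    try:
        return datetime.strptime(candidate, fmt).date()
    except ValueError:
        # The slice was not a real calendar date in the expected format.
        return None


print(extract_date("26.01.2017 20:04 ~ 20:07"))  # 2017-01-26
print(extract_date("not a date"))                 # None
```

Returning None rather than raising lets the spider decide whether to skip the entry or log it; raising the ValueError would be an equally reasonable choice.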

Upvotes: 0
