Kennedy Kan

Reputation: 123

Scrapy extract URL (href) element without tags

I have managed to extract some data with the following program. However, when I check the extracted results, I realise that I cannot grab the href element (the URL) inside 'question_content' when the content contains a link.

import scrapy

class JPItem(scrapy.Item):
    best_answer = scrapy.Field()
    question_content = scrapy.Field()
    question_title = scrapy.Field()

class JPSpider(scrapy.Spider):

    name = "jp"
    allowed_domains = ['detail.chiebukuro.yahoo.co.jp']

    start_urls = [
        'https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q' + str(x)
        for x in range(12174460000, 12174470000)
    ]

    def parse(self, response):
        item = JPItem()

        item['question_title'] = response.css("div.mdPstd.mdPstdQstn.sttsRslvd.clrfx div.ttl h1::text").extract_first()
        item['question_content'] = ''.join([i for i in response.css("div.mdPstd.mdPstdQstn.sttsRslvd.clrfx div.ptsQes p::text").extract()])
        item['best_answer'] = ''.join([i for i in response.css("div.mdPstd.mdPstdBA.othrAns.clrfx div.ptsQes p.queTxt::text").extract()])

        yield item

EDIT 1: The question_content I would like to grab

As seen in the picture, the content contains a URL that I cannot capture with the "::text" selector. If I omit "::text", I get unrelated markup as well, such as the br and p tags.

How can I also grab that link without including the br and p tags?

Upvotes: 2

Views: 1087

Answers (2)

Tiny.D

Reputation: 6556

Try this new code:

import scrapy
import re

class JPItem(scrapy.Item):
    best_answer = scrapy.Field()
    question_content = scrapy.Field()
    question_title = scrapy.Field()
    question_link = scrapy.Field()

class JPSpider(scrapy.Spider):

    name = "jp"
    allowed_domains = ['detail.chiebukuro.yahoo.co.jp']

    start_urls = [
        'https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q12174467757?__ysp=VVNC',
    ]

    def parse(self, response):
        item = JPItem()

        item['question_title'] = response.css("div.mdPstd.mdPstdQstn.sttsRslvd.clrfx div.ttl h1::text").extract_first()
        # r'\s+' removes runs of whitespace; the character class '[\s+]' would also strip literal '+' signs
        item['question_content'] = re.sub(r'\s+', '', ''.join(response.css("div.mdPstd.mdPstdQstn.sttsRslvd.clrfx div.ptsQes p::text").extract()))
        item['question_link'] = ''.join(response.css("div.mdPstd.mdPstdQstn.sttsRslvd.clrfx div.ptsQes p:not([class]) a::text").extract())
        item['best_answer'] = re.sub(r'\s+', '', ''.join(response.css("div.mdPstd.mdPstdBA.othrAns.clrfx div.ptsQes p.queTxt::text").extract()))

        yield item

The output will then be:

'question_content':'USBについての質問です下記のサイトの通りCentOS7を1USBからインストールしようと思うのですが、USBに焼くとそのUSBは今まで通りに使えなくなってしまうのでしょうか...?(データを出し入れしたり)教えてください~!'

'question_link': u'https://www.skyarch.net/blog/?p=6382'

Upvotes: 1

lufte

Reputation: 1364

Try extracting the text of the question wrapper plus the text of all of its descendants:

wrapper_selector = "div.mdPstd.mdPstdQstn.sttsRslvd.clrfx div.ptsQes"
# '{0}' is used twice so that a single format() argument fills both placeholders
item['question_content'] = ''.join(response.css('{0}::text, {0} *::text'.format(wrapper_selector)).extract())

Upvotes: 0
