Krishna

Reputation: 1

Unable to remove html tags using remove_tags from w3lib.html while using Scrapy

I followed the code from this video: https://www.youtube.com/watch?v=Wp6LRijW9wg

Spider code:

import scrapy
from demo_project.items import JokeItem
from scrapy.loader import ItemLoader

class JokesSpider(scrapy.Spider):
    name = 'jokes'

    start_urls = [
        'http://www.laughfactory.com/jokes/family-jokes'
    ]

    def parse(self, response):
        # Build one item per joke container on the page
        for joke in response.xpath("//div[@class = 'jokes']"):
            l = ItemLoader(item=JokeItem(), selector=joke)
            l.add_xpath('joke_text', ".//div[@class = 'joke-text']/p")
            yield l.load_item()

        # Follow the pagination link, if there is one
        next_page = response.xpath("//li[@class = 'next']/a/@href").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)

items code:

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

def remove_whitespace(value):
    return value.strip()

class JokeItem(scrapy.Item):
    joke_text = scrapy.Field(
        input_processor=MapCompose(remove_tags, remove_whitespace),
        output_processor=TakeFirst()
    )

When I run the command below:

scrapy crawl jokes -o data.csv

The resulting CSV file still contains the HTML tags instead of just the text. Can anyone help me understand why the tags are not removed?

Upvotes: 0

Views: 809

Answers (1)

SuperUser

Reputation: 4822

Change:

l.add_xpath('joke_text', ".//div[@class = 'joke-text']/p")

to:

l.add_xpath('joke_text', ".//div[@class = 'joke-text']/p//text()")

Upvotes: 1
