Reputation: 1
I worked through the code from this video: https://www.youtube.com/watch?v=Wp6LRijW9wg
Spider code:
import scrapy
from demo_project.items import JokeItem
from scrapy.loader import ItemLoader

class JokesSpider(scrapy.Spider):
    name = 'jokes'
    start_urls = [
        'http://www.laughfactory.com/jokes/family-jokes'
    ]

    def parse(self, response):
        for joke in response.xpath("//div[@class = 'jokes']"):
            l = ItemLoader(item = JokeItem(), selector = joke)
            l.add_xpath('joke_text', ".//div[@class = 'joke-text']/p")
            yield l.load_item()

        next_page = response.xpath("//li[@class = 'next']/a/@href").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url = next_page_link, callback = self.parse)
Items code:
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

def remove_whitespace(value):
    return value.strip()

class JokeItem(scrapy.Item):
    joke_text = scrapy.Field(
        input_processor = MapCompose(remove_tags, remove_whitespace),
        output_processor = TakeFirst()
    )
When I run the command below:
scrapy crawl jokes -o data.csv
My CSV file still contains the HTML tags instead of the text alone. Can anyone please help me understand why the HTML tags are not removed?
Upvotes: 0
Views: 809
Reputation: 4822
Change:
l.add_xpath('joke_text', ".//div[@class = 'joke-text']/p")
to:
l.add_xpath('joke_text', ".//div[@class = 'joke-text']/p//text()")
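To see the difference between the two expressions without running the spider, here is a minimal standard-library sketch. It is only an analogy: it uses xml.etree.ElementTree rather than Scrapy's selectors, and the markup string is a made-up example, not the real page. Selecting the p element itself gives you the serialized element, tags included, while selecting its text nodes gives you only the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the page in the question (not the real site).
html = "<div class='joke-text'><p>Why did the <b>chicken</b> cross the road?</p></div>"
root = ET.fromstring(html)
p = root.find("p")

# Comparable to selecting ".../p": the whole element, serialized with its tags.
element_markup = ET.tostring(p, encoding="unicode")

# Comparable to selecting ".../p//text()": only the text nodes under the element.
text_only = "".join(p.itertext()).strip()

print(element_markup)
print(text_only)
```

One thing to watch for: //text() can return several strings per joke if the paragraph contains nested tags or line breaks, and TakeFirst() keeps only the first non-empty one. If jokes come out truncated, swapping the output processor for Join() from itemloaders.processors may be worth trying.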
Upvotes: 1