Extract text from an HTML string with Scrapy

Question

Here is the html string in question.

a book of grammar rules:

With BeautifulSoup, this code

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text

gets me

a book of grammar rules:

which is exactly what I want.

With scrapy, how do I get the same result?

from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()

This code gets me

['a ', ' of grammar ', ': ']

How should I fix it?

Roman · Accepted Answer

You can use this code to get all the text inside the div and its children:

text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)

Your selector returns only the text nodes that sit directly inside the div, but part of the text is located inside child elements (a). That is why you have to add a space before ::text: with the space, the selector also matches text inside descendant elements.
