JJJohn

Reputation: 1079

Extract text from an HTML string with Scrapy

Here is the html string in question.

<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>

With BeautifulSoup, this code

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text

gets me

a book of grammar rules:

which is exactly what I want.

With scrapy, how do I get the same result?

from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()

This code gets me

['a ', ' of grammar ', ': ']

How should I fix it?

Upvotes: 0

Views: 268

Answers (1)

Roman

Reputation: 1933

You can use this code to get all the text inside the div and its children:

text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)

Your selector returns only the text nodes that are direct children of the div, but part of the text is located inside child elements (a). Adding a space before ::text turns it into a descendant selector, so the text of the child elements is included in the result.

Upvotes: 1
