CosimoCD

Reputation: 3790

XPath: how to extract all the text between multiple nodes?

I'm scraping an e-commerce website with the Python requests module, and I'm having trouble extracting text that is spread across multiple nodes. The HTML below is the part I'm trying to extract the text from. I need all the text under the div with class="rte ingredients", that is, the text inside the two p tags including the text inside every strong tag. Note that the number and position of the strong tags vary from page to page.

<div class="rte ingredients">
    <p>Farina sbiancata arricchita (farina sbiancata di 
    <strong>grano</strong>, 
    ferro ridotto, vitamine B3-B1-B2-B9), zucchero, 
    agenti lievitanti E500ii-E541i-E341i, destrosio, 
    <strong>latte</strong> 
    scremato disidratato, olio di 
    <strong>soia</strong> parzialmente 
    idrogenato, sale, 
    <strong>glutine</strong> di <strong>grano</strong>, 
    colorante E170, estratto secco di sciroppo di granoturco, caseinati di 
    sodio (<strong>latte</strong>), emulsionante E471, regolatore di acidità 
    E270. Può contenere tracce di <strong>uova</strong>. Contiene OGM.</p>

    <p>Valori nutrizionali (per 100g): energia 348Kcal, lipidi 3.3g (di cui 
    grassi saturi 0g), carboidrati 69.6g (di cui zuccheri 13g), proteine 
    10.9g, sale 2.6g.</p>
</div>

I'm using the following code, but it returns only part of the text:

ingredients = parser.xpath('//*[@id="bottom_right_product_infos"]/section[2]/div/p[1]/text()') 
print ingredients
['Farina sbiancata arricchita (farina sbiancata di']

Instead, I need to extract all the text under the div tag.

Can somebody help me with this? Thanks!

Upvotes: 1

Views: 1781

Answers (2)

SIM

Reputation: 22440

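It seems you are using the lxml library. If so, the method below should fetch the full content: call .text_content() on each element instead of reading .text or selecting text() in the XPath, because .text_content() also includes the text of child elements such as strong.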

content='''
<div class="rte ingredients">
    <p>Farina sbiancata arricchita (farina sbiancata di 
    <strong>grano</strong>, 
    ferro ridotto, vitamine B3-B1-B2-B9), zucchero, 
    agenti lievitanti E500ii-E541i-E341i, destrosio, 
    <strong>latte</strong> 
    scremato disidratato, olio di 
    <strong>soia</strong> parzialmente 
    idrogenato, sale, 
    <strong>glutine</strong> di <strong>grano</strong>, 
    colorante E170, estratto secco di sciroppo di granoturco, caseinati di 
    sodio (<strong>latte</strong>), emulsionante E471, regolatore di acidità 
    E270. Può contenere tracce di <strong>uova</strong>. Contiene OGM.</p>

    <p>Valori nutrizionali (per 100g): energia 348Kcal, lipidi 3.3g (di cui 
    grassi saturi 0g), carboidrati 69.6g (di cui zuccheri 13g), proteine 
    10.9g, sale 2.6g.</p>
</div>
'''
from lxml.html import fromstring
root = fromstring(content)
for items in root.xpath("//div[contains(@class,'ingredients')]/p"):
    print(items.text_content())  # .text_content() includes the text of nested tags; .text stops at the first child element
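
If you want each paragraph as one clean line, a possible variation is to collapse the line breaks and indentation that .text_content() preserves. This is only a minimal sketch on a shortened version of the markup above; the names sample_html, container, and paragraphs are illustrative, not taken from the question.

```python
from lxml.html import fromstring

# Shortened, hypothetical version of the markup from the question.
sample_html = """
<div class="rte ingredients">
    <p>Farina di <strong>grano</strong>, zucchero,
    <strong>latte</strong> scremato.</p>
    <p>Valori nutrizionali (per 100g): energia 348Kcal.</p>
</div>
"""

root = fromstring(sample_html)
container = root.xpath("//div[contains(@class, 'ingredients')]")[0]

# text_content() gathers the text of each <p> together with its <strong>
# children; split()/join() collapses runs of whitespace into single spaces.
paragraphs = [" ".join(p.text_content().split()) for p in container.xpath("./p")]

for line in paragraphs:
    print(line)
```

Because the loop works on whatever p elements are present, it should not matter how many strong tags a given page contains.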

Upvotes: 1

kjhughes

Reputation: 111756

The pure XML/XPath solution would be to change the XPath to directly select the string value of the targeted div, which is the concatenation of all of its descendant text nodes in document order:

string(/path/to/div)

This way, your XPath should be portable to any conformant XPath library (and you can minimize your need to remember non-standard, idiosyncratic access functions such as text_content()).
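
For illustration, here is how string() might be used from lxml. This is a sketch under assumptions: the markup is a shortened, hypothetical version of the question's HTML, the class-based path stands in for /path/to/div, and the names snippet, doc, raw_text, and clean_text are made up for the example. The standard XPath 1.0 function normalize-space() is shown as well, since the string value keeps the original line breaks and indentation.

```python
from lxml import html

# Shortened, hypothetical version of the markup from the question.
snippet = """
<div class="rte ingredients">
    <p>Farina di <strong>grano</strong>, zucchero,
    <strong>latte</strong> scremato.</p>
    <p>Valori nutrizionali (per 100g): energia 348Kcal.</p>
</div>
"""

doc = html.fromstring(snippet)

# string() returns the string value of the first matching div:
# all descendant text nodes concatenated, whitespace included.
raw_text = doc.xpath("string(//div[contains(@class, 'ingredients')])")

# normalize-space() does the same but also trims the ends and
# collapses internal whitespace runs into single spaces.
clean_text = doc.xpath("normalize-space(//div[contains(@class, 'ingredients')])")

print(clean_text)
```

Both expressions return a string rather than a list of nodes, so no indexing or joining is needed on the Python side.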

Upvotes: 0
