XPath: extract all the text between multiple nodes?

Question

I'm scraping an e-commerce website using the Python requests module, and I'm having trouble extracting the text between multiple nodes. The following HTML is the part I'm trying to extract the text from. I need all the text under the (div class="rte ingredients") element, which is spread across the two (p) tags and all the (strong) tags inside them. Note that the (strong) tags can vary from page to page.


    Farina sbiancata arricchita (farina sbiancata di 
    grano, 
    ferro ridotto, vitamine B3-B1-B2-B9), zucchero, 
    agenti lievitanti E500ii-E541i-E341i, destrosio, 
    latte 
    scremato disidratato, olio di 
    soia parzialmente 
    idrogenato, sale, 
    glutine di grano, 
    colorante E170, estratto secco di sciroppo di granoturco, caseinati di 
    sodio (latte), emulsionante E471, regolatore di acidità 
    E270. Può contenere tracce di uova. Contiene OGM.

    Valori nutrizionali (per 100g): energia 348Kcal, lipidi 3.3g (di cui 
    grassi saturi 0g), carboidrati 69.6g (di cui zuccheri 13g), proteine 
    10.9g, sale 2.6g.

I'm using the following code, but it returns only part of the text.

ingredients = parser.xpath('//*[@id="bottom_right_product_infos"]/section[2]/div/p[1]/text()') 
print ingredients
['Farina sbiancata arricchita (farina sbiancata di']

Instead, I need to extract all the text under the (div) tag.

Can somebody help me with this? Thanks!

SIM · Accepted Answer

It seems you are using the lxml library. If so, the approach below should return the full content: call .text_content() on each matched element instead of .text, which stops at the first child tag.

content='''

    Farina sbiancata arricchita (farina sbiancata di 
    grano, 
    ferro ridotto, vitamine B3-B1-B2-B9), zucchero, 
    agenti lievitanti E500ii-E541i-E341i, destrosio, 
    latte 
    scremato disidratato, olio di 
    soia parzialmente 
    idrogenato, sale, 
    glutine di grano, 
    colorante E170, estratto secco di sciroppo di granoturco, caseinati di 
    sodio (latte), emulsionante E471, regolatore di acidità 
    E270. Può contenere tracce di uova. Contiene OGM.

    Valori nutrizionali (per 100g): energia 348Kcal, lipidi 3.3g (di cui 
    grassi saturi 0g), carboidrati 69.6g (di cui zuccheri 13g), proteine 
    10.9g, sale 2.6g.

'''
from lxml.html import fromstring
root = fromstring(content)
for items in root.xpath("//div[contains(@class,'ingredients')]/p"):
    print(items.text_content())  # .text_content() also includes text inside child tags; .text does not
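
To make the difference concrete, here is a minimal sketch comparing text() with .text_content(). The markup is hypothetical: the position of the (strong) tags and the shortened text are assumptions for illustration, since the page's real tags are not shown above.

```python
from lxml.html import fromstring

# Hypothetical markup: the <strong> placement and the shortened text
# are assumptions, not the real page.
html = """
<div class="rte ingredients">
  <p>Farina sbiancata di <strong>grano</strong>, zucchero, <strong>latte</strong> scremato.</p>
  <p>Valori nutrizionali (per 100g): energia 348Kcal.</p>
</div>
"""

root = fromstring(html)
first_p = root.xpath("//div[contains(@class, 'ingredients')]/p")[0]

# text() selects only the text nodes that are direct children of <p>,
# so the words wrapped in <strong> are missing from this list.
direct_text = first_p.xpath("text()")

# text_content() concatenates every descendant text node.
full_text = first_p.text_content()

# Two XPath-only ways to get the same string.
via_string = first_p.xpath("string()")
via_join = "".join(first_p.xpath(".//text()"))

# The whole div as a single string, with runs of whitespace collapsed.
div = root.xpath("//div[contains(@class, 'ingredients')]")[0]
all_text = " ".join(div.text_content().split())

print(direct_text)
print(full_text)
print(all_text)
```

Because .text_content() does not depend on where the (strong) tags sit, it should keep working when they vary from page to page.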
