Scrape data as <li> between two known keyword encapsulated as <b> tag

Question

I'm using scrapy to scrape this kind of product page. I want to scrape the <li> items between Indications and Contre-indications, and then the <li> items after Contre-indications, where there is no predictable keyword marking the end of the list.

Here is the source code of the requested page.



Caractéristiques des croquettes pour chat Royal Canin Veterinary Diet - Urinary S/O LP 34 :

struvite.

la vessie.

d'oxalate de calcium.


maintien de la muqueuse vésicale 


Remarques :

 Urinary S/O Feline
chez le chat âgé, rénal avant la prescription de l'Urinary  S/O Feline


Indications :


dissolution des calculs urinaires de struvite
gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment


Contre-indications :

insuffisance rénale chronique, acidose métabolique 
traitement avec des médicaments acidifiant l'urine
lactation, gestation, croissance


Durée du traitement : 5 à 12 semaines sont nécessaires pour obtenir la dissolution des calculs de struvites.


First approach: regex, parsing the page as free text. I didn't manage to get anything useful with this regular expression: ([Ii]ndication[s]{0,1}.*?)([ ]*.*)(Contre-[Ii]ndication[s]{0,1}.*?). It worked in the online tester, but .re in Python didn't find any match, so I moved on.

Second Approach : I tried to extract using scrapy :

l.add_xpath('contre_indication','//*[@id="description-panel"]/div/article/b[starts-with(text(),"Contre-indications")]/following-sibling::ul/li/text()')
l.add_xpath('contre_indication','//*[@id="description-panel"]/div/article/p/b[starts-with(text(),"Contre-indications")]/following-sibling::ul/li/text()')
l.add_xpath('indication','//*[@id="description-panel"]/div/article/b[starts-with(text(),"Indication")]/following-sibling::ul/li/text()')
l.add_xpath('indication','//*[@id="description-panel"]/div/article/p/b[starts-with(text(),"Indication")]/following-sibling::ul/li/text()')

Sometimes the keyword sits in a bare /b and sometimes in a /p/b, which is why there are two XPaths for each field. At best, this gives me the whole text of the <li> elements, but with no distinction between Indication and Contre-indications.

Expected output would be :

Indication : ["dissolution des calculs urinaires de struvite","gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment"]
Contre-indication : ["insuffisance rénale chronique, acidose métabolique"..."lactation, gestation, croissance"]

I'm very keen to know a working approach to this kind of problem.

Kind regards

Granitosaurus · Accepted Answer

You can accomplish this with XPath selectors:

'//p[contains(b/text(),"Contre-indications")]/following-sibling::ul[1]/li/text()'

Explaining the xpath:

//p - select all paragraph nodes
[contains(b/text(),"Contre-indications")] - keep only the paragraphs whose child node b contains this text
/following-sibling::ul[1] - select the first following sibling of the paragraph that is an unordered list
/li/text() - select the text of that list's child list items

If you run it in scrapy shell:

$ scrapy shell
> body = ...
> from parsel import Selector
> sel = Selector(text=body)
> sel.xpath('//p[contains(b/text(),"Indication")]/following-sibling::ul[1]/li/text()').extract()
['dissolution des calculs urinaires de struvite', 'gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment']
> sel.xpath('//p[contains(b/text(),"Contre-indications")]/following-sibling::ul[1]/li/text()').extract()
['insuffisance rénale chronique, acidose métabolique', "traitement avec des médicaments acidifiant l'urine", 'lactation, gestation, croissance']
