I'm using Scrapy to scrape product pages of this kind. I want to extract the <li> items found between <b>Indication</b> and <b>Contre-indications</b>, and then those up to the next <b></b>; the headings do not follow any predictable keyword.
Here is the source code of the requested page.
<article class="col-md-10 col-md-push-1">
<p><b>Caractéristiques des croquettes pour chat Royal Canin Veterinary Diet - Urinary S/O LP 34 :</b>
</p><ul>
<li>struvite.</li>
<li>la vessie.</li>
<li>d'oxalate de calcium.
</li>
<li>maintien de la muqueuse vésicale </li></ul><p></p>
<p><b>Remarques :</b>
</p><ul>
<li> Urinary S/O Feline</li>
<li>chez le chat âgé, rénal avant la prescription de l'Urinary S/O Feline</li></ul><p></p>
<p><b>Indications :</b>
</p><ul>
<li>dissolution des calculs urinaires de struvite</li>
<li>gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment</li></ul><p></p>
<p><b>Contre-indications :</b>
</p><ul>
<li>insuffisance rénale chronique, acidose métabolique</li>
<li>traitement avec des médicaments acidifiant l'urine</li>
<li>lactation, gestation, croissance</li></ul><p></p>
<p><b>Durée du traitement :</b> 5 à 12 semaines sont nécessaires pour obtenir la dissolution des calculs de struvites.<br>
P</p>
</article>
First approach: regex, parsing the page as free text. I didn't manage to get anything useful with this regular expression: (<b>[Ii]ndication[s]{0,1}.*?</b>)([\n\r]*.*)(<b>Contre-[Ii]ndication[s]{0,1}.*?</b>)
It worked fine in the tester, but Python's re module didn't find any match. Okay, let's move on.
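One plausible reason the tester and Python disagree (an assumption, since the actual Python call is not shown) is that the tester had a "dot matches newline" option enabled. By default, `.` in Python's `re` does not match `\n`, so `([\n\r]*.*)` can only cover one line after the heading. The sketch below runs the same pattern on a trimmed copy of the page source, with and without `re.DOTALL`; the `<li>` pattern and the variable names are illustrative, not part of the original attempt.

```python
import re

# Trimmed copy of the page source from the question.
html = (
    "<p><b>Indications :</b>\n"
    "</p><ul>\n"
    "<li>dissolution des calculs urinaires de struvite</li>\n"
    "<li>gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment</li></ul><p></p>\n"
    "<p><b>Contre-indications :</b>\n"
    "</p><ul>\n"
    "<li>lactation, gestation, croissance</li></ul><p></p>\n"
)

# The pattern from the question, unchanged.
pattern = (
    r"(<b>[Ii]ndication[s]{0,1}.*?</b>)"
    r"([\n\r]*.*)"
    r"(<b>Contre-[Ii]ndication[s]{0,1}.*?</b>)"
)

# Without DOTALL, "." stops at each newline, so the middle group cannot
# span the several lines between the two headings.
no_flag = re.search(pattern, html)

# With DOTALL, "." also matches newlines and the middle group can span them.
with_flag = re.search(pattern, html, flags=re.DOTALL)

# Pull the <li> texts out of the middle group (illustrative helper pattern).
items = []
if with_flag:
    items = re.findall(r"<li>\s*(.*?)\s*</li>", with_flag.group(2), flags=re.DOTALL)

print(no_flag)
print(items)
```

Even when it matches, a regex like this is fragile against markup changes, which is why an XPath-based approach is usually easier to maintain.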
Second approach: I tried to extract the data using Scrapy:
l.add_xpath('contre_indication','//*[@id="description-panel"]/div/article/b[starts-with(text(),"Contre-indications")]/following-sibling::ul/li/text()')
l.add_xpath('contre_indication','//*[@id="description-panel"]/div/article/p/b[starts-with(text(),"Contre-indications")]/following-sibling::ul/li/text()')
l.add_xpath('indication','//*[@id="description-panel"]/div/article/b[starts-with(text(),"Indication")]/following-sibling::ul/li/text()')
l.add_xpath('indication','//*[@id="description-panel"]/div/article/p/b[starts-with(text(),"Indication")]/following-sibling::ul/li/text()')
Sometimes the keyword is in a bare /b and sometimes in a /p/b, which is why there are two XPaths for each field.
Here, at best, I get all of the <li> text, with no distinction between Indication and Contre-indications.
The expected output would be:
Indication : ["dissolution des calculs urinaires de struvite","gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment"]
Contre-indication : ["insuffisance rénale chronique, acidose métabolique"..."lactation, gestation, croissance"]
I'd be very keen to know a working approach for this kind of problem.
Kind regards
Reputation: 21446
You can accomplish this with XPath selectors:
'//p[contains(b/text(),"Contre-indications")]/following-sibling::ul[1]/li/text()'
Explaining the xpath:
//p
- select all paragraph nodes
[contains(b/text(),"Contre-indications")]
- keep only those whose child b node's text contains the given string
/following-sibling::ul[1]
- select the first following sibling of that paragraph that is an unordered list
/li/text()
- select the text of each list item that is a direct child of that list
If you run it in scrapy shell:
$ scrapy shell
> body = ...
> from parsel import Selector
> sel = Selector(text=body)
> sel.xpath('//p[contains(b/text(),"Indication")]/following-sibling::ul[1]/li/text()').extract()
['dissolution des calculs urinaires de struvite', 'gestion des récidives d’urolithiase à struvite et à oxalate de calcium dans un seul aliment']
> sel.xpath('//p[contains(b/text(),"Contre-indications")]/following-sibling::ul[1]/li/text()').extract()
['insuffisance rénale chronique, acidose métabolique', "traitement avec des médicaments acidifiant l'urine", 'lactation, gestation, croissance']