Reputation: 744
I have the following HTML source:
<ul class="content">
<li class="">
<div class="profile-card">
<div class="content">
<a href="https://www.linkedin.com/in/ouafae-ezzine-894b113">
Ouafae Ezzine
</a>
<p class="headline">
Organise vos evenements professionnels & personnels
</p>
<dl class="basic">
<dt>
Location
</dt>
<dd>
France
</dd>
<dt>
Industry
</dt>
</dl>
<table class="expanded hide-mobile">
<tbody>
<tr>
<th>
Current
</th>
<td>
Responsable at Blue Med Events
</td>
</tr>
<tr>
<th>
Past
</th>
<td>
Administrateur achats at Pfizer
</td>
</tr>
<tr>
<th>
Education
</th>
<td>
Universite d'Evry Val d'Essonne
</td>
</tr>
<tr>
<th>
Summary
</th>
<td>
Riche d'une experience de plus de 25 ans dans le domaine de l'organisation evenementielle, je mets mon expertise...
</td>
</tr>
</tbody>
</table>
</div>
</div>
</li>
<li class="">
<div class="profile-card">
<div class="content">
<h3>
<a href="https://www.linkedin.com/in/ouafae-ezzine-892855b6">
Ouafae Ezzine
</a>
</h3>
<p class="headline">
Gerante
</p>
<dl class="basic">
<dt>
Location
</dt>
<dd>
France
</dd>
<dt>
Industry
</dt>
<dd>
Events Services
</dd>
</dl>
<table class="expanded hide-mobile">
<tbody>
<tr>
<th>
Current
</th>
<td>
Gerante
</td>
</tr>
</tbody>
</table>
</div>
</div>
</li>
</ul>
I have written Python code that checks whether a given string exists in the page.
I am trying to write logic that extracts the anchor links of a particular profile when the string is associated with that profile (anchor tag).
My Python snippet:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('file:///nfs/users/lpediredla/Documents/linkedin/Top2profLinkedIn.html')
# Find every element whose text contains the search string.
ids = driver.find_elements(By.XPATH, "//*[contains(text(), 'Organise vos evenements professionnels')]")
# Don't know how to associate each matched element with its profile.
# Please help with the logic here.
driver.close()
I am stuck at this point, trying to associate the element with the profile bucket it sits in.
Any help is much appreciated.
Upvotes: 1
Views: 162
Reputation: 180522
What you want is preceding-sibling::a, which selects the anchor tags that come before, as siblings, the p tags containing the text 'Organise vos evenements professionnels':
"//p[contains(text(), 'Organise vos evenements professionnels')]/preceding-sibling::a"
Using your html:
In [11]: from lxml.html import fromstring
In [12]: xml = fromstring(html)
In [13]: print(xml.xpath("//p[contains(text(), 'Organise vos evenements professionnels')]/preceding-sibling::a"))
[<Element a at 0x7f5cae670188>]
In [14]: print(xml.xpath("//p[contains(text(), 'Organise vos evenements professionnels')]/preceding-sibling::a//text()"))
['\n Ouafae Ezzine\n ']
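One thing to watch: in the posted HTML the second profile wraps its a in an h3, so there the a is not a sibling of the p and preceding-sibling::a returns nothing. A possible alternative is to climb to the enclosing div.profile-card with the ancestor axis and collect every a beneath it. The sketch below assumes lxml, a trimmed copy of the markup, and a hypothetical helper named profile_links; treat it as one way to do it rather than the definitive one:

```python
from lxml.html import fromstring

# Trimmed copy of the question's markup: the first card has a bare <a>,
# the second wraps its <a> in an <h3>.
HTML = """
<ul class="content">
  <li><div class="profile-card"><div class="content">
    <a href="https://www.linkedin.com/in/ouafae-ezzine-894b113">Ouafae Ezzine</a>
    <p class="headline">Organise vos evenements professionnels &amp; personnels</p>
  </div></div></li>
  <li><div class="profile-card"><div class="content">
    <h3><a href="https://www.linkedin.com/in/ouafae-ezzine-892855b6">Ouafae Ezzine</a></h3>
    <p class="headline">Gerante</p>
  </div></div></li>
</ul>
"""


def profile_links(tree, phrase):
    """Return the hrefs found in every profile card whose text contains phrase.

    profile_links is a hypothetical helper name, not part of lxml.
    """
    xpath = (
        "//*[contains(text(), $phrase)]"                   # element holding the text
        "/ancestor::div[@class='profile-card']//a/@href"   # its card, then the links
    )
    # lxml binds $phrase from the keyword argument, so no manual quoting is needed.
    return tree.xpath(xpath, phrase=phrase)


tree = fromstring(HTML)
first = profile_links(tree, "Organise vos evenements professionnels")
second = profile_links(tree, "Gerante")
print(first)
print(second)
```

The same ancestor:: expression should also work with Selenium if the phrase is written into the string and the trailing /@href is dropped, because Selenium's XPath lookups must return elements; the link is then read with get_attribute('href').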
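If you want a case-insensitive match, you can use translate() to lowercase the text before comparing. translate() only maps the characters you list, so the full alphabet is listed here:
"//p[contains(translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'), 'organise vos evenements professionnels')]/preceding-sibling::a"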
Upvotes: 1