Souames

Reputation: 1145

How to select element using regex and an attribute

I would like to scrape LinkedIn for personal use only (I need to get the posts from a friend's company page), and I'm using Selenium and BeautifulSoup to do it.

Each post is a div with the ember-view class, but sponsored posts, which I don't want to scrape, have this class too. Digging further into the HTML, I found that I can pick out user posts by selecting every div whose data-urn attribute has a value of the form urn:li:activity:XXXXXXXXXX.

However, XXXXXXXXXX is a different number in each post div. How can I select all divs with data-urn=urn:li:activity:XXXXXXXXXX, given that the number changes from div to div?

Upvotes: 1

Views: 139

Answers (1)

dabingsou

Reputation: 2469

Another solution, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc

# Sample markup: three activity posts (quoted and unquoted attributes) and one div without data-urn
html = '''
<div>
  <div class="ember-view" data-urn="urn:li:activity:123">123</div>
  <div class="ember-view" data-urn=urn:li:activity:456>456</div>
  <div class="ember-view" data-urn=urn:li:activity:789>789</div>
  <div class="ember-view">other</div>
</div>
'''
doc = SimplifiedDoc(html)

# First way: match the raw tag text with a regex (raw strings avoid invalid-escape warnings)
divs = doc.getElementsByReg(r'data-urn[\s"=]+urn:li:activity:[\d]+', tag="div").text
print(divs)

# Second way: select by class, then keep the divs whose data-urn matches the regex
divs = doc.selects('div.ember-view').containsReg(r'urn:li:activity:[\d]+', attr="data-urn").text
print(divs)

Result:

['123', '456', '789']
['123', '456', '789']
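
Since the question mentions BeautifulSoup, here is a minimal sketch of the same idea with bs4 alone. It assumes the page source has already been fetched (for example from Selenium's driver.page_source); the sample HTML and the urn:li:ad:789 value are made up for illustration and are not taken from LinkedIn's real markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: two activity posts, one div with a
# different (invented) URN, and one div without the attribute.
html = '''
<div>
  <div class="ember-view" data-urn="urn:li:activity:123">123</div>
  <div class="ember-view" data-urn="urn:li:activity:456">456</div>
  <div class="ember-view" data-urn="urn:li:ad:789">sponsored</div>
  <div class="ember-view">other</div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Option 1: pass a compiled regex as the attribute value; bs4 keeps
# only the tags whose data-urn value matches the pattern.
pattern = re.compile(r"^urn:li:activity:\d+$")
by_regex = [d.get_text() for d in soup.find_all("div", attrs={"data-urn": pattern})]
print(by_regex)

# Option 2: CSS attribute selector, "starts with" (^=). No regex is
# needed if checking the prefix is enough.
by_css = [d.get_text() for d in soup.select('div[data-urn^="urn:li:activity:"]')]
print(by_css)
```

Both options print ['123', '456'] for this sample. The regex version is stricter because it also requires the part after the prefix to be digits only.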

Upvotes: 1
