Reputation: 1145
I would like to scrape LinkedIn for personal use only (I need to get the posts from a friend's company page), and I'm using Selenium and BeautifulSoup for this.
I found that each post is a div with the ember-view class, but sponsored posts, which I don't want to scrape, also have this class. Digging further into the HTML, I found that I could select user posts by selecting all divs whose data-urn attribute has the value urn:li:activity:XXXXXXXXXX.
However, XXXXXXXXXX is a different number in each post div. How can I select all divs with data-urn=urn:li:activity:XXXXXXXXXX, given that XXXXXXXXXX changes from div to div?
Upvotes: 1
Views: 139
Reputation: 2469
Another solution.
from simplified_scrapy import SimplifiedDoc
html='''
<div>
<div class="ember-view" data-urn="urn:li:activity:123">123</div>
<div class="ember-view" data-urn=urn:li:activity:456>456</div>
<div class="ember-view" data-urn=urn:li:activity:789>789</div>
<div class="ember-view">other</div>
</div>
'''
doc = SimplifiedDoc(html)
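# First way: match the attribute with a regex over the raw tag text
# (raw strings keep \s and \d from being treated as string escapes)
divs = doc.getElementsByReg(r'data-urn[\s"=]+urn:li:activity:\d+',tag="div").text
print(divs)
# Second way: select by class, then keep divs whose data-urn matches the regex
divs = doc.selects('div.ember-view').containsReg(r'urn:li:activity:\d+',attr="data-urn").text
print(divs)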
Result:
['123', '456', '789']
['123', '456', '789']
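Since the question mentions BeautifulSoup, the same filtering can probably be done there without an extra library. The following is only a sketch under assumptions: the sample HTML is made up to mirror the one above, and the variable names (by_regex, by_css) are illustrative. It shows two options: passing a compiled regex as the attribute value to find_all, and using a CSS "starts with" attribute selector with select.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup, modelled on the structure described in the question.
html = '''
<div>
<div class="ember-view" data-urn="urn:li:activity:123">123</div>
<div class="ember-view" data-urn="urn:li:activity:456">456</div>
<div class="ember-view" data-urn="urn:li:activity:789">789</div>
<div class="ember-view">other</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Option 1: a compiled regex as the attribute filter.
# Only divs whose data-urn is "urn:li:activity:" followed by digits match.
pattern = re.compile(r"^urn:li:activity:\d+$")
by_regex = [div.get_text() for div in soup.find_all("div", attrs={"data-urn": pattern})]

# Option 2: a CSS attribute selector; ^= means "value starts with".
by_css = [div.get_text() for div in soup.select('div[data-urn^="urn:li:activity:"]')]

print(by_regex)
print(by_css)
```

With Selenium, the soup would presumably be built from driver.page_source instead of the hard-coded string. The regex variant is stricter (digits only after the prefix), while the CSS variant accepts anything that starts with the prefix.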
Upvotes: 1