Reputation: 13192
Here's the issue: I'm trying to scrape this Facebook About page for the birthday date. When I view the page source in the browser, the birthday appears inside an HTML comment within a div with class="hidden_elem".
Possibly because of this, when I fetch the page from my script (I have tried Selenium, Scrapy, and requests), all I get is an empty div with class="hidden_elem". The comment is nowhere to be seen, so there is nothing to parse.
How can I get this text? If possible, please show how to extract the birthday date too.
There may be some JavaScript on the Facebook page that causes this by design. How can I get around it?
Here is the URL I'm trying to get the birthday date from: https://www.facebook.com/profile.php?id=100004456147835&sk=about
In the browser's page source it looks like this:
<div class="hidden_elem"><code id="u_0_2g"><!-- <ul class="uiList _54nz _4kg _4kt" data-pnref="about"><li><div class="_5aj7"><div class="_4bl9"><div class="_54n- _2pi3"><div id="u_0_2e"></div></div></div><div class="_4bl7"><div class="_4ms4" id="u_0_2a"><div class="clearfix _ikh _5c0g" data-pnref="overview" id="u_0_2f"><div class="_4bl7"><ul class="uiList _1pi3 _4kg _6-h _703 _4ks"><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_344683"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No workplaces to show</span></div></div></div></div></li><li id="u_0_2b"><div class="clearfix _5y02" data-overviewsection="education" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://scontent.fblr6-1.fna.fbcdn.net/v/t1.0-1/c9.0.32.32/p32x32/580846_10149999285985791_1565762244_n.png?oh=d4ccc6a667e53f20db9cf60c0742f989&oe=5B1420C5" alt="" aria-label="Cambridge Institute of technolagy" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Studies at <a class="profileLink" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1">Cambridge Institute of technolagy</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg">Past: <a class="profileLink" href="https://www.facebook.com/deekshaintegrated/" data-hovercard="/ajax/hovercard/page.php?id=176180289071224" data-hovercard-prefer-more-content-show="1">Deeksha Integrated</a> and <a 
class="profileLink" href="https://www.facebook.com/pages/chethana-vidya-mandiratumkur/378826618888908" data-hovercard="/ajax/hovercard/page.php?id=378826618888908" data-hovercard-prefer-more-content-show="1">chethana vidya mandira,tumkur</a></div></div></div></div></div></div></div></li><li id="u_0_2c"><div class="clearfix _5y02" data-overviewsection="places" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://external.fblr6-1.fna.fbcdn.net/safe_image.php?d=AQCKH3kcP1-A2NPe&w=32&h=32&url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F80%2FBangaloreMontage.png&cfs=1&fallback=hub_city&f&_nc_hash=AQDbJ1ytdhSz3E8E" alt="" aria-label="Bangalore, India" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Lives in <a class="profileLink" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1">Bangalore, India</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg"><span id="u_0_2d">From <span class="fwb"><a class="profileLink" href="https://www.facebook.com/pages/Tumkur/106525352717093" data-hovercard="/ajax/hovercard/page.php?id=106525352717093" data-hovercard-prefer-more-content-show="1">Tumkur</a></span></span></div></div></div></div></div></div></div></li><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_585866"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No relationship info to 
show</span></div></div></div></div></li></ul></div><div class="_4bl9 _zu9"><ul class="uiList _5yql _4kg" data-overviewsection="contact_basic" role="button" tabindex="0"><li class="_4tnv _2pif"><div class="clearfix _ikh"><div class="_4bl7"><div class="_pvf _5pmc"><i class="img sp_yw06AF9sktb sx_e0cf75"></i></div></div><div class="_4bl9 _2pis _2dbl"><span class="_c24 _2ieq"><div><span class="accessible_elem">Birthday</span></div><div>April 28, 1998</div></span></div></div></li></ul></div></div></div></div></div></li></ul> --></code></div>
When I get the page source from my script, only <div class="hidden_elem"> </div> comes back.
Upvotes: 2
Views: 2051
Reputation: 325
You need to scroll down the page first (this snippet uses HtmlUnit):
String s = "window.scrollBy(0,document.body.scrollHeight || document.documentElement.scrollHeight)";
ScriptResult sr = page.executeJavaScript(s);
LOG.info("Result= " + sr.getJavaScriptResult());
After that, you will be able to get the "hidden_elem" list of objects:
String xpathHiddenElem = "//div[contains(@class, 'hidden_elem')]";
List<Object> responseHiddenElem = page.getByXPath(xpathHiddenElem);
LOG.info("responseHiddenElem: {}", responseHiddenElem);
if (responseHiddenElem != null && responseHiddenElem.size() > 0) {
    for (Object element : responseHiddenElem) {
        HtmlDivision elementCasted = (HtmlDivision) element;
        LOG.info("elementContent: {}", elementCasted.getTextContent());
        LOG.info("elementContent: {}", elementCasted.asText());
        LOG.info("elementContent: {}", elementCasted.getTagName());
        LOG.info("elementContent: {}", elementCasted.getIndex());
    }
}
Upvotes: 1
Reputation: 1529
With BeautifulSoup you can pull out the HTML comments like this:
from bs4 import BeautifulSoup, Comment
# html is the page source you fetched, as a string
soup = BeautifulSoup(html, 'lxml')
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    print(comment)
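To go from the raw comment to the birthday itself, the comment body has to be parsed a second time, because it is itself HTML. Below is a minimal sketch of that two-pass idea using only the standard library's html.parser, so it runs without extra installs. The names CommentCollector, TextCollector, find_birthday, and sample are made up for illustration, and sample is a trimmed-down stand-in for the markup in the question, not the real page. It also assumes the value is the first text node after the "Birthday" label, which matches the snippet in the question but may not hold on other pages.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the body of every HTML comment in the document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.texts.append(data)


def find_birthday(page_source):
    """Return the text that follows a 'Birthday' label inside any comment, or None."""
    outer = CommentCollector()
    outer.feed(page_source)
    outer.close()
    for comment in outer.comments:
        # Second pass: the comment body is HTML, so parse it again.
        inner = TextCollector()
        inner.feed(comment)
        inner.close()
        # Assumption: the value is the text node right after the label.
        for label, value in zip(inner.texts, inner.texts[1:]):
            if label == "Birthday":
                return value
    return None


# Hypothetical, shortened version of the markup shown in the question.
sample = (
    '<div class="hidden_elem"><code id="u_0_2g"><!-- '
    '<ul><li><span class="_c24 _2ieq">'
    '<div><span class="accessible_elem">Birthday</span></div>'
    '<div>April 28, 1998</div></span></li></ul>'
    ' --></code></div>'
)

print(find_birthday(sample))
```

This only helps once the HTML you fetched actually contains the comment. If your script still receives an empty hidden_elem div, find_birthday returns None, and the problem lies in how the page is fetched rather than in the parsing.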
Upvotes: 2