Reputation: 519
I'm trying to scrape the text of Disqus comments from an online local newspaper using RSelenium in Chrome, but it is proving a little beyond my capabilities. I have searched in many places but either did not find the right information or, more probably, used the wrong search terms.
So far I have managed to get the "normal" html from the pages but cannot pinpoint the right class, css selector or id to get the Disqus comments. I have also tried SelectorGadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not allow selecting smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), which stores an "environment" in elems. I then tried to find child elements within elems but came up blank.
Viewing the page via developer tools I can see that the interesting stuff is within an iframe called #dsq-app2, and I have managed to extract all its source through elems$getPageSource() after switching to the frame using elems$switchToFrame("dsq-app2"). This outputs all the html as one big "dirty" chunk, and short of searching that chunk for the required content held in <p> tags and other elements of interest, such as posters' usernames in data-role="username", I can't see the right way forward.
I have also tried using the advice given here but the Disqus setup is a little different. One of the pages I'm trying is this, with the bulk of the comments area within a section called conversation and a ton of other ids such as posts and the unordered list with id=post-list that ultimately carries the comments I need to scrape.
Any ideas or tips are most welcome and will be received with thanks.
Upvotes: 2
Views: 701
Reputation: 519
After a lot of testing and experimenting I managed to get it working. I don't know if it's the cleanest or prettiest solution but it works. Hope others will find it useful. Basically, what I did was find the URL that points to the comments only. It is stored in the src attribute of the "dsq-app2" iframe. At first I was also switching to the iframe, but found that this works without doing so.
remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src") # find the src attribute within the iframe
remDr$navigate(src[[1]]) # navigate to the src url
# find the posts from the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")
length(elem.msgs)
msgtext <- elem.msgs[[1]]$getElementText() # find first post's text
msgtext # print message
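The code above prints only the first post. To collect the text of every post in one go, something along these lines should work. It is a minimal sketch: get_post_texts is a hypothetical helper name, and it assumes remDr is an open RSelenium remoteDriver already sitting on the Disqus comments page, and that the posts still carry the post-list id and post-message class used above.

```r
# Hypothetical helper: return the text of every post as a character vector.
# `remDr` is assumed to be an RSelenium remoteDriver on the comments page.
get_post_texts <- function(remDr) {
  # findElements() returns a list of webElement objects (possibly empty)
  msgs <- remDr$findElements(using = "css selector",
                             value = "#post-list .post-message")
  # getElementText() returns a list of length one, hence the [[1]]
  vapply(msgs, function(m) m$getElementText()[[1]], character(1))
}
```

Calling all.posts <- get_post_texts(remDr) after the remDr$navigate(src[[1]]) step should then give one string per comment, and character(0) if nothing matched.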
Update: I found out that if I use remDr$switchToFrame("dsq-app2") I do not need to use the src URL as I have explained above. So there are actually two ways of scraping: switch into the iframe with switchToFrame("nameOfFrame"), or navigate to the src URL taken from the iframe. Hope this makes it clearer.
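For completeness, the switchToFrame route might look roughly like this. It is only a sketch under a few assumptions: scrape_via_frame is a hypothetical helper name, remDr is an open RSelenium remoteDriver that has already navigated to the article page, and the iframe element itself is passed to switchToFrame() instead of the "dsq-app2" string I used. Passing the element is my choice here because switchToFrame() accepts a webElement, and that may be more robust across driver versions.

```r
# Hypothetical helper: read the post texts by switching into the Disqus iframe.
# `remDr` is assumed to be an RSelenium remoteDriver on the article page.
scrape_via_frame <- function(remDr, frame_id = "dsq-app2") {
  # Locate the iframe element and move the driver's focus into it
  frame <- remDr$findElement(using = "id", value = frame_id)
  remDr$switchToFrame(frame)
  # Return to the top-level document on exit, even if an error occurs
  on.exit(remDr$switchToFrame(NULL), add = TRUE)
  # Inside the frame, the posts can be found by class name as before
  msgs <- remDr$findElements(using = "class name", value = "post-message")
  vapply(msgs, function(m) m$getElementText()[[1]], character(1))
}
```

Switching back with switchToFrame(NULL) matters if you want to keep using remDr on the parent page afterwards, since element lookups are otherwise still scoped to the iframe.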
Upvotes: 1