salvu

Reputation: 519

RSelenium scraping for Disqus comments

I'm trying to scrape the text of Disqus comments from a local online newspaper using RSelenium with Chrome, but I'm finding it tougher than my skills allow. I have searched in many places but have not found the right information, most probably because I am using the wrong search terms.

So far I have managed to get the "normal" HTML from the pages, but I cannot pinpoint the right class, CSS selector or id to get the Disqus comments. I have also tried SelectorGadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not allow selecting smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), which stores an "environment" in elems. I then tried to find child elements within elems but came up blank.

Viewing the page in the developer tools, I can see that the interesting content is inside an iframe called #dsq-app2, and I have managed to extract all of its source through elems$getPageSource() after switching to the frame using elems$switchToFrame("dsq-app2"). This outputs all the HTML as one big "dirty" chunk. Short of searching that text for the <p> tags and other elements of interest, such as posters' usernames in data-role="username", I can't see the right way forward.

I have also tried the advice given here, but the Disqus setup is a little different. On one of the pages I'm trying (this one), the bulk of the comments area sits within a section called conversation, alongside a ton of other ids such as posts, and an unordered list with id=post-list ultimately carries the comments I need to scrape.

Any ideas or tips are most welcome and will be received with thanks.

Upvotes: 2

Views: 701

Answers (1)

salvu

Reputation: 519

After a lot of testing and experimenting I managed it. I don't know if it's the cleanest or prettiest solution, but it works, and I hope others will find it useful. Basically, I found the URL that points to the comments only. It is stored in the src attribute of the "dsq-app2" iframe. At first I was also switching to the iframe, but I found that this approach works without doing so.

remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src") # get the src attribute of the iframe
remDr$navigate(src[[1]]) # navigate to the src url

# find the posts from the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")

length(elem.msgs)
msgtext <- elem.msgs[[1]]$getElementText() # find first post's text
msgtext # print message
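
The code above only prints the first post. To get the text of every post in one go, something like the following should work. This is a minimal sketch: element_texts is a helper name I made up, and it relies only on getElementText() returning a list whose first entry is the text. The username selector in the usage comments is taken from the data-role="username" attribute mentioned in the question and is an assumption about the Disqus markup.

```r
# Hypothetical helper (not part of RSelenium): pull the text out of every
# element in a list of webElements, e.g. the list returned by
# findChildElements(). Returns a character vector, one entry per element.
element_texts <- function(elems) {
  vapply(
    elems,
    function(el) as.character(el$getElementText()[[1]]),
    character(1)
  )
}

# Usage with the objects created above (needs the live browser session):
# all.msgs <- element_texts(elem.msgs)
#
# Usernames, assuming they sit in elements carrying data-role="username":
# elem.users <- elem.posts[[1]]$findChildElements(
#   using = "css selector", value = "[data-role='username']")
# all.users <- element_texts(elem.users)
```

I would not assume all.msgs and all.users have the same length, so check that before combining them into a data frame.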

Update: I found out that if I use remDr$switchToFrame("dsq-app2") I do not need the src URL as explained above. So there are actually two ways of scraping:

  1. Use switchToFrame("nameOfFrame") or
  2. Use my prior solution of using the src URL from the iframe
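
For the first option, a rough sketch could look like this. scrape_disqus_frame is a hypothetical helper name, remDr is assumed to be an open remoteDriver that has already navigated to the article page, and the "post-message" class is the one used in my code above. I pass the iframe element itself to switchToFrame() rather than the name string; both are accepted by RSelenium, this is just my choice here.

```r
# Hypothetical helper: read the comment texts by switching into the Disqus
# iframe instead of navigating to its src URL.
# Assumes remDr is an open RSelenium remoteDriver on the article page.
scrape_disqus_frame <- function(remDr, frame_id = "dsq-app2") {
  frame <- remDr$findElement(using = "id", value = frame_id)
  remDr$switchToFrame(frame)          # move focus into the iframe
  on.exit(remDr$switchToFrame(NULL))  # switch back to the main page on exit

  # Inside the frame, the posts can be searched for directly.
  msgs <- remDr$findElements(using = "class name", value = "post-message")
  vapply(
    msgs,
    function(el) as.character(el$getElementText()[[1]]),
    character(1)
  )
}

# Usage (needs a running Selenium server and browser):
# remDr$navigate("toTheRequiredPage")
# all.msgs <- scrape_disqus_frame(remDr)
```

Switching back with switchToFrame(NULL) matters if you want to keep working with the rest of the page afterwards, since element lookups otherwise stay inside the iframe.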

Hope this makes it clearer.

Upvotes: 1
