Ramaranjan Ruj

Reputation: 55

Web Extraction from popups

I need to get the weblinks for all followers listed on the following page.

https://www.researchgate.net/topic/biotechnology

There are 206770 followers for this topic at the moment. When I click the "View all" button, a popup appears with a list that keeps expanding as I scroll down.

https://www.researchgate.net/profile/Kestutis_Sasnauskas ...

The above are the links for the top follower. Is there a way we can get the weblinks for all 206770 followers?

Upvotes: 4

Views: 123

Answers (3)

jdharrison

Reputation: 30425

The server returns the data as JSON if you ask for it. Each subsequent call uses an offset parameter supplied by the previous JSON response. In the example below I have requested only the first 10 offsets, which is equivalent to scrolling down 10 times. The response contains a lot more data than just the profile web links:

library(RCurl)
library(XML)
library(jsonlite)
# get initial page
initURL <- "http://www.researchgate.net/topic/biotechnology"
doc <- htmlParse(initURL)
noFollowers <- doc["//*/strong/*/a[@class='js-see-all']", fun = xmlValue][[1]]
noFollowers <- as.integer(gsub("[^0-9]", "", noFollowers))

appURL <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000"
appData <- getURL(appURL
                  , httpheader = c(accept = "application/json"))
follData <- list(fromJSON(appData)$result$data$content$data$listItems)
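# follow the next 10 offsets; each response supplies the offset for the next call
for(i in 1:10){
  nextURL <- fromJSON(appData)$result$data$nextOffset
  appData <- getURL(paste0(appURL, "&offset=", nextURL)
                    , httpheader = c(accept = "application/json"))
  follData[[i+1]] <- fromJSON(appData)$result$data$content$data$listItems
}
# collect the relative profile URLs from every page, dropping missing entries
followers <- na.omit(do.call(c, lapply(follData, function(x){x$data$url})))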
> head(followers)
[1] "profile/Subhashish_Dutta" "profile/Jerome_Wang3"     "profile/Jose_Carbajo2"   
[4] "profile/Daniele_Riccio"   "profile/Fiona_Togneri2"   "profile/Sukanya_Patel" 
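
To fetch everything rather than a fixed 10 offsets, the loop could keep following the offset until the server stops supplying one. Below is a minimal sketch of that pattern only. The names follow_offsets, fetch_page and mock_fetch are hypothetical, the mock stands in for the getURL() + fromJSON() calls above, and the stopping rule (no nextOffset means no more pages) is an assumption I have not checked against the live site.

```r
# Hypothetical helper, not part of any package.
# fetch_page(offset) must return list(items = <character>, nextOffset = <value or NULL>).
follow_offsets <- function(fetch_page, max_pages = 10) {
  pages <- list()
  offset <- NULL                       # NULL means "first page"
  for (k in seq_len(max_pages)) {      # max_pages is a safety cap
    page <- fetch_page(offset)
    pages[[k]] <- page$items
    offset <- page$nextOffset
    if (is.null(offset)) break         # assumption: no nextOffset means we are done
  }
  unlist(pages, use.names = FALSE)
}

# Mock in place of the real HTTP call: serves 7 fake items, 3 per page.
mock_fetch <- function(offset) {
  start <- if (is.null(offset)) 1 else offset
  end <- min(start + 2, 7)
  list(items = paste0("profile/user_", start:end),
       nextOffset = if (end < 7) end + 1 else NULL)
}

res <- follow_offsets(mock_fetch)
```

With the real endpoint, fetch_page would wrap the getURL() call and pull listItems and nextOffset out of the parsed JSON. Keeping a cap like max_pages is sensible while testing so you do not hammer the server.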

Upvotes: 0

dimitris_ps

Reputation: 5951

This can be done with rvest and RSelenium. RSelenium does most of the work; rvest just makes your life easier. Install RSelenium from GitHub with devtools::install_github("ropensci/RSelenium") and rvest from CRAN.

Here is the code you need to accomplish what you are looking for.

siteUrl <- "http://www.researchgate.net/"
GateUrl <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset="

library(rvest)
library(RSelenium)

checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open(silent = FALSE)

i <- 0
profileUrls <- c()

for(j in 1:3){
  print(j)
  remDrv$navigate(paste0(GateUrl, i))
  l <- html(remDrv$getPageSource()[[1]])
  profileUrls <- c(profileUrls, 
               paste0(siteUrl, l %>% html_nodes(".display-name") %>% xml_attr("href")))
  i <- length(profileUrls)+1

}

remDrv$close()
profileUrls 

A couple of things here. First, you need to work out the bounds of the j loop. I think each URL returns 38 profiles, so the loop should be something like for(j in 1:ceiling(followers/38)), rounding up so the last partial page is not skipped.
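
As a rough worked example of the loop bound: the follower count is the one quoted in the question, and 38 per page is only my guess from above, so check it against what the page actually returns.

```r
n_followers <- 206770   # follower count quoted in the question
per_page    <- 38       # assumption: profiles returned per URL, verify on the live page

# round up so the final, partially filled page is still requested
n_pages <- ceiling(n_followers / per_page)
```

You would then use for(j in seq_len(n_pages)) in place of for(j in 1:3).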

The second point is that the code saves the links inefficiently: it appends to the vector on every iteration. A better solution would be to use lapply and then unlist.
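
A minimal sketch of the lapply/unlist idea. The names collect_links and fake_fetch are hypothetical; fake_fetch stands in for the navigate + html_nodes step and returns made-up links. It also assumes the offsets are known up front (a fixed page size), which is not how the loop above computes i.

```r
# Hypothetical helper, not part of rvest or RSelenium.
# fetch_page(offset) must return a character vector of links for that page.
collect_links <- function(offsets, fetch_page) {
  pages <- lapply(offsets, fetch_page)   # one character vector per page
  unlist(pages, use.names = FALSE)       # flatten once at the end
}

# Stand-in for the real scraping step: returns 3 fake links per offset.
fake_fetch <- function(offset) paste0("profile/user_", offset + 0:2)

links <- collect_links(c(0, 3, 6), fake_fetch)
```

This builds the result in one go instead of growing profileUrls with c() on every pass.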

Last point: you need Mozilla Firefox on your machine, since it is the default browser used by RSelenium, though you can set it to use whichever of the most popular browsers you have.

Results from the first 56:

> profileUrls
[1] "http://www.researchgate.net/profile/Jose_Carbajo2"           
[2] "http://www.researchgate.net/profile/Daniele_Riccio"          
[3] "http://www.researchgate.net/profile/Fiona_Togneri2"          
[4] "http://www.researchgate.net/profile/Sukanya_Patel"           
[5] "http://www.researchgate.net/profile/Neri_Fattorini"          
[6] "http://www.researchgate.net/profile/Pham_Thi_Thuy_Van"       
[7] "http://www.researchgate.net/profile/Kestutis_Sasnauskas"     
[8] "http://www.researchgate.net/profile/Iris_Weintal"            
[9] "http://www.researchgate.net/profile/Godelieve_Verhaegen"     
[10] "http://www.researchgate.net/profile/Janani_Venkatraman2"     
[11] "http://www.researchgate.net/profile/Kai_Wang126"             
[12] "http://www.researchgate.net/profile/Irine_Ronin"             
[13] "http://www.researchgate.net/profile/Natasha_Ikhsan"          
[14] "http://www.researchgate.net/profile/Nadya_Hajar"             
[15] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"    
[16] "http://www.researchgate.net/profile/Amsha_Viraragavan"       
[17] "http://www.researchgate.net/profile/Wei_Leiyan"              
[18] "http://www.researchgate.net/profile/Yosuke_Inada"            
[19] "http://www.researchgate.net/profile/Nadya_Hajar"             
[20] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"    
[21] "http://www.researchgate.net/profile/Amsha_Viraragavan"       
[22] "http://www.researchgate.net/profile/Wei_Leiyan"              
[23] "http://www.researchgate.net/profile/Yosuke_Inada"            
[24] "http://www.researchgate.net/profile/Yongning_You"            
[25] "http://www.researchgate.net/profile/Susan_Hu6"               
[26] "http://www.researchgate.net/profile/Matt_Evans11"            
[27] "http://www.researchgate.net/profile/Nam_Kieu"                
[28] "http://www.researchgate.net/profile/Nur_Musa3"               
[29] "http://www.researchgate.net/profile/Varaporn_S"              
[30] "http://www.researchgate.net/profile/Askar_Begzat3"           
[31] "http://www.researchgate.net/profile/Bing_Wang63"             
[32] "http://www.researchgate.net/profile/Xuebin_Yan"              
[33] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[34] "http://www.researchgate.net/profile/Stephen_Heimann"         
[35] "http://www.researchgate.net/profile/Hanina_Hanifa"           
[36] "http://www.researchgate.net/profile/Bo_Wang143"              
[37] "http://www.researchgate.net/profile/Xuebin_Yan"              
[38] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[39] "http://www.researchgate.net/profile/Stephen_Heimann"         
[40] "http://www.researchgate.net/profile/Hanina_Hanifa"           
[41] "http://www.researchgate.net/profile/Bo_Wang143"              
[42] "http://www.researchgate.net/profile/Huili_Li5"               
[43] "http://www.researchgate.net/profile/Giuseppe_Infusini"       
[44] "http://www.researchgate.net/profile/Carmen_Wacher"           
[45] "http://www.researchgate.net/profile/Linyn_Linyn"             
[46] "http://www.researchgate.net/profile/Dan_Youel"               
[47] "http://www.researchgate.net/profile/Catherine_Williams16"    
[48] "http://www.researchgate.net/profile/Nichole_Macaraeg"        
[49] "http://www.researchgate.net/profile/Peter_Oroszlan"          
[50] "http://www.researchgate.net/profile/Eduard_Karamov"          
[51] "http://www.researchgate.net/profile/Mauricio_Franco3"        
[52] "http://www.researchgate.net/profile/Patricia_Zancan"         
[53] "http://www.researchgate.net/profile/Rohana_Dassanayake"      
[54] "http://www.researchgate.net/profile/Khadija_Khataby"         
[55] "http://www.researchgate.net/profile/Imane_Moest"             
[56] "http://www.researchgate.net/profile/Rory_Adey"

Upvotes: 1

lukeA

Reputation: 54247

As an alternative to RSelenium, you could try it like this (first 56 followers as an example):

library(XML)
library(jsonlite)
offsets <- seq(from = 1, to = 50, 18)
urls <- sprintf("http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset=%d", offsets)

df <- data.frame()
for (x in seq_along(urls)) {
  doc <- htmlParse(urls[x])
  script <- as(doc[['//script[5]']], "character")
  splits <- strsplit(script, '\\(function\\(\\)\\{Y\\.rg\\.createInitialWidget\\("[^\"]+",')[[1]][-1]
  res <- lapply(splits, function(split) {
    split <-sub(");})();\n", "", split, fixed = TRUE)
    res <- try(as.data.frame(t(unlist(fromJSON(gsub("\\\\", "", split))))), silent = TRUE)
    if (!inherits(res, "try-error")) return(res) else return(NULL)
  })
  df <- rbind(df, do.call(rbind, res[1:(length(res)-2)]))
}
dplyr::glimpse(df)
# Observations: 56
# Variables:
# $ _isReact                                                         (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.displayName                                                 (fctr) Jose Maria Carbajo, Daniele Riccio, Fiona S Togneri, Sukanya Paramashivaiah Patel, Neri Fattorini, Pham thi thuy van, Kestutis Sasnauskas, Iris Weintal, Godelieve Verhaegen, Ja...
# $ data.profile.professionalInstitution.professionalInstitutionName (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.profile.professionalInstitution.professionalInstitutionUrl  (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.professionalInstitutionName                                 (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.professionalInstitutionUrl                                  (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.url                                                         (fctr) profile/Jose_Carbajo2, profile/Daniele_Riccio, profile/Fiona_Togneri2, profile/Sukanya_Patel, profile/Neri_Fattorini, profile/Pham_Thi_Thuy_Van, profile/Kestutis_Sasnauskas, pr...
# $ data.imageUrl                                                    (fctr) http://c1.rgstatic.net/m/797670414832/images/template/default/profile/profile_default_m.jpg, http://i1.rgstatic.net/i/profile/54a1a5539f8e2f289f_m_25d91.jpg, http://i1.rgstatic...
# $ data.imageSize                                                   (fctr) m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m
# $ data.imageHeight                                                 (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.imageWidth                                                  (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.enableFollowButton                                          (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.enableHideButton                                            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.enableConnectionButton                                      (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.isClaimedAuthor                                             (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.hasExtraContainer                                           (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showStatsWidgets                                            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showHideButton                                              (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.accountKey                                                  (fctr) Jose_Carbajo2, Daniele_Riccio, Fiona_Togneri2, Sukanya_Patel, Neri_Fattorini, Pham_Thi_Thuy_Van, Kestutis_Sasnauskas, Iris_Weintal, Godelieve_Verhaegen, Janani_Venkatraman2, Ka...
# $ data.hasInfoPopup                                                (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.hasTeaserPopup                                              (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.widgetId                                                    (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ id                                                               (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ templateName                                                     (fctr) application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, a...
# $ templateExtensions                                               (fctr) generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, ...
# $ widgetUrl                                                        (fctr) http://www.researchgate.net/application.PeopleAccountItem.html?entityId=7508014&imageSize=m&enableFollowButton=1&showHideButton=0&showConnectionButton=0&event=tp_followers_xflw...
# $ viewClass                                                        (fctr) views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views....
# $ yuiModules                                                       (fctr) rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleI...

Upvotes: 0
