Reputation: 107
I'm trying to scrape links and clicks from the URL listed below. I can scrape the clicks using XPath, but I have a problem with the links: they come back as NA. Could anyone please explain why this happens and how to fix it? Here's my script:
library(RSelenium)
library(XML)
remDr <- remoteDriver(remoteServerAddr= "192.168.99.100", port = 4445L)
remDr$open()
remDr$navigate("http://bit.d o")
logbutton <- remDr$findElement("css selector", "#top_login_info a:nth-child(1)")
logbutton$clickElement()
user <- remDr$findElement('css selector', '#login_user_username')
pass <- remDr$findElement('css selector', '#login_user_password')
user$sendKeysToElement(list('test0001'))
pass$sendKeysToElement(list('qwerty1234'))
logb <- remDr$findElement('css selector', '.btn-primary')
logb$clickElement()
remDr$navigate('http://bit.d o/admin/url/http%3A%7C%7C2F%7C%7C2Fedition.cnn.com%7C%7C2F2017%7C%7C2F07%7C%7C2F21%7C%7C2Fopinions%7C%7C2Ftrump-russia-putin-lain-opinion%7C%7C2Findex.html')
html <- htmlParse(remDr$getPageSource()[[1]])
clicks = xpathSApply(html,'//td//span[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]')
links = xpathSApply(html, '//td//br+//a')
Important: I had to put a space between "d" and "o" in the domain name due to a Stack Overflow restriction.
Upvotes: 0
Views: 340
Reputation: 630
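It seems that your XPath for the links is incorrect: "+" is the CSS adjacent-sibling combinator, whereas in XPath it is the addition operator, so '//td//br+//a' does not select any nodes. I used SelectorGadget to extract the XPaths below. I wasn't sure which links you are interested in, so both the short (bit.do/...) and the long (cnn.com/...) ones are included: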
short_links <- xpathSApply(html, '//td//a[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]')
long_links <- xpathSApply(html, '//span[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]')
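Called without a function argument, xpathSApply returns node objects rather than strings. As a rough sketch, the text and the href can be pulled out by passing xmlValue or xmlGetAttr as the third argument. The markup below is a hypothetical stand-in (the real page's structure may differ), and the xp expression is only my guess at what the CSS-style "br + a" in the question was meant to select:

```r
library(XML)

# Hypothetical markup standing in for one cell of the stats table;
# the real page's structure may differ.
page <- '<html><body><table><tr>
  <td><span>12</span><br/><a href="http://example.com/short">short</a>
      <span>other</span></td>
</tr></table></body></html>'
html <- htmlParse(page, asText = TRUE)

# XPath counterpart of the CSS selector "td br + a":
# the element immediately following each <br>, kept only if it is an <a>.
xp <- "//td//br/following-sibling::*[1][self::a]"

link_text <- xpathSApply(html, xp, xmlValue)            # visible link text
link_href <- xpathSApply(html, xp, xmlGetAttr, "href")  # target URL
```

The same pattern should work with the SelectorGadget expressions above, e.g. xpathSApply(html, '<xpath>', xmlGetAttr, "href"), provided the matched nodes are <a> elements.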
By the way, be careful with the credentials (login and password) you have posted in the question. I would remove them once you have your answer.
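One way to keep credentials out of a posted script is to read them from environment variables. This is only a minimal sketch: the helper get_credential and the variable names BITDO_USER and BITDO_PASS are hypothetical and would have to be set outside the script (for example in ~/.Renviron):

```r
# Hypothetical helper: read a credential from an environment variable
# and fail loudly if it has not been set.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}
```

In the script from the question this would then be used as user$sendKeysToElement(list(get_credential("BITDO_USER"))), and likewise for the password with "BITDO_PASS".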
Upvotes: 1