agstudy

Reputation: 121568

What is the right xpath to scrape this web page?

I am trying to get the list of selectors shown on this page:

$("#Lastname"),$(".intro"),....

Here is my attempt using xpathSApply:

library(XML)
library(RCurl)
a <- getURL('http://www.w3schools.com/jquery/trysel.asp')
doc <- htmlParse(a)
xpathSApply(doc,'//*[@id="selectorOptions"]') ## I can't get the right xpath

I also tried the following, without success:

xpathSApply(doc,'//*[@id="selectorOptions"]/div[i]')

EDIT: I added the python tag since I would also accept a Python solution.

Upvotes: 3

Views: 406

Answers (2)

jdharrison

Reputation: 30425

The following is an R way to get at JavaScript-driven pages like this one. You will need to use a browser, as noted by @Peyton. Selenium Server is one good way to control a browser. I have written some R bindings for Selenium Server at https://github.com/johndharrison/RSelenium
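To illustrate why parsing the downloaded HTML alone comes back empty, here is a minimal Python sketch (Python since the question accepts it). The markup is a made-up stand-in, not the real w3schools page: it assumes the `selectorOptions` container is empty in the served source and is only filled in by a script once a browser runs it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the served page source (assumption):
# the container exists, but its children are created by JavaScript at run time.
STATIC_SOURCE = """
<html>
  <body>
    <div id="selectorOptions"></div>
    <script>
      var w3Sels = [];
      w3Sels.push("#Lastname");
      w3Sels.push(".intro");
    </script>
  </body>
</html>
"""


def container_children(source, element_id="selectorOptions"):
    """Return the child elements of the node with the given id in static markup."""
    root = ET.fromstring(source.strip())
    node = root.find(f".//*[@id='{element_id}']")
    return [] if node is None else list(node)


# A static parser never executes the <script>, so the container stays empty.
print(container_children(STATIC_SOURCE))  # prints []
```

Under that assumption the XPath itself is fine; it simply matches a node that has no content until a browser has executed the script.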

The following gives access to the page source after the JavaScript has run:

require(devtools)
devtools::install_github("johndharrison/RSelenium") # "user/repo" form
library(RSelenium)
library(RJSONIO)

# one needs to have an active server running
# the following commented out lines source the latest java binary
# RSelenium::checkForServer()
# RSelenium::startServer()
# a Selenium server is assumed to be running now

remDR <- remoteDriver$new()
remDR$open() # opens a browser, usually Firefox, with default settings
remDR$navigate('http://www.w3schools.com/jquery/trysel.asp') # navigate to your page
webElem <- remDR$findElements(value = "//*[@id='selectorOptions']") # find your elements

# display the appropriate quantities
cat(fromJSON(webElem[[1]]$getElementText())$value)
> cat(fromJSON(webElem[[1]]$getElementText())$value)
$("#Lastname")
$(".intro")
$(".intro, #Lastname")
$("h1")
$("h1, p")
$("p:first")
$("p:last")
$("tr:even")
$("tr:odd")
$("p:first-child")
$("p:first-of-type")
$("p:last-child")
$("p:last-of-typ
.....................

UPDATE:

A much easier way to access the information in this case is to use the executeScript method, which returns the page's own w3Sels JavaScript array:

library(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver$new() # create the driver object before opening it
remDr$open()
remDr$navigate('http://www.w3schools.com/jquery/trysel.asp')
remDr$executeScript("return w3Sels;")[[1]]

> remDr$executeScript("return w3Sels;")[[1]]
 [1] "#Lastname"              ".intro"                
 [3] ".intro, #Lastname"      "h1"                    
 [5] "h1, p"                  "p:first"               
 [7] "p:last"                 "tr:even"               
 [9] "tr:odd"                 "p:first-child"         
[11] "p:first-of-type"        "p:last-child"          
[13] "p:last-of-type"         "li:nth-child(1)"       
[15] "li:nth-last-child(1)"   "li:nth-of-type(2)"     
[17] "li:nth-last-of-type(2)" "b:only-child"          
[19] "h3:only-of-type"        "div > p"               
[21] "div p"                  "ul + h3"               
[23] "ul ~ table"             "ul li:eq(0)"           
[25] "ul li:gt(0)"            "ul li:lt(2)"           
[27] ":header"                ":header:not(h1)"       
[29] ":animated"              ":focus"                
[31] ":contains(Duck)"        "div:has(p)"            
[33] ":empty"                 ":parent"               
[35] "p:hidden"               "table:visible"         
[37] ":root"                  "p:lang(it)"            
[39] "[id]"                   "[id=my-Address]"       
[41] "p[id!=my-Address]"      "[id$=ess]"             
[43] "[id|=my]"               "[id^=L]"               
[45] "[title~=beautiful]"     "[id*=s]"               
[47] ":input"                 ":text"                 
[49] ":password"              ":radio"                
[51] ":checkbox"              ":submit"               
[53] ":reset"                 ":button"               
[55] ":image"                 ":file"                 
[57] ":enabled"               ":disabled"             
[59] ":selected"              ":checked"              
[61] "*"

Upvotes: 4

agstudy

Reputation: 121568

Thanks to jdharrison's comment, I parsed the JavaScript code to extract all the selectors. As mentioned by Peyton, this works in this particular case because all the selectors are in the code.

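# dump the sixth <script> node, which holds the w3Sels.push calls, to a file
capture.output(xpathSApply(doc,'//*/script')[[6]],
               file='test.js')
ll <- readLines('test.js')
# keep only the lines that push a selector onto w3Sels
ll <- ll[grepl('w3Sels.push',ll)]
# take the text between the outermost parentheses; the greedy match keeps
# selectors such as li:nth-child(1) whole (assumes one push call per line)
ll <- unlist(regmatches(ll, gregexpr("(?<=\\().*(?=\\))", ll, perl=TRUE)))

cat(head(ll))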
"#Lastname" ".intro" ".intro, #Lastname" "h1" "h1, p" "p:first"

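Since the question also accepts Python, the same idea can be sketched with the standard `re` module. This is only an illustration: `SAMPLE_JS` is a hypothetical excerpt standing in for the page's inline script, assumed to contain one `w3Sels.push("...")` call per selector, and `extract_selectors` is a name made up for this example.

```python
import re

# Hypothetical excerpt of the page's inline script (assumption): one
# w3Sels.push(...) call per selector, as the grepl() filter above suggests.
SAMPLE_JS = '''
var w3Sels = [];
w3Sels.push("#Lastname");
w3Sels.push(".intro");
w3Sels.push(".intro, #Lastname");
w3Sels.push("li:nth-child(1)");
'''

# Capture the double-quoted argument of each push call.
PUSH_ARG = re.compile(r'w3Sels\.push\(\s*"([^"]*)"\s*\)')


def extract_selectors(js_source):
    """Return the string literals passed to w3Sels.push, in source order."""
    return PUSH_ARG.findall(js_source)


print(extract_selectors(SAMPLE_JS))
# prints ['#Lastname', '.intro', '.intro, #Lastname', 'li:nth-child(1)']
```

Matching on the quotes rather than on the parentheses keeps selectors that contain parentheses, such as `li:nth-child(1)`, intact.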
Upvotes: 0
