rvest cannot find node with xpath

Question

This is the website I scrape for PPP projects.

I want to use XPath to select a node like the one below.

The XPath I get from "Inspect Element" is '//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span'.

My script is below:

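library(rvest)

# read_html() replaces the older html(), which newer rvest versions no longer provide
a <- read_html("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/toPPPList.do")
b <- html_nodes(a, xpath = '//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span')
b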

This returns an empty node set:

{xml_nodeset (0)}

When I checked the page source, I could not find anything about the project I had selected.

Why is it missing from the page source, and how can I get the node with rvest?

hrbrmstr · Accepted Answer

The page makes an XHR request for the content, which is why the project list never appears in the page source. Just work with that data (it's pretty clean):

library(httr)
library(magrittr) # provides the %>% pipe used below

POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
     encode="form",
     body=list(queryPage=1,
               distStr="",
               induStr="",
               investStr="",
               projName="",
               sortby="",
               orderby="",
               stageArr="")) -> res

content(res, as="text") %>% 
  jsonlite::fromJSON(flatten=TRUE) %>% 
  dplyr::glimpse()

(StackOverflow isn't advanced enough to let me post the output of that as it thinks it's spam).

It's a 4 element list with fields totalCount, list (which has the actual data), currentPage and totalPage.

It looks like you can change the queryPage form variable to iterate through the pages to get the whole list/database, something like:

library(httr)
library(purrr)
library(dplyr)

get_page <- function(page_num=1, .pb=NULL) {

  # tick the progress bar that was passed in, not a global `pb`
  if (!is.null(.pb)) .pb$tick()$print()

  POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
       encode="form",
       body=list(queryPage=page_num,
                 distStr="",
                 induStr="",
                 investStr="",
                 projName="",
                 sortby="",
                 orderby="",
                 stageArr="")) -> res

  content(res, as="text") %>% 
    jsonlite::fromJSON(flatten=TRUE) -> dat

  dat$list

}

n <- 5 # change this to the value in `totalPage`

pb <- progress_estimated(n)
df <- map_df(1:n, get_page, pb)
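
If you'd rather not hard-code `n`, something like the following might work: fetch page 1, read `totalPage` from it, then request the remaining pages. This is only a sketch. The helper names (`ppp_url`, `ppp_body`, `fetch_ppp`, `fetch_all_ppp`) are made up for illustration, and it assumes `totalPage` comes back as a single number (or a numeric string) and that every page's `list` has compatible columns.

```r
library(httr)
library(purrr)
library(jsonlite)

# Endpoint taken from the request above.
ppp_url <- "http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null"

# Hypothetical helper: build the form body for one page; only queryPage varies.
ppp_body <- function(page_num) {
  list(queryPage = page_num, distStr = "", induStr = "", investStr = "",
       projName = "", sortby = "", orderby = "", stageArr = "")
}

# Hypothetical helper: fetch one page and return the parsed 4-element list.
fetch_ppp <- function(page_num) {
  res <- POST(ppp_url, encode = "form", body = ppp_body(page_num))
  stop_for_status(res) # fail early on an HTTP error
  fromJSON(content(res, as = "text", encoding = "UTF-8"), flatten = TRUE)
}

# Hypothetical helper: read totalPage from page 1, then fetch the rest.
# Assumption: totalPage is a single number or numeric string.
fetch_all_ppp <- function() {
  first <- fetch_ppp(1)
  n <- as.integer(first$totalPage)
  rest <- map(seq_len(n)[-1], function(p) fetch_ppp(p)$list)
  dplyr::bind_rows(c(list(first$list), rest))
}

# df <- fetch_all_ppp()
```

Nothing is requested until `fetch_all_ppp()` is called, so you can check the form body with `ppp_body(2)` first. Consider adding a `Sys.sleep()` between requests to go easy on the server.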
