rankthefirst

Reputation: 1468

rvest cannot find node with xpath

This is the website I am scraping for PPP projects.

I want to use XPath to select one of the project nodes on that page.

The XPath I get from "Inspect Element" is '//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span'

My script is below:

library(rvest)

# read_html() replaces html(), which newer rvest versions no longer provide
a <- read_html("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/toPPPList.do")
b <- html_nodes(a, xpath = '//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span')
b

The result is an empty node set:

{xml_nodeset (0)}

Then I checked the page source and could not find anything about the project I selected.

Why is it missing from the page source, and how can I get the node with rvest?

Upvotes: 1

Views: 515

Answers (1)

hrbrmstr

Reputation: 78822

The page makes an XHR request for the content after it loads, which is why the project list is not in the page source that rvest downloads. Just work with that data (it's pretty clean):

library(httr)
library(magrittr) # provides the %>% pipe used below

POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
     encode="form",
     body=list(queryPage=1,
               distStr="",
               induStr="",
               investStr="",
               projName="",
               sortby="",
               orderby="",
               stageArr="")) -> res

content(res, as="text") %>% 
  jsonlite::fromJSON(flatten=TRUE) %>% 
  dplyr::glimpse()

(StackOverflow isn't advanced enough to let me post the output of that as it thinks it's spam).

It's a 4-element list with the fields totalCount, list (which has the actual data), currentPage and totalPage.
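
To make that shape concrete, here is a small offline sketch that parses a made-up JSON string with the same four top-level fields. Only the four field names come from the response described above; the PROJ_ID and PROJ_NAME columns and every value are placeholders, not the site's real schema.

```r
library(jsonlite)

# Mock payload: top-level field names match the real response;
# the columns inside `list` and all values are hypothetical.
mock <- '{
  "totalCount": 3,
  "list": [
    {"PROJ_ID": "a1", "PROJ_NAME": "Project A"},
    {"PROJ_ID": "b2", "PROJ_NAME": "Project B"}
  ],
  "currentPage": 1,
  "totalPage": 2
}'

dat <- fromJSON(mock, flatten = TRUE)

names(dat)        # the four top-level fields
class(dat$list)   # the array of objects is simplified to a data.frame
nrow(dat$list)    # number of rows on this page
dat$totalPage     # how many pages there are to request
```

With the real response, dat$list should be the data frame of projects for the requested page, assuming the array simplifies the same way.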

It looks like you can change the queryPage form variable to iterate through the pages to get the whole list/database, something like:

library(httr)
library(purrr)
library(dplyr)

get_page <- function(page_num=1, .pb=NULL) {

  # tick the progress bar that was passed in, not a global `pb`
  if (!is.null(.pb)) .pb$tick()$print()

  POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
       encode="form",
       body=list(queryPage=page_num,
                 distStr="",
                 induStr="",
                 investStr="",
                 projName="",
                 sortby="",
                 orderby="",
                 stageArr="")) -> res

  content(res, as="text") %>% 
    jsonlite::fromJSON(flatten=TRUE) -> dat

  dat$list

}

n <- 5 # change this to the value in `totalPage`

pb <- progress_estimated(n)
df <- map_df(1:n, get_page, pb)
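
Instead of hard-coding n, the page count could probably be read from the totalPage field of the first response. This is a minimal sketch, assuming the endpoint keeps returning the four fields described above. fetch_raw and get_all_pages are hypothetical helper names, and the fetch argument exists only so the loop can be tried against a stub without hitting the server.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: POST one page and return the whole parsed response
# (not just `list`), so `totalPage` stays available to the caller.
fetch_raw <- function(page_num) {
  res <- POST("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null",
              encode = "form",
              body = list(queryPage = page_num, distStr = "", induStr = "",
                          investStr = "", projName = "", sortby = "",
                          orderby = "", stageArr = ""))
  stop_for_status(res)  # fail early on HTTP errors
  fromJSON(content(res, as = "text", encoding = "UTF-8"), flatten = TRUE)
}

# Hypothetical helper: read `totalPage` from page 1, then fetch the rest.
get_all_pages <- function(fetch = fetch_raw) {
  first <- fetch(1)
  n <- as.integer(first$totalPage)
  pages <- list(first$list)
  if (n > 1) {
    for (i in 2:n) pages[[i]] <- fetch(i)$list
  }
  dplyr::bind_rows(pages)  # stack the per-page data frames
}

# df <- get_all_pages()   # uncomment to query the live endpoint
```

Fetching page 1 first costs nothing extra, since its rows are kept, and it avoids guessing the page count up front.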

Upvotes: 2
