Tarak

Reputation: 1075

R web scraping issue on extracting content through rvest

I am trying to extract content from https://careers.microsoft.com/us/en/search-results and get the title, info, etc. from the page:

library(magrittr)  # provides %>%

urlString <- "https://careers.microsoft.com/us/en/search-results?"
getHTML <- xml2::read_html(urlString)
t2 <- getHTML %>% rvest::html_text() %>% stringr::str_sub(start = 8027, end = 65679)
jsonWeb <- jsonlite::fromJSON(t2)
df <- jsonWeb$data$jobs

Is there a more elegant way to do this, like extracting the JSON of `phApp.ddo {}` directly? Thank you so much.

Upvotes: 0

Views: 109

Answers (2)

xwhitelight

Reputation: 1579

Use the V8 package to run the page's JavaScript and recover the phApp object:

library(rvest)
library(V8)

pg <- read_html("https://careers.microsoft.com/us/en/search-results")

# Grab every inline <script> that mentions phApp
scripts <- pg %>% html_nodes(xpath = "//script[contains(.,'phApp')]") %>% html_text()

# Evaluate them in a V8 context so phApp gets populated
ct <- v8()
ct$eval("var phApp = {}")
for (js in scripts) ct$eval(js)

# Pull the object back into R and drill down to the jobs
data <- ct$get("phApp")
jobs <- data$ddo$eagerLoadRefineSearch$data$jobs
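The V8 round-trip itself can be seen offline, without hitting the site. Below is a minimal sketch in which a hand-written script (hypothetical content, standing in for one of the page's inline `<script>` blocks) populates `phApp`, and `ct$get()` serialises it back into R structures:

```r
library(V8)

ct <- v8()
ct$eval("var phApp = {}")
# Stand-in for an inline script from the page (hypothetical content)
ct$eval("phApp.ddo = { jobs: [{ title: 'Engineer', country: 'US' }] }")

# V8 converts the JS object to R lists/data frames via JSON
data <- ct$get("phApp")
data$ddo$jobs  # a data frame with columns title and country
```

Any script evaluated in the context mutates the same `phApp`, which is why looping `ct$eval(js)` over all matching scripts works on the real page.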


Upvotes: 1

Allan Cameron

Reputation: 173858

It's not possible to get fully reliable results from web scraping a site like this, because you have no control over the content you are scraping. However, doing it by substring index is a disaster: almost any change in the dynamic content will break your code. (In fact, your code didn't work for me, because the JSON string I was served was slightly shorter, so I got trailing garbage that wouldn't parse.)

A more robust solution (though see caveat below) is to find useful delimiters at the start and end of the json string which you can use to cut away the parts you don't want.

urlString <- "https://careers.microsoft.com/us/en/search-results?"
getHTML <- xml2::read_html(urlString)

# Keep only the text between "phApp.ddo = " and the following "; phApp"
json <- jsonlite::fromJSON(strsplit(strsplit(rvest::html_text(getHTML),
                                             "phApp\\.ddo = ")[[1]][2],
                                    "; phApp")[[1]][1])

json$eagerLoadRefineSearch$data$jobs

#> # A tibble: 50 x 27
#>    country subCategory industry title multi_location type  orgFunction
#>    <chr>   <chr>       <lgl>    <chr> <list>         <lgl> <lgl>      
#>  1 United~ Software E~ NA       Prin~ <chr [1]>      NA    NA         
#>  2 United~ Art         NA       Lead~ <chr [1]>      NA    NA         
#>  3 India   Support En~ NA       Supp~ <chr [1]>      NA    NA         
#>  4 Romania Support En~ NA       Micr~ <chr [2]>      NA    NA         
#>  5 China   Solution S~ NA       Seni~ <chr [1]>      NA    NA         
#>  6 United~ Software E~ NA       Soft~ <chr [1]>      NA    NA         
#>  7 India   Support En~ NA       Supp~ <chr [1]>      NA    NA         
#>  8 United~ Software E~ NA       Seni~ <chr [1]>      NA    NA         
#>  9 Japan   Marketing ~ NA       Full~ <chr [1]>      NA    NA         
#> 10 United~ Software E~ NA       Seni~ <chr [1]>      NA    NA         
#> # ... with 40 more rows, and 20 more variables: experience <chr>,
#> #   locale <chr>, multi_location_array <list>, jobSeqNo <chr>,
#> #   postedDate <chr>, searchresults_display <lgl>,
#> #   descriptionTeaser <chr>, dateCreated <chr>, state <chr>,
#> #   targetLevel <chr>, jd_display <lgl>, reqId <lgl>, badge <chr>,
#> #   jobId <chr>, isMultiLocation <lgl>, jobVisibility <list>,
#> #   mostpopular <dbl>, location <chr>, category <chr>,
#> #   locationLatlong <lgl>

I agree it would be better if you could request just the JSON, but in this case the page is constructed server-side: there is no standalone XHR request to an API that delivers JSON, so you need to carve the JSON out of the served HTML. Regex isn't ideal for this, but it's far better than snipping fixed-length strings.
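The delimiter-carving step can be checked offline on a synthetic string (the content here is made up purely to show the split logic, not taken from the real page):

```r
# A toy page text with JSON wedged between the two delimiters
txt <- 'var x = 1; phApp.ddo = {"a":1,"b":[2,3]}; phApp.other = 2;'

# Same nested strsplit as above: take what follows "phApp.ddo = ",
# then drop everything from the next "; phApp" onward
json_str <- strsplit(strsplit(txt, "phApp\\.ddo = ")[[1]][2],
                     "; phApp")[[1]][1]

jsonlite::fromJSON(json_str)  # a list with elements a and b
```

Because the delimiters are anchored to the page's own markup rather than to character offsets, the extraction survives changes in the length of the JSON payload.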

Upvotes: 3
