Reputation: 1075
I am trying to extract content from https://careers.microsoft.com/us/en/search-results and get the title, info, etc. from the page:
library(rvest)  # provides html_text() and the %>% pipe

urlString <- "https://careers.microsoft.com/us/en/search-results?"
getHTML <- xml2::read_html(urlString)

# Brittle: relies on fixed character positions in the served page source
t2 <- getHTML %>% html_text() %>% stringr::str_sub(start = 8027, end = 65679)
jsonWeb <- jsonlite::fromJSON(t2)
df <- jsonWeb$data$jobs
Is there a more elegant way to do it, like extracting the JSON of the phApp.ddo object? Thank you so much.
Upvotes: 0
Views: 109
Reputation: 1579
Use the V8 package to run the JS scripts on the page to get the phApp object:
library(rvest)
library(V8)

pg <- read_html("https://careers.microsoft.com/us/en/search-results")

# Grab every <script> block that mentions phApp
scripts <- pg %>%
  html_nodes(xpath = "//script[contains(., 'phApp')]") %>%
  html_text()

# Evaluate them in an embedded JS engine, then pull the object back into R
ct <- v8()
ct$eval("var phApp = {}")
for (js in scripts) ct$eval(js)

data <- ct$get("phApp")
jobs <- data$ddo$eagerLoadRefineSearch$data$jobs
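One caveat worth flagging: inline scripts on pages like this sometimes reference browser-only globals (`window`, `document`), which would throw in a bare V8 context and abort the loop. A defensive variant of the same approach (a sketch, not part of the original answer) skips failing scripts instead of stopping:

```r
library(rvest)
library(V8)

pg <- read_html("https://careers.microsoft.com/us/en/search-results")
scripts <- pg %>%
  html_nodes(xpath = "//script[contains(., 'phApp')]") %>%
  html_text()

ct <- v8()
ct$eval("var phApp = {}")

# Skip any script that errors (e.g. one touching window/document,
# which don't exist in a bare V8 context) rather than aborting.
for (js in scripts) {
  tryCatch(ct$eval(js), error = function(e) invisible(NULL))
}

data <- ct$get("phApp")
jobs <- data$ddo$eagerLoadRefineSearch$data$jobs
```

The `phApp$ddo$eagerLoadRefineSearch$data$jobs` path is whatever the site serves today; if Microsoft restructures the payload, the final line is the part that will need updating.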
Upvotes: 1
Reputation: 173858
It's not possible to get reliable results from web scraping a site like this, because you have no control over the content you are scraping. However, doing it by substring index is a disaster: almost any change in the dynamic content will break your code. (In fact, your code didn't work for me, because the JSON string I was served was slightly shorter, so I got trailing garbage that wouldn't parse.)
A more robust solution (though see the caveat below) is to find useful delimiters at the start and end of the JSON string which you can use to cut away the parts you don't want.
library(rvest)  # provides html_text()

urlString <- "https://careers.microsoft.com/us/en/search-results?"
getHTML <- xml2::read_html(urlString)

# Cut at the "phApp.ddo = " prefix, then again at the "; phApp" suffix
json <- jsonlite::fromJSON(
  strsplit(strsplit(html_text(getHTML), "phApp\\.ddo = ")[[1]][2],
           "; phApp")[[1]][1]
)
json$eagerLoadRefineSearch$data$jobs
#> # A tibble: 50 x 27
#> country subCategory industry title multi_location type orgFunction
#> <chr> <chr> <lgl> <chr> <list> <lgl> <lgl>
#> 1 United~ Software E~ NA Prin~ <chr [1]> NA NA
#> 2 United~ Art NA Lead~ <chr [1]> NA NA
#> 3 India Support En~ NA Supp~ <chr [1]> NA NA
#> 4 Romania Support En~ NA Micr~ <chr [2]> NA NA
#> 5 China Solution S~ NA Seni~ <chr [1]> NA NA
#> 6 United~ Software E~ NA Soft~ <chr [1]> NA NA
#> 7 India Support En~ NA Supp~ <chr [1]> NA NA
#> 8 United~ Software E~ NA Seni~ <chr [1]> NA NA
#> 9 Japan Marketing ~ NA Full~ <chr [1]> NA NA
#> 10 United~ Software E~ NA Seni~ <chr [1]> NA NA
#> # ... with 40 more rows, and 20 more variables: experience <chr>,
#> # locale <chr>, multi_location_array <list>, jobSeqNo <chr>,
#> # postedDate <chr>, searchresults_display <lgl>,
#> # descriptionTeaser <chr>, dateCreated <chr>, state <chr>,
#> # targetLevel <chr>, jd_display <lgl>, reqId <lgl>, badge <chr>,
#> # jobId <chr>, isMultiLocation <lgl>, jobVisibility <list>,
#> # mostpopular <dbl>, location <chr>, category <chr>,
#> # locationLatlong <lgl>
I agree it would be better if you could request just the JSON, but in this case the page is constructed server-side: there is no standalone XHR request to an API that delivers JSON, so you need to carve the JSON out of the served HTML. Regex isn't ideal for this, but it's far better than snipping fixed-length strings.
Upvotes: 3