Reputation: 209
I have a df with two columns: id and url. id contains project ids, and url contains the API links that I would like to use to scrape the ids of the parent projects. Here is a sample of my df:
df <- structure(list(id = c("P173165", "P175875", "P175841", "P175730"
), url = c("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en",
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en",
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en",
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"))
> df
id url
1: P173165 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en
2: P175875 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en
3: P175841 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en
4: P175730 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en
@Sirius suggested that I can scrape the parent project ids with the following code:
library(jsonlite)
#let's do an example for row 1
json_data <- fromJSON("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en")
json_data$projects[["P173165"]]$parentprojid
As you can see, I pass in the url from the first row and then index the result by the id from the first row. This code outputs the parent project id:
[1] "P147665"
I want to write code that automates this process for every row and builds a vector of the parent projects' ids. I would then add this vector to my df as a third column. This is what I want to achieve:
id url par_proj_id
1: P173165 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en P147665
2: P175875 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en P173883
3: P175841 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en P170267
4: P175730 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en P173799
I guess I should be using a for loop here, but I'm not sure how. Any ideas? I'd appreciate any help a lot.
Upvotes: 0
Views: 43
Reputation: 84465
You can put the request into a function and then use one of purrr's map2 functions to pass in each child id together with its url. This is more concise and more idiomatic R than a for loop; the run time is dominated by the HTTP requests either way.
library(jsonlite)
library(purrr)
get_parent_id <- function(child_id, url) {
  json_data <- jsonlite::fromJSON(url)
  return(json_data$projects[[child_id]]$parentprojid)
}
df <- structure(list(id = c("P173165", "P175875", "P175841", "P175730"
), url = c("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en",
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en",
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en",
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"))
# map2_chr returns a character vector; plain map2 would create a list column
df$par_proj_id <- purrr::map2_chr(df$id, df$url, get_parent_id)
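If some projects have no parent, or a request fails, the lookup above would likely stop with an error partway through. A minimal sketch of a more tolerant variant follows. The helper names extract_parent_id and get_parent_id_safe are hypothetical (not part of jsonlite or purrr), and it assumes a missing parent shows up as an absent parentprojid field, which I have not verified against the API:

```r
library(jsonlite)

# Hypothetical helper: pull the parent id out of an already-parsed response.
# Returns NA when the project entry or the parentprojid field is absent.
extract_parent_id <- function(json_data, child_id) {
  parent <- json_data$projects[[child_id]]$parentprojid
  if (is.null(parent) || length(parent) == 0) {
    NA_character_
  } else {
    as.character(parent[[1]])
  }
}

# Hypothetical wrapper: fetch and parse the url, falling back to NA
# if the request or the JSON parsing raises an error.
get_parent_id_safe <- function(child_id, url) {
  tryCatch(
    extract_parent_id(jsonlite::fromJSON(url), child_id),
    error = function(e) NA_character_
  )
}

# Usage with the df from the question (needs network access):
# df$par_proj_id <- purrr::map2_chr(df$id, df$url, get_parent_id_safe)
```

Rows that could not be resolved then end up as NA in par_proj_id instead of aborting the whole run, so you can filter them with is.na() and retry later.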
Upvotes: 1