Nellicopter
Nellicopter

Reputation: 83

Find the closest value for a certain year in R

I have this type of data:

iso3 year    UHC         cata10
AFG 2010    0.3551409   NA
AFG 2011    0.3496452   NA
AFG 2012    0.3468012   NA
AFG 2013    0.3567721   14.631331
AFG 2014    0.3647436   NA
AFG 2015    0.3717983   NA
AFG 2016    0.3855273   4.837534
AFG 2017    0.3948606   NA
AGO 2011    0.3250651   12.379809
AGO 2012    0.3400455   NA
AGO 2013    0.3397722   NA
AGO 2014    0.3385741   NA
AGO 2015    0.3521086   16.902584
AGO 2016    0.3636765   NA
AGO 2017    0.3764945   NA

and I would like to find the closest value to year 2012 and 2017 (+ ou - 2 years, i.e. for 2012 it can be 2010, 2011, 2013 or 2014 data) for cata10 variable. The output should be :

iso3year_UHC    UHC         year_cata   cata10   
AFG 2012        0.3468012   2013        14.631331
AFG 2017        0.3948606   2016        4.837534
AGO 2012        0.3400455   2011        12.379809
AGO 2017        0.3764945   2015        16.902584

EDIT: Note that I should have NA is there is no data 2 years before or after the reference year.

I have tried tones of commands since two days but could not manage to find a solution. Could you please advice on the type of commands to try?

Thank you very much,

N.

Upvotes: 1

Views: 507

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270055

Here are three approaches. The first one is the clearest as it shows that the problem is really an aggregated and filtered self-join and directly models this and automatically handles the edge case mentioned in the comments without additional code. The second one uses a lapply loop to get the desired effect but it involves more tedious manipulation although it does have the advantage of zero package dependencies. The last one gets around the fact that dplyr lacks complex self joins by performing a left join twice.

1) sqldf Using DF defined reproducibly in the Note at the end perform a self join such that the difference in years is -2, -1, 1 or 2 and the iso3 codes are the same and cata10 is not NA in matching instance and among those rows we use min(...) to find the row having the minimum absolute difference in the year. This uses the fact that SQLite has the feature that min(...) will cause the entire row to be returned that satisfies the minimizing condition. Finally take only the 2012 and 2017 rows. The ability of SQL to directly model the constraints using a complex join allows us to directly model the requirements into code.

library(sqldf)

sqldf("select 
      a.iso3year iso3year_UHC, 
      a.UHC, 
      substr(b.iso3year, 5, 8) year_cata, 
      b.cata10, 
      substr(a.iso3year, 5, 8) year, 
      min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value 
    from DF a  
    left join DF b on year - year_cata in (-2, -1, 1, 2) and
      substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
      b.cata10 is not null
    group by a.iso3year
    having year in ('2012', '2017')")[1:4]

giving:

  iso3year_UHC       UHC year_cata    cata10
1     AFG 2012 0.3468012      2013 14.631331
2     AFG 2017 0.3948606      2016  4.837534
3     AGO 2012 0.3400455      2011 12.379809
4     AGO 2017 0.3764945      2015 16.902584

2) Base R This solution uses only base R. We first create year and iso variables by breaking up the iso3year into two parts. ix is an index into DF giving the rows having 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame which lapply returns as a list of rows, L. Finally we rbind those rows together. This is not as straight forward as (1) but does have the advantage of no package dependencies.

to.year <- function(x) as.numeric(substr(x, 5, 8))
year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))
L <- lapply(ix, function(i) {
  DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
  if (nrow(DF0)) {
    with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")], 
      data.frame(iso3year_UHC = DF$iso3year[i], 
               UHC = DF$UHC[i], 
               year_cata = as.numeric(substr(iso3year, 5, 8)), 
               cata10))
  } else {
      data.frame(iso3year_UHC = DF$iso3year[i], 
               UHC = DF$UHC[i], 
               year_cata = NA,
               cata10 = NA)
  }
})
do.call("rbind", L)

giving:

  iso3year_UHC       UHC year_cata    cata10
1     AFG 2012 0.3468012      2013 14.631331
2     AFG 2017 0.3948606      2016  4.837534
3     AGO 2012 0.3400455      2011 12.379809
4     AGO 2017 0.3764945      2015 16.902584

3) dplyr/tidyr

First separate iso3year into iso and year columns giving DF2. Then pick out the 2012 and 2017 rows giving DF3. Now left join DF3 to DF2 using iso and get those rows for cata10 in the joined instance that are not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row having least distance in years and select out the desired columns giving DF4 Finally left join DF3 with DF4 which will fill out any rows for which there was no match.

library(dplyr)
library(tidyr)

DF2 <- DF %>%
  separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)

DF3 <- DF2 %>%
  filter(year %in% c(2012, 2017))

DF4 <- DF3 %>%
  left_join(DF2, "iso") %>%
  drop_na(cata10.y) %>%
  filter(abs(year.x - year.y) %in% 1:2) %>%
  group_by(iso3year.x) %>%
  slice(which.min(abs(year.x - year.y))) %>%
  ungroup %>%
  select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)

DF3 %>% 
  select(iso3year, UHC) %>%
  left_join(DF4,  c("iso3year", "UHC"))

giving:

# A tibble: 4 x 4
  iso3year   UHC year_cata cata10
  <chr>    <dbl>     <int>  <dbl>
1 AFG 2012 0.347      2013  14.6 
2 AFG 2017 0.395      2016   4.84
3 AGO 2012 0.340      2011  12.4 
4 AGO 2017 0.376      2015  16.9 

Note

Lines <- "iso3year    UHC         cata10
AFG 2010    0.3551409   NA
AFG 2011    0.3496452   NA
AFG 2012    0.3468012   NA
AFG 2013    0.3567721   14.631331
AFG 2014    0.3647436   NA
AFG 2015    0.3717983   NA
AFG 2016    0.3855273   4.837534
AFG 2017    0.3948606   NA
AGO 2011    0.3250651   12.379809
AGO 2012    0.3400455   NA
AGO 2013    0.3397722   NA
AGO 2014    0.3385741   NA
AGO 2015    0.3521086   16.902584
AGO 2016    0.3636765   NA
AGO 2017    0.3764945   NA"
DF <- read.csv(text = gsub("  +", ",", Lines), as.is = TRUE)

Upvotes: 0

Dan Chaltiel
Dan Chaltiel

Reputation: 8523

Here is an answer with dplyr only:

library(tidyverse)

uhc_comb = read.table(header = T, text = "
iso3 year    UHC         cata10
AFG  2010    0.3551409   NA
AFG  2011    0.3496452   NA
AFG  2012    0.3468012   NA
AFG  2013    0.3567721   14.631331
AFG  2014    0.3647436   NA
AFG  2015    0.3717983   NA
AFG  2026    0.3855273   4.837534     #Year is 2026 for the example
AFG  2017    0.3948606   NA
AGO  2011    0.3250651   12.379809
AGO  2012    0.3400455   NA
AGO  2013    0.3397722   NA
AGO  2014    0.3385741   NA
AGO  2015    0.3521086   16.902584
AGO  2016    0.3636765   NA
AGO  2017    0.3764945   NA")

uhc_comb2 = uhc_comb %>% 
  pivot_longer(cols=c("UHC","cata10")) %>% #pivot UHC and cata10 to long format as columns "name" and "value"
  filter(!is.na(value)) %>% #remove missing
  group_by(iso3, name) %>% #for each iso3 and for each variable name (UHC and cata10)
  mutate(dist=pmin(abs(year-2012),abs(year-2017))) %>% #compute the distance between the year and the targets and keep only the lowest
  # filter(dist<=2) %>% #remove
  top_n(-2, dist) %>% #select the minimal distance (in each group)
  mutate(year=ifelse(dist>2, NA, year),
         value=ifelse(dist>2, NA, value)) %>% #infer NA if  distance is too high
  select(-dist) #discard the now useless variable

uhc_comb2 %>%
  pivot_wider(id_cols = iso3, values_from = c("year", "value")) %>% #pivot to wide again
  unnest #since there are several values, unnest the lists from the dataframe

This will output some warnings but they are not significant. I'm not sure it is possible to remove them.

If you want to understand this better, run each line one by one. Pivoting tables is a tough brain gymnastic in the beginning.

EDIT: this will get you the right output with no warnings:

uhc_comb2 %>%
  pivot_wider(id_cols = iso3, 
              values_from = c("year", "value"), 
              values_fn = list(value = list, year = list)) %>% 
  unnest(cols = c(year_cata10, year_UHC, value_cata10, value_UHC))

Upvotes: 0

Related Questions