Reputation: 83
I have this type of data:
iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA
and I would like to find the closest value to years 2012 and 2017 (+ or - 2 years, i.e. for 2012 it can be data from 2010, 2011, 2013 or 2014) for the cata10 variable. The output should be:
iso3year_UHC UHC year_cata cata10
AFG 2012 0.3468012 2013 14.631331
AFG 2017 0.3948606 2016 4.837534
AGO 2012 0.3400455 2011 12.379809
AGO 2017 0.3764945 2015 16.902584
EDIT: Note that I should have NA if there is no data 2 years before or after the reference year.
I have tried tons of commands over the past two days but could not manage to find a solution. Could you please advise on the type of commands to try?
Thank you very much,
N.
Upvotes: 1
Views: 507
Reputation: 270055
Here are three approaches. The first one is the clearest, as it shows that the problem is really an aggregated and filtered self-join, models this directly, and automatically handles the edge case mentioned in the comments without additional code. The second one uses an lapply loop to get the desired effect, but it involves more tedious manipulation, although it does have the advantage of zero package dependencies. The last one gets around the fact that dplyr lacks complex self-joins by performing a left join twice.
1) sqldf

Using DF, defined reproducibly in the Note at the end, perform a self-join such that the difference in years is -2, -1, 1 or 2, the iso3 codes are the same, and cata10 is not NA in the matching instance; among those rows, use min(...) to find the row with the minimum absolute difference in year. This relies on a SQLite feature whereby min(...) causes the entire row satisfying the minimizing condition to be returned. Finally, take only the 2012 and 2017 rows. SQL's ability to express the constraints in a complex join lets the requirements map directly into code.
library(sqldf)

sqldf("select
         a.iso3year iso3year_UHC,
         a.UHC,
         substr(b.iso3year, 5, 8) year_cata,
         b.cata10,
         substr(a.iso3year, 5, 8) year,
         min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value
       from DF a
       left join DF b on year - year_cata in (-2, -1, 1, 2) and
         substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
         b.cata10 is not null
       group by a.iso3year
       having year in ('2012', '2017')")[1:4]
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
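As an aside, the SQLite bare-column behavior that the query above relies on can be demonstrated in a minimal sketch on toy data (not the question's), assuming the sqldf package is installed:

```r
library(sqldf)  # assumes sqldf is installed

# SQLite "bare column" rule: with min() in the select list, the other
# selected columns are taken from the row that achieves the minimum
d <- data.frame(g = c("a", "a", "b", "b"),
                dist = c(3, 1, 2, 5),
                y = c(10, 20, 30, 40))

sqldf("select g, y, min(dist) mn from d group by g")
# within each group, y comes from the row where dist is smallest
```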
2) Base R

This solution uses only base R. We first create year and iso variables by breaking up iso3year into its two parts. ix is an index into DF giving the rows whose year is 2012 or 2017. For each of those rows we find the nearest year having a cata10 value and create one row of the output data frame, which lapply returns as a list of rows, L. Finally, we rbind those rows together. This is not as straightforward as (1) but does have the advantage of no package dependencies.
to.year <- function(x) as.numeric(substr(x, 5, 8))

year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))

L <- lapply(ix, function(i) {
  DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
  if (nrow(DF0)) {
    with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")],
         data.frame(iso3year_UHC = DF$iso3year[i],
                    UHC = DF$UHC[i],
                    year_cata = as.numeric(substr(iso3year, 5, 8)),
                    cata10))
  } else {
    data.frame(iso3year_UHC = DF$iso3year[i],
               UHC = DF$UHC[i],
               year_cata = NA,
               cata10 = NA)
  }
})

do.call("rbind", L)
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
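As a standalone sanity check of the ±2-year window used above (toy values, not part of the answer), the nearest-year selection can be exercised on its own:

```r
# Years that actually have a cata10 value, and a reference year
cand <- c(2009, 2013, 2016)
ref <- 2012

# Keep only candidates 1 or 2 years away from the reference
ok <- abs(cand - ref) %in% 1:2

# Pick the closest remaining year, or NA when none qualifies
nearest <- if (any(ok)) cand[ok][which.min(abs(cand[ok] - ref))] else NA
nearest  # 2013: 2009 and 2016 are more than 2 years away
```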
3) dplyr/tidyr

First separate iso3year into iso and year columns, giving DF2. Then pick out the 2012 and 2017 rows, giving DF3. Now left-join DF3 to DF2 on iso and keep those rows for which cata10 in the joined instance is not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row having the least distance in years and select out the desired columns, giving DF4. Finally, left-join DF3 with DF4, which will fill out any rows for which there was no match.
library(dplyr)
library(tidyr)

DF2 <- DF %>%
  separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)

DF3 <- DF2 %>%
  filter(year %in% c(2012, 2017))

DF4 <- DF3 %>%
  left_join(DF2, "iso") %>%
  drop_na(cata10.y) %>%
  filter(abs(year.x - year.y) %in% 1:2) %>%
  group_by(iso3year.x) %>%
  slice(which.min(abs(year.x - year.y))) %>%
  ungroup %>%
  select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)

DF3 %>%
  select(iso3year, UHC) %>%
  left_join(DF4, c("iso3year", "UHC"))
giving:
# A tibble: 4 x 4
iso3year UHC year_cata cata10
<chr> <dbl> <int> <dbl>
1 AFG 2012 0.347 2013 14.6
2 AFG 2017 0.395 2016 4.84
3 AGO 2012 0.340 2011 12.4
4 AGO 2017 0.376 2015 16.9
Note

The input DF in reproducible form. The iso3 code and year are pasted into a single iso3year column, since the code above splits it back apart with substr:

Lines <- "iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA"

DF0 <- read.table(text = Lines, header = TRUE, as.is = TRUE)
DF <- data.frame(iso3year = paste(DF0$iso3, DF0$year),
                 UHC = DF0$UHC,
                 cata10 = DF0$cata10,
                 stringsAsFactors = FALSE)
Upvotes: 0
Reputation: 8523
Here is an answer using only the tidyverse (dplyr and tidyr):
library(tidyverse)
uhc_comb = read.table(header = T, text = "
iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2026 0.3855273 4.837534 #Year is 2026 for the example
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA")
uhc_comb2 = uhc_comb %>%
  pivot_longer(cols = c("UHC", "cata10")) %>% # pivot UHC and cata10 to long format as columns "name" and "value"
  filter(!is.na(value)) %>%                   # remove missing
  group_by(iso3, name) %>%                    # for each iso3 and each variable name (UHC and cata10)
  mutate(dist = pmin(abs(year - 2012), abs(year - 2017))) %>% # distance between the year and the nearer target
  top_n(-2, dist) %>%                         # select the minimal distance (in each group)
  mutate(year = ifelse(dist > 2, NA, year),
         value = ifelse(dist > 2, NA, value)) %>% # infer NA if the distance is too high
  select(-dist)                               # discard the now useless variable

uhc_comb2 %>%
  pivot_wider(id_cols = iso3, values_from = c("year", "value")) %>% # pivot to wide again
  unnest # since there are several values, unnest the lists from the data frame
This will output some warnings, but they are not significant; I'm not sure it is possible to remove them.
If you want to understand this better, run it line by line. Pivoting tables is tough mental gymnastics at first.
EDIT: this will get you the right output with no warnings:
uhc_comb2 %>%
  pivot_wider(id_cols = iso3,
              values_from = c("year", "value"),
              values_fn = list(value = list, year = list)) %>%
  unnest(cols = c(year_cata10, year_UHC, value_cata10, value_UHC))
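For readers new to pivoting, here is a minimal round trip on toy data (assuming the tidyr package is installed) showing what pivot_longer and pivot_wider do to the shape of a data frame:

```r
library(tidyr)  # assumes tidyr is installed

d <- data.frame(iso3 = c("AFG", "AGO"),
                UHC = c(0.35, 0.33),
                cata10 = c(14.6, 12.4))

# Long format: one row per (iso3, variable) pair,
# with the variable in "name" and its value in "value"
long <- pivot_longer(d, cols = c("UHC", "cata10"))

# Wide format: back to one column per variable, reversing the pivot
wide <- pivot_wider(long, names_from = name, values_from = value)
```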
Upvotes: 0