Reputation: 2167
I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality.
Using reshape2
I have "melted" those datasets into a single dataframe.
Example dataset:
d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE",
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L,
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L,
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5,
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA,
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"),
source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council",
"Eurostat", "OECD"), class = "factor")), .Names = c("cntry",
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA,
-19L))
d
# cntry year indicator sex source
# 1 BE 1960 5.5 male Eurostat
# 2 BE 1970 1.2 male Eurostat
# 3 BE 1980 1.5 male Eurostat
# 4 DE 1960 NA male Eurostat
# 5 DE 1970 1.4 male Eurostat
# 6 GE 1960 NA male Eurostat
# 7 GE 1970 NA male Eurostat
# 8 BE 1960 5.5 male OECD
# 9 BE 1970 1.2 male OECD
# 10 DE 1960 2.3 male OECD
# 11 DE 1970 1.4 male OECD
# 12 GE 1960 NA male OECD
# 13 GE 1970 1.4 male OECD
# 14 BE 1960 NA male Council
# 15 BE 1970 NA male Council
# 16 DE 1960 2.3 male Council
# 17 DE 1970 1.4 male Council
# 18 GE 1970 1.4 male Council
# 19 GE 1980 NA male Council
I was hoping I could uses cast()
with fun.aggregate
to convert this long dataset into the wide format, while selecting the most high quality dataset (Eurostat > OECD > Council) for a given country-year combination to fill in the missings. Unfortunately I do not really understand how to work with such a custom aggregate function.
In other words, I want to reshape the dataset from a long to a wide format while merging multiple values depending on the value of a factor ("source"). Ideally it would work something as:
full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA NA NA
# GE 1960 NA male Council
# GE 1970 1.4 male OECD
# GE 1980 NA male Council
and optionally (or directly) into the wide format:
# cntry sex 1960 1970 1980
# BE male 5.5 1.2 1.5
# DE male 2.3 1.4 NA
# GE male NA 1.4 NA
Upvotes: 3
Views: 1682
Reputation: 118859
Assuming that the data is in the order you require, that is, column source
is ordered first by Eurostat
, then by OECD
and then by council
, I'd go about using data.table
in this manner:
require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator",
subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))
# cntry sex 1960 1970 1980
# 1: BE male 5.5 1.2 1.5
# 2: DE male 2.3 1.4 NA
# 3: GE male NA 1.4 NA
Upvotes: 2
Reputation: 4180
Perhaps the following could work as well:
library(reshape2)
x <- melt(d,id.vars=c("cntry","year","source","sex"))
y <- dcast(x,cntry+year+sex ~ source)
y$selected.value <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes=y$Council,no=y$OECD),no=y$Eurostat)
dcast(y,cntry + sex ~ year)
The source selection is made using a layered ifelse
statement. The indication of the source selected is lost with this approach, if that is an issue, a similar ifelse
statement can be added, creating the source origin variable:
y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
Upvotes: 1
Reputation: 52677
Here is another option:
library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
d2, cntry + sex ~ year, value.var="indicator",
fun.aggregate=function(x) {
if(!length(x)) return(NA_real_)
xs <- do.call(rbind, x)
xs <- xs[complete.cases(xs), ]
if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )
Produces:
cntry sex 1960 1970 1980
1 BE male 105.5 101.2 101.5
2 DE male 2.3 101.4 NA
3 GE male NA 1.4 NA
Note I added 100 to "Eurostat" value to make them distinguishable from the others since in this sample set they seemed to be equal.
Basically, we cheat by turning the indicator
column into a column of list items containing both the indicator and the source, and then we use fun.aggregate
to pick the item from each group with the lowest source value (note we reset the factors so the most desirable source has the lowest level).
Upvotes: 0
Reputation: 193657
I am not sure if this meets all of your expectations, but it sounds like you're looking for something like the following:
toMerge <- expand.grid(cntry = c("BE", "DE", "GE"),
year = c(1960, 1970, 1980),
source = c("Eurostat", "OECD", "Council"),
sex = "male")
d2 <- merge(d, toMerge, all = TRUE)
d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")],
FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D
# cntry year sex source indicator
# 2 BE 1960 male Eurostat 5.5
# 5 BE 1970 male Eurostat 1.2
# 8 BE 1980 male Eurostat 1.5
# 14 DE 1970 male Eurostat 1.4
# 17 DE 1980 male Eurostat NA
# 20 GE 1960 male Eurostat NA
# 26 GE 1980 male Eurostat NA
# 12 DE 1960 male OECD 2.3
# 24 GE 1970 male OECD 1.4
library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
# cntry 1960 1970 1980
# 1 BE 5.5 1.2 1.5
# 2 DE 2.3 1.4 NA
# 3 GE NA 1.4 NA
Upvotes: 1