Reputation: 323
I have a dataset that looks like this:
set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 2)
year <- rep(c(1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000), 2)
value <- sample(1:10000, size=length(origin), replace=TRUE)
test.df <- as.data.frame(cbind(origin, year, value))
rm(origin, year, value)
Then I have two lists.
The first one is a list of countries by region, built with the ISOcodes library as follows:
library("ISOcodes")
list.continent <- list(asia = c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia"),
africa = c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa"),
europe = c("Eastern Europe", "Northern Europe", "Channel Islands", "Southern Europe", "Western Europe"),
oceania = c("Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
northamerica = c("Northern America"),
latinamerica = c("South America", "Central America", "Caribbean"))
country.list.continent <- sapply(list.continent, function(item) {
region <- subset(UN_M.49_Regions, Name %in% item)
sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
return(sub$ISO_Alpha_3)
}, simplify = FALSE)
rm(list.continent)
And another list with the years:
year.list <- levels(as.factor(unique(test.df$year)))
I want to fill a matrix with computed numbers corresponding to a given area for a specific year. The matrix is as follows:
ncol <- length(year.list)
nrow <- length(country.list.continent)
matrix.extraction <- matrix(, nrow = nrow, ncol = ncol)
rownames(matrix.extraction) <- names(country.list.continent)
colnames(matrix.extraction) <- year.list
To do my computation I use a loop so that I can subset the dataset, which is too big otherwise. The loop runs over the years (equivalent to colnames(matrix.extraction)). The idea is to compute what share (in %) each country's value represents in each year. The computation part is simple enough and works well. My problem arises when I need to assign the value to each row.
for(i in 1:length(colnames(matrix.extraction))){
### I subset and compute what I want
table.temp <- test.df %>%
subset(year == colnames(matrix.extraction)[i]) %>%
group_by(origin) %>%
summarise(value = sum(value, na.rm = TRUE))
table.temp$percent <- prop.table(table.temp$value)
### then I need to attribute the wanted values
matrix.extraction["ROWNAME",i] <- table.temp %>%
subset(origin %in% country.list.continent$"ROWNAME") %>%
summarise(., sum = sum(percent))
}
I really have no idea how to do this.
The expected result is a matrix like:
             1998 2000
asia         here   NA
africa         NA   NA
europe         NA   NA
oceania        NA   NA
northamerica   NA   NA
latinamerica   NA   NA
with, instead of "here" in [1,1], the sum of the values of every country of the area in rowname for the year in colname.
Any help would be appreciated.
Upvotes: 1
Views: 504
Reputation: 887891
We could do this in tidyverse. Convert the named list to a two-column dataset (enframe or stack), then do a full_join with 'test.df' after filtering only the 'year' values contained in 'year.list'. Grouped by 'name' and 'year', get the sum of 'value' and spread it to 'wide' format:
library(tidyverse)
enframe(country.list.continent, value = "origin") %>%
unnest %>%
full_join(test.df %>%
filter(year %in% year.list)) %>%
group_by(name, year) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
spread(year, value, fill = 0) %>%
select(-4)
# A tibble: 6 x 3
# Groups: name [6]
# name `1998` `2000`
# <chr> <dbl> <dbl>
#1 africa 0 0
#2 asia 33038 18485
#3 europe 36658 35874
#4 latinamerica 14323 14808
#5 northamerica 15697 27405
#6 oceania 0 0
Or in base R, this can be done by stacking the list into a two-column data.frame, merging it with 'test.df' after subsetting, and creating a table with xtabs:
xtabs(value ~ ind + year, merge(stack(country.list.continent),
subset(test.df, year %in% year.list), by.x = "values", by.y = "origin"))
# year
#ind 1998 2000
# asia 33038 18485
# africa 0 0
# europe 36658 35874
# oceania 0 0
# northamerica 15697 27405
# latinamerica 14323 14808
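Here 'test.df' was created with data.frame() rather than as.data.frame(cbind(...)), so that 'year' and 'value' stay numeric:
test.df <- data.frame(origin, year, value)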
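To see why the merge above uses by.x = "values" and the xtabs formula refers to ind, it may help to look at what stack() returns for a named list. The following is a minimal sketch on a hand-written toy list (toy.regions and lookup are hypothetical names, not part of the original data):

```r
# Hypothetical toy list standing in for country.list.continent
toy.regions <- list(asia = c("CHN", "JPN"), europe = "DEU")

# stack() turns the named list into a two-column data.frame:
#   values = the list elements, ind = the name each element came from
lookup <- stack(toy.regions)
print(lookup)
```

The values column holds the country codes, so it is the natural key to merge against origin, while ind carries the region name that ends up as the row dimension of the table.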
Upvotes: 1
Reputation: 389275
Using a double sapply, we can loop over all combinations of year.list and country.list.continent and calculate the sum of value for each combination.
sapply(year.list, function(x) sapply(names(country.list.continent), function(y) {
with(test.df, sum(value[origin %in% country.list.continent[[y]] & year == x]))
}))
# 1998 2000
#asia 21759 20059
#africa 0 0
#europe 39700 35981
#oceania 0 0
#northamerica 21347 17324
#latinamerica 10847 8672
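The question mentions wanting each area's share (in %) per year rather than raw sums. As a rough sketch, assuming the matrix returned by the double sapply is stored in a variable, prop.table() with margin = 2 rescales each year column so it sums to 1. The data below is a hypothetical toy set with a hand-written region list (toy.df, toy.regions, toy.years, totals and shares are illustrative names), so it runs without ISOcodes:

```r
# Hypothetical toy data; region membership is written by hand here
toy.df <- data.frame(
  origin = c("DEU", "USA", "CHN", "DEU", "USA", "CHN"),
  year   = c(1998, 1998, 1998, 2000, 2000, 2000),
  value  = c(10, 30, 60, 20, 20, 40)
)
toy.regions <- list(asia = "CHN", europe = "DEU", northamerica = "USA")
toy.years <- as.character(sort(unique(toy.df$year)))

# Same double sapply as above: regions in rows, years in columns
totals <- sapply(toy.years, function(x) sapply(names(toy.regions), function(y) {
  with(toy.df, sum(value[origin %in% toy.regions[[y]] & year == x]))
}))

# Column-wise proportions: each year column sums to 1
shares <- prop.table(totals, margin = 2)
print(shares)
```

Multiplying shares by 100 gives percentages. A year column whose total is zero would yield NaN, which may need separate handling.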
If we are interested in a tidyverse solution:
library(tidyverse)
crossing(x = year.list, y = names(country.list.continent)) %>%
mutate(sum = map2_dbl(x, y, ~
test.df %>%
filter(year == .x & origin %in% country.list.continent[[.y]]) %>%
summarise(total = sum(value)) %>%
pull(total)))
# x y sum
# <chr> <chr> <dbl>
# 1 1998 africa 0
# 2 1998 asia 21759
# 3 1998 europe 39700
# 4 1998 latinamerica 10847
# 5 1998 northamerica 21347
# 6 1998 oceania 0
# 7 2000 africa 0
# 8 2000 asia 20059
# 9 2000 europe 35981
#10 2000 latinamerica 8672
#11 2000 northamerica 17324
#12 2000 oceania 0
You have numbers stored as factors in test.df, so we need to change them to actual numbers. Run the code below before applying the above methods.
test.df[-1] <- lapply(test.df[-1], function(x) as.numeric(as.character(x)))
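The detour through as.character() matters because calling as.numeric() directly on a factor returns the internal level codes, not the labels. A small illustration on a made-up factor (f, codes and nums are hypothetical names):

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "200", "10", "3000"))

codes <- as.numeric(f)                # level codes: 1 2 1 3
nums  <- as.numeric(as.character(f))  # the labels as numbers: 10 200 10 3000

print(codes)
print(nums)
```

In R 4.0.0 and later, as.data.frame() on a character matrix yields character columns instead of factors by default; the same conversion still works in that case, since as.character() leaves character vectors unchanged.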
Upvotes: 1