Reputation: 1304
I need to filter my data, that is somehow hierarchical according to some conditions.
my data on exports looks something like this, but for multiple countries and years.
df3dgt <- data.frame(
"Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA","EU", "EU","EU","EU","EU", "EU","EU","EU","EU"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU","USA", "USA", "USA","USA","USA", "USA", "USA","USA","USA"),
"Commodity code" = c("1", "11", "111", "112", "12","2", "21","211", "22", "3", "1", "11", "111", "112", "2", "21", "211", "212", "22"),
"Value" = c( 100, 50, 25, 5, 40, 200, 170, 170, 30, 220, 190, 190, 120, 30, 300, 200, 150, 50, 100),
stringsAsFactors = FALSE)
Commodity codes aggregate data at different levels. For instance, 111 (eg. apple) and 112 (e.g. bananas) are sub-groups of commodity 11 (e.g. fruit), similarly, 11 (fruit) and 12 (vegetables) are subcategories of 1 (e.g. food).
I need to filter the data to separate complete data from the rest.
I want to filter according to two conditions:
(1) filter the data where the "value" of the sub-commodity categories is equal to the value reported at the higher level of aggregation. For instance, Commodity code 1 of USA export to EU is incomplete. Commodity 112 (val=5) and commodity 111(val=25) do not aggregate to the value of commodity 11 (val=50). similarly the value of 11 (val=50) and 12 (val=40) do not aggregate to the value of commodity code 1 (100) Conversely, category 2 of EU export to US is complete. Commodity 211 (val=150) and 212 (val=50) aggregate to the level of commodity 21 (val=200). Also, the value of product category 21 (200) and 22 (100) aggregate to the level of Commodity 2.
2) I also what to filter separately the data that is only reported at higher levels of commodity code. For reference of which data is reported only at higher levels, please consider the illustratory list of commodity code below:
Comlist <- c("1", "11", "111", "112", "12","2", "21","211", "22","221", "3","31", "32", "311", "321")
Comlist <- as.data.frame(Comlist)
Hence, I want to filter, in the export between USA and EU, commodity 22 because I know a category 221 exist and it is not reported. Similarly, for category 3, that it is not reported in its lower levels.
To deal with (1) I am considering one level at the time (first two and three digit product category and then one and two). I first create a new variable for every level of product category
# create new variable Prodcat1
df1 <- df %>%
group_by(Reporter, Partner) %>%
mutate(Prodcat1 = str_extract(Product.cat., "^.{1}"))
# create new variable Prodcat2 for my 2nd level product category
df2 <- df1 %>%
group_by(Reporter, Partner) %>%
mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))
Then I filter
df2.Incomplete <- df2 %>%
group_by(Reporter, Partner, Prodcat2) %>%
filter(sum(Val[2:n()]) < Val[1])`
This, however, only includes data with the commodity code with two or three digits, while I would like to incorporate also the 1st digit of incomplete data. e.g. it reports rows with "commodity code" 111 and 11 but not the commodity code "1" of incomplete groups. Moreover, I am not sure to proceed to filter for case (2) considering that I have a hundred countries and product categories to consider.
thank you very much in advance for your help.
Upvotes: 0
Views: 126
Reputation: 5281
It is still a little bit confusing what your aim exactly is. But here is a function that can be of help:
Function
# check function
checkLowerLevels <- function (df = df3dgt, defaultValue = TRUE) {
# set column commodity.code to integer
class(df$Commodity.code) <- 'integer'
# checks
out <- vapply(1:nrow(df), function (k, defaultValue) {
# Checks at lower level if values add up to parent level
# if no lower exists, default value is TRUE.
#
# For a given incomplete parent return parent + children
# separated by ';' (can be split using strsplit).
currentCode <- .subset2(df, 3)[k]
currentRepo <- .subset2(df, 1)[k]
currentPart <- .subset2(df, 2)[k]
lowerLevels <- (.subset2(df, 3) > currentCode*10-1) & (.subset2(df, 3) < (currentCode+1)*10) & (.subset2(df, 1) == currentRepo) & (.subset2(df, 2) == currentPart)
check <- ifelse(!any(lowerLevels), defaultValue, sum(.subset2(df, 4L)[lowerLevels]) == .subset2(df, 4L)[k])
res <- NA_character_
if (check) return (c(res,check))
res <- paste(currentCode, paste(.subset2(df, 3L)[lowerLevels], collapse = ';'), sep = ';')
return(c(res,check))
}, character(2), defaultValue = defaultValue)
out <- t(out)
ans <- data.frame(check=out[,2], subset=out[,1], stringsAsFactors = FALSE)
class(ans$check) <- 'logical'
ans
}
Result
Here is an application:
# create new columns in data frame
df3dgt <- cbind(df3dgt, checkLowerLevels())
df3dgt
# Reporter Partner Commodity.code Value check subset
# 1 USA EU 1 100 FALSE 1;11;12
# 2 USA EU 11 50 FALSE 11;111;112
# 3 USA EU 111 25 TRUE <NA>
# 4 USA EU 112 5 TRUE <NA>
# 5 USA EU 12 40 TRUE <NA>
# 6 USA EU 2 200 TRUE <NA>
# 7 USA EU 21 170 TRUE <NA>
# 8 USA EU 211 170 TRUE <NA>
# 9 USA EU 22 30 TRUE <NA>
# 10 USA EU 3 220 TRUE <NA>
# 11 EU USA 1 190 TRUE <NA>
# 12 EU USA 11 190 FALSE 11;111;112
# 13 EU USA 111 120 TRUE <NA>
# 14 EU USA 112 30 TRUE <NA>
# 15 EU USA 2 300 TRUE <NA>
# 16 EU USA 21 200 TRUE <NA>
# 17 EU USA 211 150 TRUE <NA>
# 18 EU USA 212 50 TRUE <NA>
# 19 EU USA 22 100 TRUE <NA>
Explanation
checkLowerLevels
checks completeness for any code n
filtered by reporter & partner (such that n = 1
is treated once for USA -> EU
and once for 'EU -> USA')n
and the sublevels n1, n2, ..., nm
in the format n;n1;n2;...;nm
Upvotes: 1