Alex
Alex

Reputation: 1304

Filter (hierarchical) data with conditions and sub-categories

I need to filter my data, that is somehow hierarchical according to some conditions.

my data on exports looks something like this, but for multiple countries and years.

df3dgt <- data.frame(
"Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA","EU", "EU","EU","EU","EU", "EU","EU","EU","EU"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU","USA", "USA", "USA","USA","USA", "USA", "USA","USA","USA"), 
"Commodity code" = c("1", "11", "111", "112", "12","2", "21","211", "22", "3", "1", "11", "111", "112", "2", "21", "211", "212", "22"), 
 "Value" = c( 100, 50, 25, 5, 40, 200, 170, 170, 30, 220, 190, 190, 120, 30, 300, 200, 150, 50, 100), 
 stringsAsFactors = FALSE)

Commodity codes aggregate data at different levels. For instance, 111 (eg. apple) and 112 (e.g. bananas) are sub-groups of commodity 11 (e.g. fruit), similarly, 11 (fruit) and 12 (vegetables) are subcategories of 1 (e.g. food).

I need to filter the data to separate complete data from the rest.

I want to filter according to two conditions:

(1) filter the data where the "value" of the sub-commodity categories is equal to the value reported at the higher level of aggregation. For instance, Commodity code 1 of USA export to EU is incomplete. Commodity 112 (val=5) and commodity 111(val=25) do not aggregate to the value of commodity 11 (val=50). similarly the value of 11 (val=50) and 12 (val=40) do not aggregate to the value of commodity code 1 (100) Conversely, category 2 of EU export to US is complete. Commodity 211 (val=150) and 212 (val=50) aggregate to the level of commodity 21 (val=200). Also, the value of product category 21 (200) and 22 (100) aggregate to the level of Commodity 2.

2) I also what to filter separately the data that is only reported at higher levels of commodity code. For reference of which data is reported only at higher levels, please consider the illustratory list of commodity code below:

 Comlist <- c("1", "11", "111", "112", "12","2", "21","211", "22","221", "3","31", "32", "311", "321")
 Comlist <- as.data.frame(Comlist)

Hence, I want to filter, in the export between USA and EU, commodity 22 because I know a category 221 exist and it is not reported. Similarly, for category 3, that it is not reported in its lower levels.

To deal with (1) I am considering one level at the time (first two and three digit product category and then one and two). I first create a new variable for every level of product category

# create new variable Prodcat1
df1 <- df %>%
group_by(Reporter, Partner) %>%
mutate(Prodcat1 = str_extract(Product.cat., "^.{1}")) 

# create new variable Prodcat2 for my 2nd level product category
df2 <- df1 %>%
group_by(Reporter, Partner) %>%
mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))  

Then I filter

 df2.Incomplete <- df2 %>%
 group_by(Reporter, Partner, Prodcat2) %>%
 filter(sum(Val[2:n()]) < Val[1])`

This, however, only includes data with the commodity code with two or three digits, while I would like to incorporate also the 1st digit of incomplete data. e.g. it reports rows with "commodity code" 111 and 11 but not the commodity code "1" of incomplete groups. Moreover, I am not sure to proceed to filter for case (2) considering that I have a hundred countries and product categories to consider.

thank you very much in advance for your help.

Upvotes: 0

Views: 126

Answers (1)

niko
niko

Reputation: 5281

It is still a little bit confusing what your aim exactly is. But here is a function that can be of help:

Function

# check function
checkLowerLevels <- function (df = df3dgt, defaultValue = TRUE) {
  # set column commodity.code to integer
  class(df$Commodity.code) <- 'integer' 
  # checks
  out <- vapply(1:nrow(df), function (k, defaultValue) {
    # Checks at lower level if values add up to parent level
    # if no lower exists, default value is TRUE.
    # 
    # For a given incomplete parent return parent + children
    # separated by ';' (can be split using strsplit).
    currentCode <- .subset2(df, 3)[k]
    currentRepo <- .subset2(df, 1)[k]
    currentPart <- .subset2(df, 2)[k]
    lowerLevels <- (.subset2(df, 3) > currentCode*10-1) & (.subset2(df, 3) < (currentCode+1)*10) & (.subset2(df, 1) == currentRepo) & (.subset2(df, 2) == currentPart)
    check <- ifelse(!any(lowerLevels), defaultValue, sum(.subset2(df, 4L)[lowerLevels]) == .subset2(df, 4L)[k])
    res <- NA_character_
    if (check) return (c(res,check))
    res <- paste(currentCode, paste(.subset2(df, 3L)[lowerLevels], collapse = ';'), sep = ';')
    return(c(res,check)) 
  }, character(2), defaultValue = defaultValue)
  out <- t(out)
  ans <- data.frame(check=out[,2], subset=out[,1], stringsAsFactors = FALSE)
  class(ans$check) <- 'logical'
  ans
}

Result

Here is an application:

# create new columns in data frame
df3dgt <- cbind(df3dgt, checkLowerLevels())
df3dgt 
#    Reporter Partner Commodity.code Value check     subset
# 1       USA      EU              1   100 FALSE    1;11;12
# 2       USA      EU             11    50 FALSE 11;111;112
# 3       USA      EU            111    25  TRUE       <NA>
# 4       USA      EU            112     5  TRUE       <NA>
# 5       USA      EU             12    40  TRUE       <NA>
# 6       USA      EU              2   200  TRUE       <NA>
# 7       USA      EU             21   170  TRUE       <NA>
# 8       USA      EU            211   170  TRUE       <NA>
# 9       USA      EU             22    30  TRUE       <NA>
# 10      USA      EU              3   220  TRUE       <NA>
# 11       EU     USA              1   190  TRUE       <NA>
# 12       EU     USA             11   190 FALSE 11;111;112
# 13       EU     USA            111   120  TRUE       <NA>
# 14       EU     USA            112    30  TRUE       <NA>
# 15       EU     USA              2   300  TRUE       <NA>
# 16       EU     USA             21   200  TRUE       <NA>
# 17       EU     USA            211   150  TRUE       <NA>
# 18       EU     USA            212    50  TRUE       <NA>
# 19       EU     USA             22   100  TRUE       <NA>

Explanation

  • The function checkLowerLevels checks completeness for any code n filtered by reporter & partner (such that n = 1 is treated once for USA -> EU and once for 'EU -> USA')
  • To do so, it automatically looks one level lower and compares aggregated lower with upper values
  • If a code is incomplete, it also returns the code n and the sublevels n1, n2, ..., nm in the format n;n1;n2;...;nm

Upvotes: 1

Related Questions