Filter (hierarchical) data with conditions and sub-categories

Question

I need to filter my data, which is hierarchical, according to some conditions.

My data on exports looks something like this, but for multiple countries and years.

# Note: data.frame() turns the name "Commodity code" into the column Commodity.code
df3dgt <- data.frame(
  "Reporter" = c(rep("USA", 10), rep("EU", 9)),
  "Partner" = c(rep("EU", 10), rep("USA", 9)),
  "Commodity code" = c("1", "11", "111", "112", "12", "2", "21", "211", "22", "3",
                       "1", "11", "111", "112", "2", "21", "211", "212", "22"),
  "Value" = c(100, 50, 25, 5, 40, 200, 170, 170, 30, 220,
              190, 190, 120, 30, 300, 200, 150, 50, 100),
  stringsAsFactors = FALSE)

Commodity codes aggregate data at different levels. For instance, 111 (e.g. apples) and 112 (e.g. bananas) are sub-groups of commodity 11 (e.g. fruit); similarly, 11 (fruit) and 12 (vegetables) are subcategories of 1 (e.g. food).
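To make the hierarchy concrete, here is a minimal sketch that derives each code's level and parent. It assumes that every extra digit is exactly one level deeper, so the parent is the code without its last digit; the names `codes`, `parent` and `hierarchy` are only illustrative.

```r
# Illustrative codes taken from the example above
codes <- c("1", "11", "111", "112", "12")

# Assumption: one digit per level, so parent = code minus its last digit.
# Top-level (one-digit) codes have no parent.
parent <- ifelse(nchar(codes) > 1,
                 substr(codes, 1, nchar(codes) - 1),
                 NA_character_)

hierarchy <- data.frame(code = codes,
                        level = nchar(codes),
                        parent = parent,
                        stringsAsFactors = FALSE)
print(hierarchy)
```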

I need to filter the data to separate complete data from the rest.

I want to filter according to two conditions:

(1) Filter the data where the "Value" of the sub-commodity categories equals the value reported at the higher level of aggregation. For instance, commodity code 1 of USA exports to the EU is incomplete: commodities 111 (val=25) and 112 (val=5) do not add up to the value of commodity 11 (val=50). Similarly, 11 (val=50) and 12 (val=40) do not add up to the value of commodity code 1 (val=100). Conversely, category 2 of EU exports to the USA is complete: commodities 211 (val=150) and 212 (val=50) add up to the value of commodity 21 (val=200), and product categories 21 (val=200) and 22 (val=100) add up to the value of commodity 2 (val=300).

(2) I also want to filter out, separately, the data that is only reported at higher levels of the commodity code. As a reference for which data is reported only at higher levels, please consider the illustrative list of commodity codes below:

 Comlist <- c("1", "11", "111", "112", "12","2", "21","211", "22","221", "3","31", "32", "311", "321")
 Comlist <- as.data.frame(Comlist)

Hence, in the exports from the USA to the EU, I want to filter out commodity 22, because I know that a category 221 exists and it is not reported. The same goes for category 3, which is not reported at its lower levels.
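Condition (2) could be sketched roughly as follows, using base R only. This is one possible reading of the requirement: a reported code is flagged when the reference list knows at least one direct sub-category for it but none of those sub-categories is reported. The helpers `parent_of()` and `only_high_level()` are hypothetical names, and the sketch again assumes one digit per level.

```r
# Reference list of all known commodity codes (from the question)
comlist <- c("1", "11", "111", "112", "12", "2", "21", "211", "22", "221",
             "3", "31", "32", "311", "321")

# Codes reported for USA -> EU in the example data
reported <- c("1", "11", "111", "112", "12", "2", "21", "211", "22", "3")

# Assumption: parent = code minus its last digit ("" for top-level codes)
parent_of <- function(code) substr(code, 1, nchar(code) - 1)

only_high_level <- function(reported, comlist) {
  # TRUE if the reference list contains a direct sub-category of the code
  has_known_children <- reported %in% parent_of(comlist)
  # TRUE if at least one direct sub-category is actually reported
  has_reported_children <- reported %in% parent_of(reported)
  reported[has_known_children & !has_reported_children]
}

flagged <- only_high_level(reported, comlist)
print(flagged)
```

With many countries, the same function could be applied to the codes of each Reporter/Partner pair separately, for example after `split()`.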

To deal with (1), I am considering one level at a time (first the two- and three-digit product categories, then the one- and two-digit ones). I first create a new variable for every level of product category:

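library(dplyr)
library(stringr)

# create new variable Prodcat1: first digit of the commodity code
df1 <- df3dgt %>%
  group_by(Reporter, Partner) %>%
  mutate(Prodcat1 = str_extract(Commodity.code, "^.{1}"))

# create new variable Prodcat2 for my 2nd level product category
# (NA for one-digit codes, which have no second digit)
df2 <- df1 %>%
  group_by(Reporter, Partner) %>%
  mutate(Prodcat2 = str_extract(Commodity.code, "^.{2}"))

Then I filter:

# keep groups whose sub-categories sum to less than the first (parent) row
df2.Incomplete <- df2 %>%
  group_by(Reporter, Partner, Prodcat2) %>%
  filter(sum(Value[2:n()]) < Value[1])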

This, however, only returns rows whose commodity code has two or three digits, while I would also like to include the one-digit code of incomplete groups. E.g. it returns the rows with "Commodity code" 111 and 11, but not commodity code "1" of the incomplete groups. Moreover, I am not sure how to proceed with the filter for case (2), considering that I have a hundred countries and product categories to consider.

Thank you very much in advance for your help.
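One way to cover all levels at once, sketched here in base R and not tested beyond the example data, is to compare every code with the sum of its direct children instead of handling one digit length at a time. The column names `Parent`, `ChildSum` and `Incomplete` are made up for illustration, the sketch uses only the USA to EU rows, and it assumes one digit per level.

```r
# USA -> EU rows from the example data
df3dgt <- data.frame(
  Reporter = "USA", Partner = "EU",
  Commodity.code = c("1", "11", "111", "112", "12", "2", "21", "211", "22", "3"),
  Value = c(100, 50, 25, 5, 40, 200, 170, 170, 30, 220),
  stringsAsFactors = FALSE
)

# Assumption: parent = code minus its last digit ("" for top-level codes)
df3dgt$Parent <- substr(df3dgt$Commodity.code, 1,
                        nchar(df3dgt$Commodity.code) - 1)

# Sum of the direct children's values per Reporter/Partner/parent code
child_sums <- aggregate(Value ~ Reporter + Partner + Parent,
                        data = df3dgt, FUN = sum)
names(child_sums) <- c("Reporter", "Partner", "Commodity.code", "ChildSum")

# Attach the children's sum to each parent row (NA if no children reported)
checked <- merge(df3dgt, child_sums,
                 by = c("Reporter", "Partner", "Commodity.code"),
                 all.x = TRUE)

# A code is incomplete if its reported children sum to less than its value
checked$Incomplete <- !is.na(checked$ChildSum) &
  checked$ChildSum < checked$Value

# Keep incomplete codes together with their direct children
key <- paste(checked$Reporter, checked$Partner, checked$Commodity.code)
parent_key <- paste(checked$Reporter, checked$Partner, checked$Parent)
bad <- key[checked$Incomplete]
incomplete_rows <- checked[key %in% bad | parent_key %in% bad, ]
print(incomplete_rows[, c("Commodity.code", "Value", "ChildSum")])
```

Note that this only compares a code with its direct children; a parent whose own children add up correctly is not flagged even if a deeper level is incomplete, so that case would need an extra propagation step.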
