Pablo Herreros Cantis
Pablo Herreros Cantis

Reputation: 575

calculate sum of values in dataframe based on values in other columns

I have a dataframe in R in which values correspond to value estimates and their margin of error (MoE).

Column names consist of a pattern, an indicator character (e = estimate, m = margin of error) and an ID that matches estimate and margin of error.
So, the column names look like "XXXe1, XXXm1, XXXe2, XXXm2, ...".

Goal

I am trying to create a function to (for each row)

  1. Calculate the sum of the estimates. (That is pretty straightforward.)

  2. Calculate the aggregated margin of error. This is the square root of the sum of the squares of each MoE.

  3. Condition: the MoE of estimates marked as 0 should only be added once.

Examples:

So far, I have done the following to build a function that does this:

estimate_aggregator <- function(DF_to_write_on, New_column_name, source_df, pattern){

  subset_df <- source_df[, grepl(pattern, names(source_df))] # I subset all the columns named with the pattern, regardless of whether they are estimate or margin of error
  subset_df_e <- source_df[, grepl(paste0(pattern, "e"), names(source_df))] # I create a table with only the estimated values to perform the sum

  DF_to_write_on[paste0(New_column_name, "_e")]<- rowSums(subset_df_e) # I write a new column in the new DF with the rowSums of the estimates values, having calculated the new estimate

  return(DF)
}

What I am missing: a way to write in the new dataframe the result of selecting the XXXmYY values of those columns that have no 0 value in their corresponding estimate. If there is one or more 0 in the estimates, then I should include the MoE 123 in the calculation only once.

What would be the cleanest way to achieve this? I see that my struggle is on dealing with several columns at once and the fact that the values on the XXXeYY columns determine the selection of the XXXmYY ones.

Expected output

row1: DF_to_write_on[paste0(New_column_name,"_m") <- sqrt(176^2 + 117^2+22^2 + 123^2)
row2: DF_to_write_on[paste0(New_column_name,"_m") <- sqrt(123^2)

   B01001e1 B01001m1 B01001e2 B01001m2 B01001e3 B01001m3 B01001e4 B01001m4 B01001e5 B01001m5
15      566      176      371      117       14       22        0      123        0      123
20        0      123        0      123        0      123        0      123        0      123

Data

structure(list(B01001e1 = c(1691L, 2103L, 975L, 2404L, 866L, 
2140L, 965L, 727L, 1602L, 1741L, 948L, 1771L, 1195L, 1072L, 566L, 
1521L, 2950L, 770L, 1624L, 0L), B01001m1 = c(337L, 530L, 299L, 
333L, 264L, 574L, 227L, 266L, 528L, 498L, 320L, 414L, 350L, 385L, 
176L, 418L, 672L, 226L, 319L, 123L), B01001e2 = c(721L, 1191L, 
487L, 1015L, 461L, 1059L, 485L, 346L, 777L, 857L, 390L, 809L, 
599L, 601L, 371L, 783L, 1215L, 372L, 871L, 0L), B01001m2 = c(173L, 
312L, 181L, 167L, 170L, 286L, 127L, 149L, 279L, 281L, 152L, 179L, 
193L, 250L, 117L, 234L, 263L, 155L, 211L, 123L), B01001e3 = c(21L, 
96L, 70L, 28L, 33L, 90L, 12L, 0L, 168L, 97L, 72L, 10L, 59L, 66L, 
14L, 0L, 35L, 47L, 14L, 0L), B01001m3 = c(25L, 71L, 73L, 26L, 
33L, 79L, 18L, 123L, 114L, 79L, 59L, 15L, 68L, 99L, 22L, 123L, 
31L, 37L, 20L, 123L), B01001e4 = c(30L, 174L, 25L, 91L, 4L, 27L, 
30L, 43L, 102L, 66L, 54L, 85L, 0L, 16L, 0L, 26L, 34L, 27L, 18L, 
0L), B01001m4 = c(26L, 148L, 30L, 62L, 9L, 27L, 25L, 44L, 82L, 
52L, 46L, 48L, 123L, 21L, 123L, 40L, 33L, 32L, 27L, 123L), B01001e5 = c(45L, 
44L, 7L, 46L, 72L, 124L, 45L, 34L, 86L, 97L, 0L, 83L, 0L, 30L, 
0L, 66L, 0L, 23L, 33L, 0L), B01001m5 = c(38L, 35L, 12L, 37L, 
57L, 78L, 36L, 37L, 62L, 97L, 123L, 50L, 123L, 42L, 123L, 59L, 
123L, 31L, 49L, 123L)), .Names = c("B01001e1", "B01001m1", "B01001e2", 
"B01001m2", "B01001e3", "B01001m3", "B01001e4", "B01001m4", "B01001e5", 
"B01001m5"), row.names = c(NA, 20L), class = "data.frame")

Upvotes: 1

Views: 397

Answers (1)

astrofunkswag
astrofunkswag

Reputation: 2698

From your description it sounds like your desired output should have 2 columns, the row sum of the estimate, and the function of the row margins of errors using the logic you describe. Here is one (somewhat roundabout) solution to that problem.

I saved your data as df.

# Isolate estimate and MoE dataframes
df_e <- df[,grepl('e', names(df))]
df_m <- df[,grepl('m', names(df))]

# Temporary matrix used to isolate 0 values for MoE, count number of zero occurances, and convert those MoE values to NA
mat <- df_e == 0
mat <- t(apply(mat, 1, cumsum))
df_m[mat > 1] = NA


# Combine with estimate row sum
output_df <- data.frame(
  e = rowSums(df[,grepl('e', names(df))]),
  m = apply(df_m, 1, function(x) sqrt(sum(x^2, na.rm = T)))
)

head(output_df)
     e        m
1 2508 382.4173
2 3608 637.5061
3 1564 358.5178
4 3584 380.3512
5 1436 320.9595
6 3440 651.4031

Upvotes: 2

Related Questions