Reputation: 575
I have a dataframe in R in which values correspond to value estimates and their margin of error (MoE).
Column names consist of a pattern, an indicator character (e = estimate, m = margin of error) and an ID that matches estimate and margin of error.
So, the column names look like "XXXe1, XXXm1, XXXe2, XXXm2, ...".
I am trying to create a function to (for each row)
Calculate the sum of the estimates. (That is pretty straightforward.)
Calculate the aggregated margin of error. This is the square root of the sum of the squares of each MoE.
Condition: the MoE of estimates marked as 0 should only be added once.
Examples:
sqrt(123^2)
. So far, I have done the following to build a function that does this:
estimate_aggregator <- function(DF_to_write_on, New_column_name, source_df, pattern){
subset_df <- source_df[, grepl(pattern, names(source_df))] # I subset all the columns named with the pattern, regardless of whether they are estimate or margin of error
subset_df_e <- source_df[, grepl(paste0(pattern, "e"), names(source_df))] # I create a table with only the estimated values to perform the sum
DF_to_write_on[paste0(New_column_name, "_e")]<- rowSums(subset_df_e) # I write a new column in the new DF with the rowSums of the estimates values, having calculated the new estimate
return(DF)
}
What I am missing: a way to write in the new dataframe the result of selecting the XXXmYY values of those columns that have no 0 value in their corresponding estimate. If there is one or more 0 in the estimates, then I should include the MoE 123 in the calculation only once.
What would be the cleanest way to achieve this? I see that my struggle is on dealing with several columns at once and the fact that the values on the XXXeYY columns determine the selection of the XXXmYY ones.
row1: DF_to_write_on[paste0(New_column_name,"_m") <- sqrt(176^2 + 117^2+22^2 + 123^2)
row2: DF_to_write_on[paste0(New_column_name,"_m") <- sqrt(123^2)
B01001e1 B01001m1 B01001e2 B01001m2 B01001e3 B01001m3 B01001e4 B01001m4 B01001e5 B01001m5
15 566 176 371 117 14 22 0 123 0 123
20 0 123 0 123 0 123 0 123 0 123
structure(list(B01001e1 = c(1691L, 2103L, 975L, 2404L, 866L,
2140L, 965L, 727L, 1602L, 1741L, 948L, 1771L, 1195L, 1072L, 566L,
1521L, 2950L, 770L, 1624L, 0L), B01001m1 = c(337L, 530L, 299L,
333L, 264L, 574L, 227L, 266L, 528L, 498L, 320L, 414L, 350L, 385L,
176L, 418L, 672L, 226L, 319L, 123L), B01001e2 = c(721L, 1191L,
487L, 1015L, 461L, 1059L, 485L, 346L, 777L, 857L, 390L, 809L,
599L, 601L, 371L, 783L, 1215L, 372L, 871L, 0L), B01001m2 = c(173L,
312L, 181L, 167L, 170L, 286L, 127L, 149L, 279L, 281L, 152L, 179L,
193L, 250L, 117L, 234L, 263L, 155L, 211L, 123L), B01001e3 = c(21L,
96L, 70L, 28L, 33L, 90L, 12L, 0L, 168L, 97L, 72L, 10L, 59L, 66L,
14L, 0L, 35L, 47L, 14L, 0L), B01001m3 = c(25L, 71L, 73L, 26L,
33L, 79L, 18L, 123L, 114L, 79L, 59L, 15L, 68L, 99L, 22L, 123L,
31L, 37L, 20L, 123L), B01001e4 = c(30L, 174L, 25L, 91L, 4L, 27L,
30L, 43L, 102L, 66L, 54L, 85L, 0L, 16L, 0L, 26L, 34L, 27L, 18L,
0L), B01001m4 = c(26L, 148L, 30L, 62L, 9L, 27L, 25L, 44L, 82L,
52L, 46L, 48L, 123L, 21L, 123L, 40L, 33L, 32L, 27L, 123L), B01001e5 = c(45L,
44L, 7L, 46L, 72L, 124L, 45L, 34L, 86L, 97L, 0L, 83L, 0L, 30L,
0L, 66L, 0L, 23L, 33L, 0L), B01001m5 = c(38L, 35L, 12L, 37L,
57L, 78L, 36L, 37L, 62L, 97L, 123L, 50L, 123L, 42L, 123L, 59L,
123L, 31L, 49L, 123L)), .Names = c("B01001e1", "B01001m1", "B01001e2",
"B01001m2", "B01001e3", "B01001m3", "B01001e4", "B01001m4", "B01001e5",
"B01001m5"), row.names = c(NA, 20L), class = "data.frame")
Upvotes: 1
Views: 397
Reputation: 2698
From your description it sounds like your desired output should have 2 columns, the row sum of the estimate, and the function of the row margins of errors using the logic you describe. Here is one (somewhat roundabout) solution to that problem.
I saved your data as df
.
# Isolate estimate and MoE dataframes
df_e <- df[,grepl('e', names(df))]
df_m <- df[,grepl('m', names(df))]
# Temporary matrix used to isolate 0 values for MoE, count number of zero occurances, and convert those MoE values to NA
mat <- df_e == 0
mat <- t(apply(mat, 1, cumsum))
df_m[mat > 1] = NA
# Combine with estimate row sum
output_df <- data.frame(
e = rowSums(df[,grepl('e', names(df))]),
m = apply(df_m, 1, function(x) sqrt(sum(x^2, na.rm = T)))
)
head(output_df)
e m
1 2508 382.4173
2 3608 637.5061
3 1564 358.5178
4 3584 380.3512
5 1436 320.9595
6 3440 651.4031
Upvotes: 2