I have multiple distinct data sets with different column names, but the functions to be applied to them are similar. To reduce code duplication, I thought I could create another data set that holds the column names and the function to be applied to each of them:
### The raw data set
df1 <- tibble(A=c(NA, 1, 2, 3), B = c(1,2,1,NA),
C = c(NA,NA,NA,2), D = c(2,3,NA,1), E = c(NA,NA,NA,1))
# A tibble: 4 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA     1    NA     2    NA
2     1     2    NA     3    NA
3     2     1    NA    NA    NA
4     3    NA     2     1     1
### The dataframe containing functions
funcDf <- tibble(colNames = names(df1), type = c(rep("Compulsory", 4), "Conditional"))
funcDf$func <- c("is.na()", "is.na()", "is.na()", "is.na()",
"ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))")
# A tibble: 5 x 3
  colNames type        func
  <chr>    <chr>       <chr>
1 A        Compulsory  is.na()
2 B        Compulsory  is.na()
3 C        Compulsory  is.na()
4 D        Compulsory  is.na()
5 E        Conditional ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1,~
I am able to get a simple sum running, like so:
df1 %>% summarise_at(.vars = funcDf$colNames, .funs = list(~sum(., na.rm = T)))
But I am not able to apply the functions I have recorded in the dataframe against the corresponding variable.
Any guidance, please :)
Edit
I expect to have the following output as a result of applying the function:
# A tibble: 1 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     3     1     2
@YinYan, thanks so much for indulging me. Following up on my comment: what if I need the following output (with grouping, as you can see in my code)?
df1 %>% group_by(A, B) %>% summarise_all(.funs = list(~sum(., na.rm = T)))
# A tibble: 4 x 5
# Groups:   A [4]
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     0     3     0
2     2     1     0     0     0
3     3    NA     2     1     1
4    NA     1     0     2     0
I modified the func column so that it now holds functions instead of strings. Because the function for column E always references df1, I wrapped its body in with().
funcDf$func <- c(
  function(x) is.na(x),
  function(x) is.na(x),
  function(x) is.na(x),
  function(x) is.na(x),
  function(x) with(data = df1, data.frame(E = ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))))
)
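library(dplyr)  # pull(), summarise_all()
library(purrr)  # map_dfc()

# For each column, look up its function in funcDf and apply it to that column of df1
result <- map_dfc(funcDf$colNames, function(colName) {
  colFunc <- dplyr::pull(funcDf[funcDf$colNames == colName, "func"])[[1]]
  data.frame(colFunc(df1[, colName]))
})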
> result
      A     B     C     D E
1  TRUE FALSE  TRUE FALSE 0
2 FALSE FALSE  TRUE FALSE 0
3 FALSE FALSE  TRUE  TRUE 0
4 FALSE  TRUE FALSE FALSE 1
To get the final result:
> summarise_all(result,sum)
  A B C D E
1 1 1 3 1 1
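The same idea can also be written without the lookup into funcDf. This is only a sketch, not the code above: `rules` is a hypothetical named list (one function per column, each taking the whole data frame), and df1 is rebuilt here as a plain data.frame so the block runs on its own in base R.

```r
# Same data as df1 in the question, as a base R data.frame (assumption: no tibble needed)
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA),
                  C = c(NA, NA, NA, 2), D = c(2, 3, NA, 1),
                  E = c(NA, NA, NA, 1))

# Hypothetical rule list: names are the column names, values are the checks.
# Each rule receives the full data frame, so the rule for E can look at D too.
rules <- list(
  A = function(df) is.na(df$A),
  B = function(df) is.na(df$B),
  C = function(df) is.na(df$C),
  D = function(df) is.na(df$D),
  E = function(df) with(df, ifelse(!is.na(D) & is.na(E), 0,
                                   ifelse(!is.na(D) & !is.na(E), 1, 0)))
)

# Apply every rule to df1 and sum its result: one number per column
totals <- vapply(rules, function(rule) as.numeric(sum(rule(df1))), numeric(1))
print(totals)
```

Passing the whole data frame to every rule means the rule for E no longer has to hard-code df1, at the cost of each rule naming its own column.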
For the grouped case, I had to modify the function column again, because this time the function for column E depends on a different data frame for each group. First use group_split() to split the original data frame into a list of data frames. You can then use a for loop or a map function to iterate over that list. I personally prefer the map functions because the code is more concise.
funcDf$func <- c(
  function(x,...) is.na(x),
  function(x,...) is.na(x),
  function(x,...) is.na(x),
  function(x,...) is.na(x),
  function(x,df) with(data = df, data.frame(E = ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))))
)
df_list <- df1 %>% group_by(A, B) %>% group_split()
map_dfr(df_list, function(parent_df) {
  map_dfc(funcDf$colNames, function(colName) {
    colFunc <- dplyr::pull(funcDf[funcDf$colNames == colName, "func"])[[1]]
    data.frame(colFunc(parent_df[, colName], df = parent_df))
  }) %>%
    summarise_all(sum)
})
  A B C D E
1 0 0 1 0 0
2 0 0 1 1 0
3 0 1 0 0 1
4 1 0 1 0 0
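If you prefer the for loop mentioned earlier over map, it could look roughly like this. Treat it as a sketch under a few assumptions: `funcs` is a hypothetical named list standing in for funcDf$func, and df1 is rebuilt as a plain data.frame so the block runs on its own with only dplyr loaded.

```r
library(dplyr)  # group_by(), group_split(), %>%

# Same data as df1 in the question
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA),
                  C = c(NA, NA, NA, 2), D = c(2, 3, NA, 1),
                  E = c(NA, NA, NA, 1))

# Hypothetical stand-in for funcDf$func: one function per column.
# Every function accepts the column values plus the data frame of the current group.
funcs <- list(
  A = function(x, ...) is.na(x),
  B = function(x, ...) is.na(x),
  C = function(x, ...) is.na(x),
  D = function(x, ...) is.na(x),
  E = function(x, df) with(df, ifelse(!is.na(D) & is.na(E), 0,
                                      ifelse(!is.na(D) & !is.na(E), 1, 0)))
)

# One data frame per (A, B) group
df_list <- df1 %>% group_by(A, B) %>% group_split()

# Pre-allocate one row per group and one column per function
result <- matrix(0, nrow = length(df_list), ncol = length(funcs),
                 dimnames = list(NULL, names(funcs)))

for (i in seq_along(df_list)) {
  parent_df <- df_list[[i]]
  for (col in names(funcs)) {
    # Apply the column's function within this group and sum the result
    result[i, col] <- sum(funcs[[col]](parent_df[[col]], df = parent_df))
  }
}

result <- as.data.frame(result)
print(result)
```

The loop fills the same four rows as the map version; it is longer, but each step can be inspected while debugging.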