Xiwen Chen

Reputation: 1

Columns unknown when applying a function to a data frame in R

To simplify, say, I have a dataset like this:

num = c(1,2,3,"NA",3,4,1,2,1)
char = c('a','b','s','s','s','s','a','s','s')
t = as.data.frame(cbind(num,char))   

and I wrote a function to find the top 5 most frequent values of each column:

 func_top5 = function(x){t%>%
    filter(!is.na(x))%>%
    group_by(x)%>%
    summarise(number_of_same_value = n())%>%
    arrange(desc(number_of_same_value))%>%
    slice(1:5)}

when I tried to apply this function to the df,

apply(t,2,func_top5)

it returned the error:

Error in grouped_df_impl(data, unname(vars), drop) : Column x is unknown

But when I run the same pipeline directly on a single column, it works fine:

t%>%
  filter(!is.na(num))%>%
  group_by(num)%>%
  summarise(number_of_same_value = n())%>%
  arrange(desc(number_of_same_value))%>%
  slice(1:5)

# A tibble: 5 x 2
     num number_of_same_value
  <fctr>                <int>
1      1                    3
2      2                    2
3      3                    2
4      4                    1
5     NA                    1

I think the problem might be the "group_by" function.

Can anyone help me with this?

Upvotes: 0

Views: 852

Answers (1)

akrun

Reputation: 886948

We can solve this with quosures. Assuming that the input argument 'x' is passed unquoted, we can convert it to a quosure with enquo and then evaluate it inside filter and group_by with the bang-bang operator (!!). It is also better to take the dataset as an input argument, so that the function can be reused in a more general way. It is not clear whether the missing values are the quoted string "NA" or true NA values, so the function here drops the strings "NA" and "". If they are true NAs, the more appropriate check is is.na.

func_top5 <- function(df, x){
  x <- enquo(x)
  df %>%
    filter(!((!!x) %in% c("NA", ""))) %>%
    group_by(!!x) %>%
    summarise(number_of_same_value = n()) %>%
    arrange(desc(number_of_same_value)) %>%
    slice(1:5)
}

We call it by

func_top5(df1, col1)
# A tibble: 2 x 2
#   col1  number_of_same_value
#   <chr>                <int>
#1 b                        3
#2 a                        2
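
For the true NA case mentioned above, a minimal sketch could look like the following. The function name func_top5_na and the data frame df2 are made up for illustration; the only change from func_top5 is that filter uses is.na instead of matching the strings "NA" and "".

```r
library(dplyr)

# Hypothetical variant of func_top5 for columns that contain real NA values
func_top5_na <- function(df, x) {
  x <- enquo(x)                      # capture the unquoted column name
  df %>%
    filter(!is.na(!!x)) %>%          # drop true missing values
    group_by(!!x) %>%
    summarise(number_of_same_value = n()) %>%
    arrange(desc(number_of_same_value)) %>%
    slice(1:5)
}

# Illustrative data with a real NA (not the string "NA")
df2 <- data.frame(col1 = c("a", "b", NA, "a", "b", "b"),
                  stringsAsFactors = FALSE)

res <- func_top5_na(df2, col1)
print(res)
```

Here the NA row is removed before counting, so only the groups "b" and "a" remain.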

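One option to do this on multiple columns would be to loop over the column names with purrr's map, converting each name to a symbol with rlang::sym and unquoting it in the call

map(names(t1), ~ func_top5(t1, !! rlang::sym(.x)))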
#[[1]]
# A tibble: 5 x 2
#    num number_of_same_value
#  <dbl>                <int>
#1  1.00                    3
#2  2.00                    2
#3  3.00                    2
#4  4.00                    1
#5 NA                       1

#[[2]]
# A tibble: 3 x 2
#  char  number_of_same_value
#  <chr>                <int>
#1 s                        6
#2 a                        2
#3 b                        1
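
As an alternative sketch (not the approach used above), newer versions of rlang (0.4.0 and later) provide the embrace operator {{ }}, which combines enquo and !! in one step, and dplyr's count can replace the group_by/summarise/arrange chain. The names top5_embrace and t2 are hypothetical, and t2 is assumed to be the question's data with a true NA in the num column.

```r
library(dplyr)
library(purrr)

# Hypothetical rewrite using the embrace operator instead of enquo() + !!
top5_embrace <- function(df, x) {
  df %>%
    filter(!is.na({{ x }})) %>%                            # drop true NA values
    count({{ x }}, name = "number_of_same_value", sort = TRUE) %>%
    slice(1:5)                                             # keep at most 5 rows
}

# Assumed version of the question's data, with num numeric and a real NA
t2 <- data.frame(
  num  = c(1, 2, 3, NA, 3, 4, 1, 2, 1),
  char = c("a", "b", "s", "s", "s", "s", "a", "s", "s"),
  stringsAsFactors = FALSE
)

# Apply to every column by turning each column name into a symbol
res <- map(names(t2), ~ top5_embrace(t2, !!rlang::sym(.x)))
print(res)
```

Because the NA is filtered out first, the num result has one row fewer than the output shown above.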

data

df1 <- data.frame(col1 = c("a", "b", "NA", "", "a", "b", "b"), 
      col2 = rnorm(7), stringsAsFactors = FALSE)

Upvotes: 1
