Chris

Reputation: 2071

Is it possible to do your own efficient descriptive statistics function? - R

Usually, I find myself using a few summary functions or writing my own computations to get some additional information from the data at the start of an analysis. For example, I wanted to see the count and percentage of each value per variable, given a limit on the number of distinct values:

library(magrittr)  # for %>% and %T>%

table_transposed <- function(vector){
    # Count table and rounded proportion table for one variable,
    # merged by value and then transposed to one row per statistic
    merge(as.data.frame(table(vector, dnn = "values")),
          as.data.frame(round(prop.table(table(vector, dnn = "values")), 2)),
          by = "values",
          all.x = TRUE) %>%
        data.table::transpose(keep.names = "values",
                              make.names = names(.)[1]) %T>%
        {.[, c("values")] <- c("Count", "Percentage")}
}

table_transposed_filter <- function(dataframe, max_number_categories) {
    # Keep only the columns with at most max_number_categories distinct
    # values, then apply table_transposed to each of them
    (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
        as.vector() %>%
        {dataframe[, .]} %>%
        lapply(table_transposed)
}

So you pass the data frame and the maximum number of distinct values allowed per variable:

table_transposed_filter(mtcars, 10)

However, it's SUPER slow (maybe because I used merge() instead of left_join() from dplyr). Now I'm trying to figure out an efficient, fast, and simple way to combine psych::describe(), Hmisc::describe(), other packages, and my own computations, for numeric and categorical variables (one descriptive function for each). Something like this for numeric variables:

| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...

If I build this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
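For illustration, here is a rough sketch of what I mean for the numeric case, built with sapply(). The name describe_numeric, the chosen subset of columns, and the simple moment-based skew formula are just placeholders of mine, not taken from psych or Hmisc:

# Rough sketch: one row per numeric variable, built with sapply()
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]   # keep numeric columns only
  stats <- sapply(num, function(x) {
    x <- x[!is.na(x)]
    c(mean     = mean(x),
      variance = var(x),
      skew     = mean((x - mean(x))^3) / sd(x)^3,  # simple moment-based skew
      p25      = unname(quantile(x, 0.25)),
      median   = median(x),
      p75      = unname(quantile(x, 0.75)))
  })
  out <- data.frame(Variable = colnames(stats),
                    dtype    = vapply(num, function(x) class(x)[1], character(1)),
                    t(stats))
  rownames(out) <- NULL
  out
}

describe_numeric(mtcars)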

PS: I thought about posting this question on Meta Stack Exchange or Cross Validated, but neither of them seems to fit it.

Upvotes: 0

Views: 152

Answers (1)

Gregor Thomas

Reputation: 146030

Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little bit on larger data.

This makes sense, as the most expensive operation you do is table() - your version calls it twice, my version calls it once. I didn't profile the code, but my guess is that table() is the bottleneck by more than an order of magnitude on any sizeable data, so it's a waste to try to optimize the other parts of the code.

t_transp = function(x, digits = 2) {
  tab = table(x)              # counts: table() is called only once
  prop_tab = prop.table(tab)  # proportions reuse the same table
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}

t_transp_filter = function(data, n_max, ...) {
  # keep columns with at most n_max distinct values, then summarize each
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}

Benchmarking:

microbenchmark::microbenchmark(
  gregor = t_transp_filter(mtcars, n_max = 4),
  OP = table_transposed_filter(mtcars, 4),
  times = 20
)
# Unit: milliseconds
#    expr    min     lq     mean  median      uq    max neval cld
#  gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394    20  a 
#      OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048    20   b

set.seed(47)
df = as.data.frame(matrix(
  c(sample(letters[1:5], size = 1e5 * 20, replace = T))
  , ncol = 20))

microbenchmark::microbenchmark(
  gregor = t_transp_filter(df, n_max = 5),
  OP = table_transposed_filter(df, 5),
  times = 20
)
# Unit: milliseconds
#    expr      min        lq     mean    median       uq      max neval cld
#  gregor  59.5466  59.95545  63.6825  61.14075  67.2167  75.4270    20  a 
#      OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651    20   b
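
As a rough check on the guess about table() being the bottleneck, one could profile the original function on the df built above with base R's Rprof (a sketch only; the loop count of 20 is arbitrary, just enough to collect samples):

# Sketch: profile the original function and summarize where time is spent
prof_file <- tempfile()
Rprof(prof_file)
for (i in 1:20) table_transposed_filter(df, 5)
Rprof(NULL)
head(summaryRprof(prof_file)$by.total)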

Upvotes: 1
