marbel
marbel

Reputation: 7714

Aggregate using data.table

I'm looking for a simpler way to aggregate and calculate percentages of a numerical variable using data.table. The following code outputs the desired result, my question is if there is a better way to get the same result. I'm not really familiarized with the package, so any tips would be useful.

I'd like to have the following columns:

   second_factor_variable third_factor_variable factor_variable       porc porcentaje
1:                   HIGH                     C           > 200 0.04456544        4 %
2:                    LOW                     A        51 - 100 0.31739130       32 %
3:                    LOW                     A       101 - 200 0.68260870       68 %
4:                    LOW                     A         26 - 50 0.00000000        0 %

Where porc is the numerical percentage and porcentage would be the percentage rounded to be used as a label in a ggplot call.

library("ggplot2")
library("scales")
library("data.table")

### Generate some data
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)

df$factor_variable <- cut(df$x, right = TRUE, 
                          breaks = c(0, 25, 50, 100, 200, 100000),
                          labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")
                          )

df$second_factor_variable <- cut(df$x, right = TRUE, 
                                 breaks = c(0, 100, 100000),
                                 labels = c("LOW", "HIGH")
                                 )

df$third_factor_variable <- cut(df$x, right = TRUE, 
                                 breaks = c(0, 50, 100, 100000),
                                 labels = c("A", "B","C")
                                )

str(df)

### Aggregate
DT <- data.table(df)
dt = DT[, list(factor_variable = unique(DT$factor_variable),
              porc = as.numeric(table(factor_variable)/length(factor_variable)),
              porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
              ), by="second_factor_variable,third_factor_variable"]

EDIT

I've tried agstudy's solution grouping by with just one variable, and I believe it didn't work for producing the labels (porcentaje column). In the real dataset, I ended up having a similar issue and I can't spot whats wrong about this function.

grp <- function(factor_variable) {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = factor_variable[1],
       porc =porc,
       porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}

DT[, grp(factor_variable) , by="second_factor_variable"]

The numerical values are correct

DT2 <- DT[DT$second_factor_variable %in% "LOW"]
table(DT2$factor_variable)/length(DT2$factor_variable)

I believe the same problems appears if i group by with 2 factor variables:

DT[, grp(factor_variable) , by="second_factor_variable,third_factor_variable"]

Upvotes: 3

Views: 1038

Answers (2)

marbel
marbel

Reputation: 7714

In the real dataset for some reason, the previous function wasn't working well. The values and labels weren't right.

I've just changed a tiny bit and it worked. I add it as a separate answer so it's easier to find later.

This keeps the order the factor levels as they were created. The previous way keept them as they were ordered in the DT, for the particular data i was working with this wasn't working well. The labels work now, I mean the porcentaje column.

factor_variable = levels(factor_variable) 

grp2 <- function(factor_variable) {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = levels(factor_variable), 
       porc = porc,
       porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
       )
}


DT[, grp2(factor_variable) , by="second_factor_variable"]

Upvotes: 0

agstudy
agstudy

Reputation: 121568

2 changes : factorize porc variable and don't use DT to compute factor_variable

DT[, {   porc = as.numeric(table(factor_variable)/length(factor_variable))
         list(factor_variable = factor_variable[1],
               porc =porc,
               porcentaje = paste( round( porc, 0 ) * 100 , "%"))
        }
, by="second_factor_variable,third_factor_variable"]

Upvotes: 4

Related Questions