data_science_actuary
data_science_actuary

Reputation: 61

r data.table group by with no aggregate

How do I get a data table in R to just return a column of grouped values where I am applying no other aggregate functions? Say I have:

test<-data.table(x=c(rep("a",2),rep("b",3)),y=1:5)

And I just want to return:

a
b

When I use:

test[,,by=x]

I get back:

   x y
1: a 1
2: a 2
3: b 3
4: b 4
5: b 5

And when I do:

test[,x,by=x]

I get back:

   x x
1: a a
2: b b

I know I can use:

test[,.(unique(x))]

But that doesn't seem like the right way to do it and besides what if I wanted to return two columns grouped?

Upvotes: 6

Views: 2597

Answers (4)

Nuissance Parameter
Nuissance Parameter

Reputation: 21

I wrote an R function to do this which can be found on my Github, but I'll provide it here as well.

https://github.com/seanpili/R_PROC_TRANSPOSE

# This function mimics one of the features of SAS's PROC transpose function:
# allowing the user to do a group_by statement without aggregating the data
# producing a dataframe
#where each row represents one of the  groups that is produced, and the columns 
#represent an observation in one of those groups.
library(dplyr)

transp <- function(input,uniq_var,compare_var,transposed_column_names = 'measurement'){
  if(class(input[,uniq_var]) == "factor"){
    input[uniq_var] = sapply(input[uniq_var],as.character)
  }
  #' input is the dataframe/data.table that you want to perform the operation on, uniq_var is the variable that you are groupying by, compare_var is the variable that is being measured in each of the groups, and transposed_colum_names is just an optional string for the user to call each of their columns (will be concatenated with an observation number, i.e. if you input 'distance', it will name the observations  'distance_1','distance_2','distance_3'...ect.)
  list_df <- input %>% group_by(input[,uniq_var]) %>% do(newcol = t(.[compare_var]))
  # it gets us the aggregates we want, BUT all of our columns are stored in a list 
  # instead of in separate columns.... so we need to create a new dataframe with the dimensions 
  # rows = the number of unique values that we are "grouping" by, noted here by uniq_var and the number of columns will be 
  # the maximum number of observations that are assigned to one of those groups.

  # so first we will create the skeleton of the matrix, and then use a user defined function 
  # to fill it with the correct values 
  new_df <- matrix(rep(NA,(max(count(input,input[,uniq_var])[,2])*dim(list_df)[1])),nrow = dim(list_df)[1])
  new_df <- data.frame(new_df)
  new_df <- cbind(list_df[,1],new_df)
  # i am writing a function inside of a function becuase for loops can take a while 
  # when doing operaitons on multiple columns of a dataframe
  func2 <- function(input,thing = new_df){

    # here, we have a slightly easier case when we have the maximum number of children 
    # assigned to a household.
    # we subtract 1 from the number of columns because the first column holds the value of the 
    # unique value we are looking at, so we don't count it 

    if(length(input[2][[1]])==dim(thing)[2]-1){
      # we set the row corresponding to the specific unique value specified in our list_df of aggregated values
      # equal to the de-aggregated values, so that you have a column for each value like in PROC Transpose. 
      thing[which(thing[,1]==input[1]),2:ncol(thing)]= input[2][[1]]

      #new_df[which(new_df[,1]==input[1]),2:ncol(new_df)]= input[2][,1][[1]][[1]]
    }else{
      thing[which(thing[,1]==input[1]),2:(1+length(input[2][[1]]))]= input[2][[1]]
    }
    # if you're wondering why I have to use so many []'s it's because our list_df has 1 column 
    # of unique identifiers and the other column is actually a column of dataframes
    # each of which only has 1 row and 1 column, and that element is a list of the transposed values 
    # that we want to add to our new dataframe 
    # so essentially the first bracket 

    return(thing[which(thing[,1]==input[1]),])
  }

  quarter_final_output <- apply(list_df,1,func2)
  semi_final_output <- data.frame(matrix(unlist(quarter_final_output),nrow = length(quarter_final_output),byrow = T))
  #return(apply(list_df,1,func2))
  # this essentially names the columns according to the column names that a user would typically specify 
  # in a proc transpose. 
  name_trans <- function(trans_var=transposed_column_names,uniq_var = uniq_var,df){
    #print(trans_var)
    colnames(df)[1] = colnames(input[uniq_var])
    colnames(df)[2:length(colnames(df))] = c(paste0(trans_var,seq(1,(length(colnames(df))-1),1)))
    return(df)

  }
  final_output <- name_trans(transposed_column_names,uniq_var,semi_final_output)
  return(final_output)

}

Upvotes: 2

Josh O&#39;Brien
Josh O&#39;Brien

Reputation: 162341

I'd accomplish this by applying unique() to a data.table containing just the subset of grouping columns in which I was interested. Handing a data.table to unique(), as below, will trigger a call to unique.data.table(), which works just as well for two or more columns as for one:

unique(test[, .(x)]) ## .() is data.table shorthand for list()
#    x
# 1: a
# 2: b

## Add another column to see that unique.data.table() works fine in that case as well 
test[, z:=c(1,1,1,2,2)]
unique(test[, .(x,z)])   
#    x z
# 1: a 1
# 2: b 1
# 3: b 2

Upvotes: 7

DashingQuark
DashingQuark

Reputation: 105

Late to the party but I know what you ask for

No direct answer but here is a workaround.

 test[,x,by=x][,x]  # Suppress one of the x's

   [1] "a" "b"

invisible() should also have worked as mentioned below:

I’m using j for its side effect only, but I’m still getting data returned. How do I stop that? In this case j can be wrapped with invisible(); e.g., DT[,invisible(hist(colB)),by=colA] http://datatable.r-forge.r-project.org/datatable-faq.pdf

Or that would have been a solution too.

 test[,invisible(x),by=x]  # Still prints j, just hides its name!

   x V1
1: a  a
2: b  b

However the following might make you give up the quest happily:

Why is grouping by columns in the key faster than an ad hoc by?

Because each group is contiguous in RAM, thereby minimising page fetches and memory can be copied in bulk (memcpy in C) rather than looping in C. http://datatable.r-forge.r-project.org/datatable-faq.pdf

Upvotes: 0

ctaggart
ctaggart

Reputation: 166

Agreeing with Josh that unique() is the right choice, but maybe consider this approach:

> unique(test$x) 
[1] "a" "b"

Also, if you want rows:

> rbind(unique(test$x))
     [,1] [,2]
[1,] "a"  "b" 

Or columns:

> cbind(unique(test$x))
     [,1]
[1,] "a" 
[2,] "b" 

Upvotes: 0

Related Questions