Reputation: 2716

How to count the number of unique values in each column in R, efficiently

My goal is to find how many unique values each column in my data frame has. Here is what I came up with:

    # df is a data frame with 32 named columns and millions of rows
    test1 <- sapply(df, function(x) length(unique(x)))
    # the command above returns a named integer vector

    test2 <- data.frame(names(test1), test1)
    # now I have a data frame, but it still has row names

    row.names(test2) <- NULL
    # this gets rid of the row names

    test3 <- test2[order(test1), ]
    # finally I get what I want

My question is: how can I do this in fewer steps?

Upvotes: 1

Answers (3)

Federico Rotolo

Reputation: 57

Does this do what you want?

    test1 <- sort(sapply(df, function(x) length(unique(x))), decreasing = TRUE)
    data.frame(names(test1), test1, row.names = NULL)
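The same idea can be wrapped in a small helper so the whole thing is one call. This is only a sketch: `count_unique` and the output column names `column` and `n_unique` are made-up names, not something from the question, and `mtcars` stands in for the real data. It sorts in ascending order by default, as in the question. Note that `length(unique(x))` counts `NA` as one distinct value.

```r
# Hypothetical helper (not from the original post): count the distinct values
# in every column and return the counts as a sorted two-column data frame.
count_unique <- function(df, decreasing = FALSE) {
  # vapply() guarantees exactly one integer per column
  counts <- vapply(df, function(x) length(unique(x)), integer(1))
  # sort() keeps the column names attached to the counts
  counts <- sort(counts, decreasing = decreasing)
  data.frame(column = names(counts),
             n_unique = unname(counts),
             row.names = NULL)
}

res <- count_unique(mtcars)   # mtcars used as stand-in data
print(res)
```

Passing `decreasing = TRUE` gives the largest counts first instead.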

Upvotes: 1

giac

Reputation: 4299

I am not sure if this is what you want. Please provide a sample of your dataset (with dput).

Suppose you want to count the number of unique values in each column of the mtcars data.

    library(tidyr)
    library(dplyr)

    mtcars %>%
      gather() %>%
      group_by(key) %>%
      summarise(ndist = n_distinct(value)) %>%
      arrange(desc(ndist))

This will give you:

        key ndist
    1  qsec    30
    2    wt    29
    3  disp    27
    4   mpg    25
    5    hp    22
    6  drat    22
    7  carb     6
    8   cyl     3
    9  gear     3
    10   vs     2
    11   am     2
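A possible variant, assuming dplyr 1.0.0 or later (for `across()`) and tidyr 1.0.0 or later (for `pivot_longer()`): count first, then reshape. Because the reshaping happens after the counts are computed, only integers are stacked, so columns of different types should not need to be coerced to a common type the way they are when the raw values are gathered. The name `ndist_tbl` is just for illustration.

```r
library(dplyr)
library(tidyr)

ndist_tbl <- mtcars %>%
  # one row holding the distinct count of every column
  summarise(across(everything(), n_distinct)) %>%
  # reshape that single row into one (key, ndist) pair per column
  pivot_longer(everything(), names_to = "key", values_to = "ndist") %>%
  arrange(desc(ndist))

print(ndist_tbl)
```

On a data frame with millions of rows this also avoids building the long intermediate table that reshaping the raw data creates.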

Upvotes: 3

LyzandeR

Reputation: 37879

One call in base R:

    # using the same column names as in your example
    test1 <- data.frame(names.test1 = colnames(mtcars),
                        test1 = sapply(mtcars, function(x) length(unique(x))),
                        row.names = NULL)

Output:

    > test1
       names.test1 test1
    1          mpg    25
    2          cyl     3
    3         disp    27
    4           hp    22
    5         drat    22
    6           wt    29
    7         qsec    30
    8           vs     2
    9           am     2
    10        gear     3
    11        carb     6

This still requires ordering the result manually, though, as @BenBolker mentions in the comments:

    test1 <- test1[order(test1$test1), ]

However, you could do an ordered one-liner with data.table:

    library(data.table)
    test1 <- data.table(names.test1 = colnames(mtcars),
                        test1 = sapply(mtcars, function(x) length(unique(x))),
                        key = 'test1')

    > test1
        names.test1 test1
     1:          vs     2
     2:          am     2
     3:         cyl     3
     4:        gear     3
     5:        carb     6
     6:          hp    22
     7:        drat    22
     8:         mpg    25
     9:        disp    27
    10:          wt    29
    11:        qsec    30
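Since the question mentions millions of rows, it may also be worth trying data.table's own `uniqueN()` in place of `length(unique(x))`. The following is a sketch only, assuming a data.table version that provides `uniqueN()`. The names `dt`, `counts`, `res`, `column` and `n_unique` are illustrative, and whether this is actually faster on your data is something to measure rather than take for granted.

```r
library(data.table)

dt <- as.data.table(mtcars)               # mtcars used as stand-in data

# uniqueN(x) returns the number of distinct values in x as an integer
counts <- vapply(dt, uniqueN, integer(1))

res <- data.table(column = names(counts), n_unique = counts)
setorder(res, n_unique)                   # sort by reference, ascending
print(res)
```

Use `setorder(res, -n_unique)` for descending order.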

Upvotes: 4
