Zfunk

Reputation: 1193

Count unique values for every column

I would like to return the count of the unique (distinct) values for every column in a data frame. For example, if I have the table:

 Testdata <- data.frame(var_1 = c("a","a","a"), var_2 = c("b","b","b"), var_3 = c("c","d","e"))

 var_1 | var_2 | var_3
 a     | b     | c 
 a     | b     | d
 a     | b     | e

I would like the output to be:

 Variable | Unique_Values
 var_1    | 1
 var_2    | 1
 var_3    | 3

I have tried playing around with loops using the unique function, e.g.

 for(i in names(Testdata)){
    # count the distinct values in each column with unique()
    print(length(unique(Testdata[[i]])))
 }

However I suspect there is a simpler way.

Upvotes: 33

Views: 53777

Answers (10)

mgrund

Reputation: 1625

With more recent syntax, using dplyr and tidyr:

library(tidyverse)

Testdata <- data.frame(var_1 = c("a","a","a"), var_2 = c("b","b","b"), var_3 = c("c","d","e"))

Testdata %>% 
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything())
#> # A tibble: 3 × 2
#>   name  value
#>   <chr> <int>
#> 1 var_1     1
#> 2 var_2     1
#> 3 var_3     3

Created on 2023-08-02 with reprex v2.0.2

Upvotes: 0

Henrik

Reputation: 67778

collapse::fNdistinct takes a data.frame.

library(collapse)
fNdistinct(Testdata)
# var_1 var_2 var_3 
#     1     1     3
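
(In newer collapse releases the same function is also available under the lower-case name fndistinct(); the exact alias is an assumption to check against your installed version.)

# lower-case spelling in more recent collapse versions (check your version)
fndistinct(Testdata)
# var_1 var_2 var_3 
#     1     1     3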

And it is fast. Data with 10000 rows and 10000 columns:

Testdata = data.frame(replicate(1e4, sample(letters[1:sample(26, 1)], 1e4, replace = TRUE)))
system.time(fNdistinct(Testdata))
# user  system elapsed 
# 0.38    0.00    0.37

Upvotes: 1

zx8754

Reputation: 56149

Using the lengths function:

lengths(lapply(Testdata, unique))

# var_1 var_2 var_3 
#     1     1     3 
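
To get the exact two-column layout from the question, a small follow-up step (a sketch, not part of the original answer) reshapes the named vector:

counts <- lengths(lapply(Testdata, unique))
# turn the named counts into the requested Variable / Unique_Values layout
data.frame(Variable = names(counts), Unique_Values = unname(counts))
#   Variable Unique_Values
# 1    var_1             1
# 2    var_2             1
# 3    var_3             3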

Upvotes: 8

leerssej

Reputation: 14958

In dplyr:

Testdata %>% summarise_all(n_distinct)

🙂

(For those curious about the complete syntax:

In dplyr >0.8.0 using purrr syntax:

Testdata %>% summarise_all(list(~n_distinct(.)))

In dplyr <0.8.0:

Testdata %>% summarise_all(funs(n_distinct(.)))

)

More information on summarizing multiple columns can be found here: https://dplyr.tidyverse.org/reference/summarise_all.html

Upvotes: 33

Vinay

Reputation: 253

library(purrr)
library(reshape2)  # for melt()

Testdata %>% map_dbl(n_distinct)
# var_1 var_2 var_3 
#     1     1     3 

# in your format
Testdata %>% map_dbl(n_distinct) %>% melt(value.name = "unique_counts")
#       unique_counts
# var_1             1
# var_2             1
# var_3             3

Upvotes: 0

sgibb

Reputation: 25736

You could use apply:

apply(Testdata, 2, function(x) length(unique(x)))
# var_1 var_2 var_3 
#     1     1     3
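
One caveat worth knowing: apply() coerces the data frame to a matrix first, so mixed-type columns are all converted to a common type. A vapply() variant (a sketch, not from the original answer) iterates over the columns directly and declares the expected return type:

# no matrix coercion; integer(1) asserts each result is a single integer
vapply(Testdata, function(x) length(unique(x)), integer(1))
# var_1 var_2 var_3 
#     1     1     3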

Upvotes: 38

user6376316

Reputation:

I just tried all of the solutions above; two of them (the aggregate one and the tidyr one) did not work for me. I think using data.table is a good choice:

library(data.table)
setDT(Testdata)[, lapply(.SD, uniqueN), .SDcols = c("var_1", "var_2", "var_3")]
   #    var_1 var_2 var_3
   # 1:     1     1     3

I also benchmarked the approaches against each other:

library(microbenchmark)
Mycomp = microbenchmark(
  apply = apply(Testdata, 2, function(x)length(unique(x))),
  lapply = lapply(Testdata, function(x)length(unique(x))),
  sapply = sapply(Testdata, function(x)length(unique(x))),
  #base = aggregate(values ~ ind, unique(stack(Testdata)), length),
  datatable = setDT(Testdata)[, lapply(.SD, uniqueN), .SDcols=c("var_1","var_2","var_3")],
  times=50
)

#Unit: microseconds
#      expr     min      lq     mean   median      uq     max neval cld
#     apply 163.315 176.678 192.0435 181.7915 192.047 608.859    50  b 
#    lapply 138.217 147.339 157.9684 153.0640 165.829 254.145    50 a  
#    sapply 160.338 169.124 178.1486 174.3965 185.548 203.419    50  b 
# datatable 667.937 684.650 698.1306 696.0160 703.390 874.073    50   c

Upvotes: 1

Megatron

Reputation: 17089

Here, I've used dplyr and tidyr to count (using your Testdata data frame):

library(dplyr)
library(tidyr)

Testdata %>% 
  gather(var, value) %>% 
  distinct() %>% 
  count(var)

# # A tibble: 3 × 2
#     var     n
#   <chr> <int>
# 1 var_1     1
# 2 var_2     1
# 3 var_3     3

Upvotes: 4

Tapajit Dey

Reputation: 1396

This is actually an improvement on the comment by @Ananda Mahto. It didn't fit in a comment, so I decided to add it as an answer.

sapply is actually marginally faster than lapply, and gives the output in a more compact form, just like the output from apply.

A test run result on actual data:

> start <- Sys.time()
> apply(datafile, 2, function(x)length(unique(x)))
          symbol.           date     volume 
             1371            261      53647 
> Sys.time() - start
Time difference of 1.619567 secs
> 
> start <- Sys.time()
> lapply(datafile, function(x)length(unique(x)))
$symbol.
[1] 1371

$date
[1] 261

$volume
[1] 53647

> Sys.time() - start
Time difference of 0.07129478 secs
> 
> start <- Sys.time()
> sapply(datafile, function(x)length(unique(x)))
          symbol.              date             volume 
             1371               261              53647 
> Sys.time() - start
Time difference of 0.06939292 secs

The datafile has around 3.5 million rows.

Quoting the help text:

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
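
A quick check of that equivalence on the question's Testdata (not part of the original answer):

# with simplify = FALSE, sapply() returns the same named list as lapply()
identical(sapply(Testdata, function(x) length(unique(x)), simplify = FALSE),
          lapply(Testdata, function(x) length(unique(x))))
# [1] TRUE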

Upvotes: 6

Matthew Plourde

Reputation: 44614

Here's an alternative:

aggregate(values ~ ind, unique(stack(Testdata)), length)
#     ind values
# 1 var_1      1
# 2 var_2      1
# 3 var_3      3

This requires the columns to be character.
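
If the columns are factors instead (the default in data.frame() before R 4.0), a small adaptation (a sketch, not part of the original answer) converts them first:

# convert factor columns to character before stacking
Testdata2 <- data.frame(lapply(Testdata, as.character), stringsAsFactors = FALSE)
aggregate(values ~ ind, unique(stack(Testdata2)), length)
#     ind values
# 1 var_1      1
# 2 var_2      1
# 3 var_3      3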

Upvotes: 1
