Reputation: 1193
I would like to return the count of the unique (distinct) values for every column in a data frame. For example, if I have the table:
Testdata <- data.frame(var_1 = c("a","a","a"), var_2 = c("b","b","b"), var_3 = c("c","d","e"))
var_1 | var_2 | var_3
a | b | c
a | b | d
a | b | e
I would like the output to be:
Variable | Unique_Values
var_1 | 1
var_2 | 1
var_3 | 3
I have tried playing around with loops using the unique function, e.g.
for(i in names(Testdata)){
# Code using unique function
}
However I suspect there is a simpler way.
Upvotes: 33
Views: 53777
Reputation: 1625
With more recent syntax using dplyr
:
library(tidyverse)
Testdata <- data.frame(var_1 = c("a","a","a"), var_2 = c("b","b","b"), var_3 = c("c","d","e"))
Testdata %>%
summarise(across(everything(), n_distinct)) %>%
pivot_longer(everything())
#> # A tibble: 3 × 2
#> name value
#> <chr> <int>
#> 1 var_1 1
#> 2 var_2 1
#> 3 var_3 3
Created on 2023-08-02 with reprex v2.0.2
Upvotes: 0
Reputation: 67778
collapse::fNdistinct
takes a data.frame
.
library(collapse)
fNdistinct(Testdata)
# var_1 var_2 var_3
# 1 1 3
And it is fast. Data with 10000 rows and 10000 columns:
Testdata = data.frame(replicate(1e4, sample(letters[1:sample(26, 1)], 1e4, replace = TRUE)))
system.time(fNdistinct(Testdata))
# user system elapsed
# 0.38 0.00 0.37
Upvotes: 1
Reputation: 56149
Using the lengths
function:
lengths(lapply(Testdata, unique))
# var_1 var_2 var_3
# 1 1 3
Upvotes: 8
Reputation: 14958
In dplyr
:
Testdata %>% summarise_all(n_distinct)
🙂
( For those curious about the complete syntax.
In dplyr >0.8.0
using purrr
syntax:
Testdata %>% summarise_all(list(~n_distinct(.)))
In dplyr <0.8.0
:
Testdata %>% summarise_all(funs(n_distinct(.)))
)
For more information on summarizing multiple columns found here: https://dplyr.tidyverse.org/reference/summarise_all.html
Upvotes: 33
Reputation: 253
library(purrr)
Testdata %>% map_dbl(n_distinct)
var_1 var_2 var_3
1 1 3
# in your format
Testdata %>% map_dbl(n_distinct)%>%melt(value.name = "unique_counts")
unique_counts
var_1 1
var_2 1
var_3 3
Upvotes: 0
Reputation: 25736
You could use apply
:
apply(Testdata, 2, function(x) length(unique(x)))
# var_1 var_2 var_3
# 1 1 3
Upvotes: 38
Reputation:
I just tried all solution and two of the above solutions did not work one with aggregate and the tidyr ones but two of them using did not work. I think using a data table is a good choice ,
setDT(Testdata)[, lapply(.SD, uniqueN), .SDcols=c("var_1","var_2","var_3")]
# var_1 var_2 var_3
# 1: 1 1 3
I tried to compare them from each other
library(microbenchmark)
Mycomp = microbenchmark(
apply = apply(Testdata, 2, function(x)length(unique(x))),
lapply = lapply(Testdata, function(x)length(unique(x))),
sapply = sapply(Testdata, function(x)length(unique(x))),
#base = aggregate(values ~ ind, unique(stack(Testdata)), length),
datatable = setDT(Testdata)[, lapply(.SD, uniqueN), .SDcols=c("var_1","var_2","var_3")],
times=50
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# apply 163.315 176.678 192.0435 181.7915 192.047 608.859 50 b
# lapply 138.217 147.339 157.9684 153.0640 165.829 254.145 50 a
# sapply 160.338 169.124 178.1486 174.3965 185.548 203.419 50 b
# datatable 667.937 684.650 698.1306 696.0160 703.390 874.073 50 c
Upvotes: 1
Reputation: 17089
Here, I've used dplyr
and tidyr
to count (using your Testdata
data frame):
Testdata %>%
gather(var, value) %>%
distinct() %>%
count(var)
# # A tibble: 3 × 2
# var n
# <chr> <int>
# 1 var_1 1
# 2 var_2 1
# 3 var_3 3
Upvotes: 4
Reputation: 1396
This is actually an improvement on the comment by @Ananda Mahto. It didn't fit in the comment so I decided to add as an answer.
sapply
is actually marginally faster than lapply
, and gives the output in a more compact form, just like the output from apply
.
A test run result on actual data:
> start <- Sys.time()
> apply(datafile, 2, function(x)length(unique(x)))
symbol. date volume
1371 261 53647
> Sys.time() - start
Time difference of 1.619567 secs
>
> start <- Sys.time()
> lapply(datafile, function(x)length(unique(x)))
$symbol.
[1] 1371
$date
[1] 261
$volume
[1] 53647
> Sys.time() - start
Time difference of 0.07129478 secs
>
> start <- Sys.time()
> sapply(datafile, function(x)length(unique(x)))
symbol. date volume
1371 261 53647
> Sys.time() - start
Time difference of 0.06939292 secs
The datafile
has around 3.5 million rows.
Quoting the help text:
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
Upvotes: 6
Reputation: 44614
Here's an alternative:
aggregate(values ~ ind, unique(stack(Testdata)), length)
# ind values
# 1 var_1 1
# 2 var_2 1
# 3 var_3 3
This requires the columns be character
.
Upvotes: 1