Reputation: 4309
My dataset looks like this
library(dyplr)
dta = rbind(c(1,'F', 0),
c(1,'F', 0),
c(1,'F', 0),
c(2,'F', 1),
c(2,'F', 1),
c(3,'F', 1),
c(3,'F', 1),
c(3,'F', 1),
c(4,'M', 1),
c(4,'M', 1),
c(5,'M', 1),
c(6,'M', 0)
)
colnames(dta) <- c('id', 'sex', 'child')
dta = as.data.frame(dta)
So the data are long-format with id as personal identifier.
My problem is when I try to count the sex for example, I dont have the right count because of the repetition of the id.
So there is 3 females and 3 males.
but when I count I have
dta %>%
group_by(sex) %>%
summarise(n())
8 and 4 - because it counted the rows and not the unique id
Same problem with the cross table
dta %>%
group_by(sex, child) %>%
summarise(n())
How do I indicate the unique identifier (n_distinct
) in the count ?
Upvotes: 3
Views: 8301
Reputation: 13580
Base package:
aggregate(id ~ sex, dta, function(x) length(unique(x)))
Output:
sex id
1 F 3
2 M 3
Another alternative with dplyr
:
library(dplyr)
count_(unique(dta), vars = "sex")
Output:
Source: local data frame [2 x 2]
sex n
1 F 3
2 M 3
Using sqldf
:
library(sqldf)
sqldf("SELECT sex, COUNT(DISTINCT(id)) AS n
FROM dta GROUP BY sex")
Output:
sex n
1 F 3
2 M 3
Upvotes: 2
Reputation: 24945
There are a ton of different ways to do this, here's one:
dta %>% distinct(id) %>%
group_by(sex) %>%
summarise(n())
EDIT: After some discussion, let's test how quick varying methods work.
First, some larger data:
dta <- data.frame(id = rep(1:500, 30),
sex = rep (c("M", "F"), 750),
child = rep(c(1, 0, 0, 1), 375))
Now let's run our varying methods:
library(microbenchmark)
microbenchmark(
distinctcount = dta %>% distinct(id) %>% count(sex),
uniquecount = dta %>% unique %>% count(sex),
distinctsummarise = dta %>% distinct(id) %>% group_by(sex) %>% summarise(n()),
uniquesummarise = dta %>% unique %>% group_by(sex) %>% summarise(n()),
distincttally= dta %>% distinct(id) %>% group_by(sex) %>% tally
)
On my machine:
Unit: milliseconds
expr min lq mean median uq max neval
distinctcount 1.576307 1.602803 1.664385 1.630643 1.670195 2.233710 100
uniquecount 32.391659 32.885479 33.194082 33.072485 33.244516 35.734735 100
distinctsummarise 1.724914 1.760817 1.815123 1.792114 1.830513 2.178798 100
uniquesummarise 32.757609 33.080933 33.490001 33.253155 33.463010 39.937194 100
distincttally 1.618547 1.656947 1.715741 1.685554 1.731058 2.383084 100
We can see unique works pretty poorly on larger data, so the quickest is:
dta %>% distinct(id) %>% count(sex)
Upvotes: 3