Reputation: 505
I have a data frame in R which has a lot of duplicate records. I am interested in finding out how many records of each are in this data frame.
For example, I have this data frame:
Fake Name Fake ID Fake Status Fake Program
June 0003 Green PR1
June 0003 Green PR1
Television 202 Blue PR3
Television 202 Green PR3
Television 202 Green PR3
CRT 12 Red PR0
And from the above I would want to get something that's like similar to below:
Fake Name Fake ID Fake Status Fake Program COUNT
June 0003 Green PR1 2
Television 202 Blue PR3 1
Television 202 Green PR3 2
CRT 12 Red PR0 1
Any help would be appreciated. Thank you.
Upvotes: 13
Views: 23485
Reputation: 76402
The following uses duplicated
to get the result data.frame and then rle
to get the counts.
res <- dat[!duplicated(dat), ]
d <- duplicated(dat) | duplicated(dat, fromLast = TRUE)
res$COUNT <- rle(d)$lengths
res
# Fake Name Fake ID Fake Status Fake Program COUNT
#1 June 0003 Green PR1 2
#3 Television 202 Blue PR3 1
#4 Television 202 Green PR3 2
#6 CRT 12 Red PR0 1
This answer only works for consecutive rows, see the discussion in comments, in particular the 2nd comment by user @Colombo (my emphasis).
The question title was about unique rows and the question text did not mention that they are consecutive. Other questions asking about counting unique rows are pointing to this. I arrived at the question searching for a similar problem. So due to this, just because the example data are consecutive doesn't mean that the general problem might be. If you prepended "Assuming the rows are consecutive", I would be fine with this. But as it stands out here, it does lead to confusion. Especially if many of the other answers are solving the general problem of counting unique rows.
Upvotes: 6
Reputation: 214957
Use group_by_all
then count the number of rows with n
:
df %>% group_by_all() %>% summarise(COUNT = n())
# A tibble: 4 x 5
# Groups: Fake.Name, Fake.ID, Fake.Status [?]
# Fake.Name Fake.ID Fake.Status Fake.Program COUNT
# <fct> <int> <fct> <fct> <int>
#1 CRT 12 Red PR0 1
#2 June 3 Green PR1 2
#3 Television 202 Blue PR3 1
#4 Television 202 Green PR3 2
Or even better as from @Ryan's comment:
df %>% group_by_all %>% count
Update (2023-11-22)
In dplyr 1.1.0
, the standard way is to use pick
within count
, like so:
count(df, pick(everything()))
Upvotes: 20
Reputation: 51994
You can use collapse::fcount
. fcount
is noticeably faster than any other options (see benchmark on another answer).
library(collapse)
fcount(data)
# Fake Name Fake ID Fake Status Fake Program N
# 1 June 0003 Green PR1 2
# 2 Television 202 Blue PR3 1
# 3 Television 202 Green PR3 2
# 4 CRT 12 Red PR0 1
Upvotes: 2
Reputation: 41
In base R, the table
function provides tabular multi-way counts of every factor combination in your data frame. The result can then be converted to data frame that matches your original structure, with an added "Freq" column containing counts.
data.frame(table(df))
# Fake.Name Fake.ID Fake.Status Fake.Program Freq
#1 CRT 0003 Blue PR0 0
#2 June 0003 Blue PR0 0
#3 Television 0003 Blue PR0 0
#4 CRT 12 Blue PR0 0
Of course, every combination might not be needed, so you can restrict it to the rows with positive counts:
subset(data.frame(table(df)), Freq > 0)
# Fake.Name Fake.ID Fake.Status Fake.Program Freq
#22 CRT 12 Red PR0 1
#38 June 0003 Green PR1 2
#63 Television 202 Blue PR3 1
#72 Television 202 Green PR3 2
Upvotes: 4
Reputation: 4841
To the question
How to Count Unique rows in a data frame?
Then use sum
and duplicated
. E.g.,
df <- data.frame(
`Fake Name` = c(
"June", "June", "Television", "Television", "Television", "CRT"),
`Fake ID` = c("0003", "0003", "202", "202", "202", "12"),
`Fake Status` = c("Green", "Green", "Blue", "Green", "Green", "Red"),
`Fake Program` = c("PR1", "PR1", "PR3", "PR3", "PR3", "PR0"),
check.names = FALSE)
df
#R Fake Name Fake ID Fake Status Fake Program
#R 1 June 0003 Green PR1
#R 2 June 0003 Green PR1
#R 3 Television 202 Blue PR3
#R 4 Television 202 Green PR3
#R 5 Television 202 Green PR3
#R 6 CRT 12 Red PR0
sum(!duplicated(df))
#R [1] 4
For the table you request then you can use data.table
as follows
library(data.table)
df <- data.table(df)
df[, .(COUNT = .N), by = names(df)]
#R Fake Name Fake ID Fake Status Fake Program COUNT
#R 1: June 0003 Green PR1 2
#R 2: Television 202 Blue PR3 1
#R 3: Television 202 Green PR3 2
#R 4: CRT 12 Red PR0 1
Upvotes: 4