Alokin
Alokin

Reputation: 505

How to Count Unique rows in a data frame?

I have a data frame in R which has a lot of duplicate records. I am interested in finding out how many records of each are in this data frame.

For example, I have this data frame:

Fake Name       Fake ID    Fake Status   Fake Program
June             0003         Green        PR1
June             0003         Green        PR1
Television       202          Blue         PR3
Television       202          Green        PR3    
Television       202          Green        PR3
CRT              12           Red          PR0

And from the above I would want to get something that's like similar to below:

Fake Name       Fake ID    Fake Status   Fake Program     COUNT
June             0003         Green        PR1              2
Television       202          Blue         PR3              1
Television       202          Green        PR3              2
CRT              12           Red          PR0              1

Any help would be appreciated. Thank you.

Upvotes: 13

Views: 23485

Answers (6)

Rui Barradas
Rui Barradas

Reputation: 76402

The following uses duplicated to get the result data.frame and then rle to get the counts.

res <- dat[!duplicated(dat), ]

d <- duplicated(dat) | duplicated(dat, fromLast = TRUE)
res$COUNT <- rle(d)$lengths

res
#   Fake Name Fake ID Fake Status Fake Program COUNT
#1       June    0003       Green          PR1     2
#3 Television     202        Blue          PR3     1
#4 Television     202       Green          PR3     2
#6        CRT      12         Red          PR0     1

Edit

This answer only works for consecutive rows, see the discussion in comments, in particular the 2nd comment by user @Colombo (my emphasis).

The question title was about unique rows and the question text did not mention that they are consecutive. Other questions asking about counting unique rows are pointing to this. I arrived at the question searching for a similar problem. So due to this, just because the example data are consecutive doesn't mean that the general problem might be. If you prepended "Assuming the rows are consecutive", I would be fine with this. But as it stands out here, it does lead to confusion. Especially if many of the other answers are solving the general problem of counting unique rows.

Upvotes: 6

akuiper
akuiper

Reputation: 214957

Use group_by_all then count the number of rows with n:

df %>% group_by_all() %>% summarise(COUNT = n())

# A tibble: 4 x 5
# Groups:   Fake.Name, Fake.ID, Fake.Status [?]
#  Fake.Name  Fake.ID Fake.Status Fake.Program COUNT
#  <fct>        <int> <fct>       <fct>        <int>
#1 CRT             12 Red         PR0              1
#2 June             3 Green       PR1              2
#3 Television     202 Blue        PR3              1
#4 Television     202 Green       PR3              2

Or even better as from @Ryan's comment:

df %>% group_by_all %>% count

Update (2023-11-22)

In dplyr 1.1.0, the standard way is to use pick within count, like so:

count(df, pick(everything()))

Upvotes: 20

Ma&#235;l
Ma&#235;l

Reputation: 51994

You can use collapse::fcount. fcount is noticeably faster than any other options (see benchmark on another answer).

library(collapse)
fcount(data)

#    Fake Name Fake ID Fake Status Fake Program N
# 1       June    0003       Green          PR1 2
# 2 Television     202        Blue          PR3 1
# 3 Television     202       Green          PR3 2
# 4        CRT      12         Red          PR0 1

Upvotes: 2

Jack Novack
Jack Novack

Reputation: 41

you could use:

library(dplyr)
n_distinct(data$col)

Upvotes: 3

Joby D
Joby D

Reputation: 41

In base R, the table function provides tabular multi-way counts of every factor combination in your data frame. The result can then be converted to data frame that matches your original structure, with an added "Freq" column containing counts.

data.frame(table(df))

#    Fake.Name Fake.ID Fake.Status Fake.Program Freq
#1         CRT    0003        Blue          PR0    0
#2        June    0003        Blue          PR0    0
#3  Television    0003        Blue          PR0    0
#4         CRT      12        Blue          PR0    0

Of course, every combination might not be needed, so you can restrict it to the rows with positive counts:

subset(data.frame(table(df)), Freq > 0)

#    Fake.Name Fake.ID Fake.Status Fake.Program Freq
#22        CRT      12         Red          PR0    1
#38       June    0003       Green          PR1    2
#63 Television     202        Blue          PR3    1
#72 Television     202       Green          PR3    2

Upvotes: 4

Benjamin Christoffersen
Benjamin Christoffersen

Reputation: 4841

To the question

How to Count Unique rows in a data frame?

Then use sum and duplicated. E.g.,

df <- data.frame(
  `Fake Name` = c(
    "June", "June", "Television", "Television", "Television", "CRT"),
  `Fake ID` = c("0003", "0003", "202", "202", "202", "12"),
  `Fake Status` = c("Green", "Green", "Blue", "Green", "Green", "Red"),
  `Fake Program` = c("PR1", "PR1", "PR3", "PR3", "PR3", "PR0"), 
  check.names = FALSE)
df
#R    Fake Name Fake ID Fake Status Fake Program
#R 1       June    0003       Green          PR1
#R 2       June    0003       Green          PR1
#R 3 Television     202        Blue          PR3
#R 4 Television     202       Green          PR3
#R 5 Television     202       Green          PR3
#R 6        CRT      12         Red          PR0
sum(!duplicated(df))
#R [1] 4

For the table you request then you can use data.table as follows

library(data.table)
df <- data.table(df)
df[, .(COUNT = .N), by = names(df)]
#R     Fake Name Fake ID Fake Status Fake Program COUNT
#R 1:       June    0003       Green          PR1     2
#R 2: Television     202        Blue          PR3     1
#R 3: Television     202       Green          PR3     2
#R 4:        CRT      12         Red          PR0     1

Upvotes: 4

Related Questions