Reputation: 117

Counting Number of Times Each Row is Duplicated in R

In my dataset, I want to count the number of times each row appears in my dataset, which consists of five columns. I tried using table; however, this seems to only work with seeing how many times one column, not multiple, is duplicated since I get the error

attempt to make a table with >= 2^31 elements

As a quick example, say my dataframe is as follows:

dat <- data.frame(
SSN = c(204,401,204,666,401), 
Name=c("Blossum","Buttercup","Blossum","MojoJojo","Buttercup"), 
Age = c(7,8,7,43,8), 
Gender = c(0,0,0,1,0)
)

How do I add another column with how many times each row appears in this dataframe?

Upvotes: 3

Answers (4)

Billy34

Reputation: 2214

Using data.table package. setDT is used to inplace transform data.frame into a data.table.

Inplace (:=) modification of dat by adding count (.N) of lines grouped by all columns of dat (by=names(dat)).

Note: inplace modification result is invisible. So you need to explicitly print it or add [] after (dat[, ...][]).

setDT(dat)
dat[,by=names(dat),N:=.N][]
#>    SSN      Name Age Gender N
#> 1: 204   Blossum   7      0 2
#> 2: 401 Buttercup   8      0 2
#> 3: 204   Blossum   7      0 2
#> 4: 666  MojoJojo  43      1 1
#> 5: 401 Buttercup   8      0 2

or (to collapse lines)

setDT(dat)
dat[,by=names(dat),.N]
#>    SSN      Name Age Gender N
#> 1: 204   Blossum   7      0 2
#> 2: 401 Buttercup   8      0 2
#> 3: 666  MojoJojo  43      1 1

Upvotes: 2

ThomasIsCoding

Reputation: 102700

I am not sure which is your desired output. Below are some base R options

> aggregate(
+   cnt ~ .,
+   cbind(dat, cnt = 1),
+   sum
+ )
  SSN      Name Age Gender cnt
1 204   Blossum   7      0   2
2 401 Buttercup   8      0   2
3 666  MojoJojo  43      1   1

> transform(
+   cbind(dat, n = 1),
+   n = ave(n, SSN, Name, Age, Gender, FUN = sum)
+ )
  SSN      Name Age Gender n
1 204   Blossum   7      0 2
2 401 Buttercup   8      0 2
3 204   Blossum   7      0 2
4 666  MojoJojo  43      1 1
5 401 Buttercup   8      0 2

Upvotes: 0

akrun

Reputation: 887881

We can use add_count without grouping as well

library(dplyr)
dat %>% 
   add_count(across(everything()))

-output

#  SSN      Name Age Gender n
#1 204   Blossum   7      0 2
#2 401 Buttercup   8      0 2
#3 204   Blossum   7      0 2
#4 666  MojoJojo  43      1 1
#5 401 Buttercup   8      0 2

Upvotes: 1

Gregor Thomas

Reputation: 146164

With dplyr, we could group by all columns:

dat %>%
  group_by(across(everything())) %>%
  mutate(n = n())
# # A tibble: 5 x 5
# # Groups:   SSN, Name, Age, Gender [3]
#     SSN Name        Age Gender     n
#   <dbl> <chr>     <dbl>  <dbl> <int>
# 1   204 Blossum       7      0     2
# 2   401 Buttercup     8      0     2
# 3   204 Blossum       7      0     2
# 4   666 MojoJojo     43      1     1
# 5   401 Buttercup     8      0     2

(mutate(n = n()) is has a shortcut, add_tally(), if you prefer. Use summarize(n = n() or count() if you want to collapse the data frame to the unique rows while adding counts)

Upvotes: 3

Counting Number of Times Each Row is Duplicated in R

Answers (4)

Related Questions