Jelica

Reputation: 15

Aggregate and count identical rows in R

I have a data.frame like the one given under "data" below (but with a larger number of columns and rows).

I want to collapse the rows that are identical in all columns into a single row and add a last column, "count", recording how many times each row occurs.

Thank you for your help!

data:

structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2, 
2, 3, 4), `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 0

Views: 1171

Answers (5)

Samsani Hymavathi

Reputation: 134

This code loops over the duplicated rows and increments the count of the matching unique row:

df <- structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2, 2, 3, 4),
                     `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)),
                row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

df1 <- unique(df)             # store the unique rows
df2 <- df[duplicated(df), ]   # store the duplicated rows

df3 <- df1                    # copy the unique rows into a new data frame
df3["count"] <- 1             # every unique row occurs at least once

# seq_len() skips the loop entirely when there are no duplicated rows,
# whereas 1:nrow(df2) would iterate over c(1, 0)
for (i in seq_len(nrow(df2))) {        # duplicated rows
  for (j in seq_len(nrow(df1))) {      # unique rows
    if (all(df1[j, ] == df2[i, ])) {   # check that all columns match
      df3[j, "count"] <- df3[j, "count"] + 1   # increase the count by one
    }
  }
}

Upvotes: -1

Dimitrios Panagopoulos

Reputation: 147

In SQL terms, you can count rows grouping by all columns and join the result with the initial data.frame.

I recommend using the data.table package.

df=data.frame(a=c(1,1,2,3,4,4,4),b=c("a","a","b","b","e","e","f"))

library(data.table)

# convert df to data.table
df=as.data.table(df)

# aggregate df grouping by all columns
clmns=colnames(df)
row_multiplicity=df[,.N,by=clmns]

#join/merge with initial data.frame
new_df=merge(df,row_multiplicity)
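
The same group-count-join idea can also be sketched in base R, without data.table. This is only an illustration on the same toy data; the helper column N is an assumption chosen to mirror data.table's .N and could be named anything.

```r
# Same toy data as in the data.table example
df <- data.frame(a = c(1, 1, 2, 3, 4, 4, 4),
                 b = c("a", "a", "b", "b", "e", "e", "f"))

# Count rows per unique combination of all columns:
# add a helper column of ones and sum it within each group
counts <- aggregate(N ~ ., data = cbind(df, N = 1L), FUN = sum)

# Join the counts back onto the original rows (matches on the shared columns a and b)
joined <- merge(df, counts)

print(counts)
print(joined)
```

Note that counts already holds one row per distinct combination, so the merge is only needed if every original row should carry its count.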

Upvotes: 1

Andrew

Reputation: 5138

There are several ways to do this. Here is a dplyr solution that counts rows as duplicates only when every column is identical. It groups by all columns, adds a Count column holding the size of each group (i.e., n()), then ungroups and removes the duplicate rows using distinct():

library(dplyr)

df1 %>%
  group_by(across(everything())) %>%
  mutate(Count = n()) %>%
  ungroup() %>%
  distinct()
# A tibble: 3 x 5
  Gene  Cell_1 Cell_2 Cell_3 Count
  <chr>  <dbl>  <dbl>  <dbl> <int>
1 A          2      2      2     2
2 B          3      3      3     1
3 C          4      4      4     1

Or, a possible data.table solution using the same logic:

library(data.table)

setDT(df1)
df1[, Count := .N, by = names(df1)]
unique(df1)

Or, a base solution that replaces grouping with data.frame-wide duplicate detection:

# ave() returns, for each row, the number of rows identical to it in every column
df1$Count = ave(seq_len(nrow(df1)), df1, FUN = length)
df1[!duplicated(df1), ]

Data:

df1 = data.frame(Gene = c("A", "A", "B", "C"))
df1[paste0("Cell_", 1:3)] = c(2, 2:4)

Upvotes: 2

user2974951

Reputation: 10375

A toy example, though not the most elegant way:

mtcars2=mtcars[c(1,1,2,3),]

do.call(rbind,
  by(
    mtcars2,
    mtcars2,
    function(x){
      data.frame(unique(x),"Count"=nrow(x))
    })
)

               mpg cyl disp  hp drat    wt  qsec vs am gear carb Count
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1     1
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     2
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     1

Edit: OP provided data

df=structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2, 2, 3, 4),
                  `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)),
             row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
do.call(rbind,
  by(df,
     df,
     function(x){
       data.frame(unique(x),"Count"=nrow(x))
     }
  )
)

  Gene Cell.1 Cell.2 Cell.3 Count
1    A      2      2      2     2
3    B      3      3      3     1
4    C      4      4      4     1
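
The column names in this output read Cell.1 rather than `Cell 1` because data.frame() makes names syntactically valid by default. A minimal sketch, assuming the original names should be kept, passes check.names = FALSE; the plain data.frame used here is a stand-in for the OP's tibble.

```r
# Stand-in for the OP's data, built as a plain data.frame that keeps the spaces
df <- data.frame(Gene = c("A", "A", "B", "C"),
                 `Cell 1` = c(2, 2, 3, 4),
                 `Cell 2` = c(2, 2, 3, 4),
                 `Cell 3` = c(2, 2, 3, 4),
                 check.names = FALSE)

# Same by()/rbind approach, but without name sanitizing inside the function
res <- do.call(rbind,
  by(df, df, function(x) {
    data.frame(unique(x), Count = nrow(x), check.names = FALSE)
  })
)
print(res)
```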

Upvotes: 2

Karthik S

Reputation: 11584

Using the dplyr package:

> library(dplyr)
> df %>% add_count(Gene, name = 'Count') %>% group_by(Gene) %>% filter(row_number() == 1)
# A tibble: 3 x 5
# Groups:   Gene [3]
  Gene  `Cell 1` `Cell 2` `Cell 3` Count
  <chr>    <dbl>    <dbl>    <dbl> <int>
1 A            2        2        2     2
2 B            3        3        3     1
3 C            4        4        4     1

Data used:

structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2, 
2, 3, 4), `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))
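
Counting by Gene alone gives the requested result only when rows sharing a Gene are identical in every other column, as they are in the data above. A minimal base-R sketch, using hypothetical data in which one "A" row differs in a single cell, shows how the two ways of counting can diverge:

```r
# Hypothetical data: the third row has Gene "A" but a different Cell_1 value
df <- data.frame(Gene   = c("A", "A", "A", "B"),
                 Cell_1 = c(2, 2, 9, 3),
                 Cell_2 = c(2, 2, 2, 3))

# Counting by Gene only: all three "A" rows fall into one group
by_gene <- as.data.frame(table(Gene = df$Gene), responseName = "Count")

# Counting by every column: the differing "A" row gets its own group
by_all <- aggregate(Count ~ ., data = cbind(df, Count = 1L), FUN = sum)

print(by_gene)
print(by_all)
```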

Upvotes: 0
