Reputation: 15
I have a data.frame that looks like this (but with more columns and rows):
I want to collapse rows that are identical in every column into a single row and add a last column "count" with the number of occurrences, to get something like this:
Thank you for your help!
data:
structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2,
2, 3, 4), `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 0
Views: 1171
Reputation: 134
Check this code
df=structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2,
2, 3, 4), `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
df1 = unique(df)            # unique rows
df2 = df[duplicated(df), ]  # repeated occurrences of rows (may be empty)
df3 = df1                   # result: unique rows plus a count column
df3['count'] = 1            # every unique row occurs at least once
for (i in seq_len(nrow(df2))) {    # each repeated row; seq_len() skips the loop when there are none
  for (j in seq_len(nrow(df1))) {  # each unique row
    if (all(df1[j, ] == df2[i, ])) {  # every column matches
      df3[j, 'count'] <- df3[j, 'count'] + 1  # increment the count by one
    }
  }
}
Upvotes: -1
Reputation: 147
In SQL terms, you can count rows grouped by all columns and join the result with the initial data.frame.
I recommend using the data.table package.
df = data.frame(a = c(1, 1, 2, 3, 4, 4, 4), b = c("a", "a", "b", "b", "e", "e", "f"))
library(data.table)
# convert df to data.table
df = as.data.table(df)
# count rows, grouping by all columns (.N is the size of each group)
clmns = colnames(df)
row_multiplicity = df[, .N, by = clmns]
# join the counts back onto the initial table; every original row gains a column N
new_df = merge(df, row_multiplicity, by = clmns)
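For comparison, the same group-by-all-columns count can be sketched in base R with aggregate(). This is only an illustration, not part of the answer above: the helper column N = 1L is an assumption introduced here so that summing it gives the row count, and the formula interface of aggregate() drops rows containing NA by default.

```r
# Same toy data as in the answer above
df <- data.frame(a = c(1, 1, 2, 3, 4, 4, 4),
                 b = c("a", "a", "b", "b", "e", "e", "f"))

# Add a helper column of ones (an assumption of this sketch), then sum it
# within every distinct combination of the remaining columns
row_multiplicity <- aggregate(N ~ ., data = cbind(df, N = 1L), FUN = sum)

# Optional: attach the counts back to every original row via the shared columns
new_df <- merge(df, row_multiplicity)
```

Here row_multiplicity has one row per distinct (a, b) pair, which is the de-duplicated table the question asks for, while new_df keeps all seven original rows.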
Upvotes: 1
Reputation: 5138
There are several ways to do this, but here is a dplyr solution that counts rows together only when all of their columns are identical. It groups by all columns, adds a Count column with the size of each group (i.e., n()), and then ungroups and removes duplicate rows using distinct():
library(dplyr)
df1 %>%
group_by(across(everything())) %>%
mutate(Count = n()) %>%
ungroup() %>%
distinct()
# A tibble: 3 x 5
  Gene  Cell_1 Cell_2 Cell_3 Count
  <chr>  <dbl>  <dbl>  <dbl> <int>
1 A          2      2      2     2
2 B          3      3      3     1
3 C          4      4      4     1
Or, a possible data.table solution using the same logic:
library(data.table)
setDT(df1)
df1[, Count := .N, by = names(df1)]
unique(df1)
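Or, a base solution that replaces grouping with a per-row key built from all columns (counting with duplicated() alone would undercount rows that occur more than twice):
key = do.call(paste, c(df1, sep = "\r"))  # one string per row, built from every column
df1$Count = ave(seq_along(key), key, FUN = length)  # number of rows sharing each key
df1[!duplicated(key), ]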
Data:
df1 = data.frame(Gene = c("A", "A", "B", "C"))
df1[paste0("Cell_", 1:3)] = c(2, 2:4)
Upvotes: 2
Reputation: 10375
Toy example, not the most elegant way
mtcars2=mtcars[c(1,1,2,3),]
do.call(rbind,
by(
mtcars2,
mtcars2,
function(x){
data.frame(unique(x),"Count"=nrow(x))
})
)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb Count
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1     1
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     2
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     1
Edit: OP provided data
df=structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2,
2, 3, 4), `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
do.call(rbind,
by(df,
df,
function(x){
data.frame(unique(x),"Count"=nrow(x))
}
)
)
  Gene Cell.1 Cell.2 Cell.3 Count
1    A      2      2      2     2
3    B      3      3      3     1
4    C      4      4      4     1
Upvotes: 2
Reputation: 11584
Using the dplyr package (note that this counts by Gene only, not by all columns):
library(dplyr)
df %>%
  add_count(Gene, name = 'Count') %>%
  group_by(Gene) %>%
  filter(row_number() == 1)
# A tibble: 3 x 5
# Groups: Gene [3]
  Gene  `Cell 1` `Cell 2` `Cell 3` Count
  <chr>    <dbl>    <dbl>    <dbl> <int>
1 A            2        2        2     2
2 B            3        3        3     1
3 C            4        4        4     1
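Because this pipeline counts and de-duplicates by Gene alone, it matches the all-columns requirement only when rows sharing a Gene are identical everywhere else. A small base R sketch with hypothetical data (df_demo and its values are made up for illustration) shows where the two notions diverge:

```r
# Hypothetical data: gene "A" appears twice with different values
df_demo <- data.frame(Gene   = c("A", "A", "B"),
                      Cell_1 = c(2, 5, 3),
                      Cell_2 = c(2, 5, 3))

# Rows per gene: "A" is counted twice
by_gene <- table(df_demo$Gene)

# Fully distinct rows: all three rows differ in at least one column
n_distinct_rows <- nrow(unique(df_demo))
```

With data like this, keeping the first row per Gene would report a count of 2 for A and drop the second A row, even though no two rows are identical.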
Data used:
structure(list(Gene = c("A", "A", "B", "C"), `Cell 1` = c(2,
2, 3, 4), `Cell 2` = c(2, 2, 3, 4), `Cell 3` = c(2, 2, 3, 4)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 0