Reputation: 5719
I have this data frame, mydf. I want to remove the duplicate comma-separated items within each row of the column customer_sample_id and get the count of unique items (new.freq), as shown in the result.
mydf <- structure(list(freq = c(6, 3, 3), customer_sample_id = c("AMLM12001KP ( chr2 : chr9 ),1028701 ( chr2 : chr9 ),1028701 ( chr2 : chr9 ),1220901 ( chr2 : chr9 ),AMLM12015WPS ( chr2 : chr9 ),AML203 ( chr2 : chr9 )",
"AMLM12001KP ( chr2 : chr20 ),1123801 ( chr2 : chr20 ),AMLM12020M-B ( chr2 : chr20 )",
"AMLM12001KP ( chr4 : chr17 ),AMLM12001KP ( chr4 : chr17 ),1031901 ( chr4 : chr17 )"
)), .Names = c("freq", "customer_sample_id"), row.names = c(1L,
2L, 3L), class = "data.frame")
result
new.freq uniq.customer_sample_id
1 5 AMLM12001KP ( chr2 : chr9 ),1028701 ( chr2 : chr9 ),1220901 ( chr2 : chr9 ),AMLM12015WPS ( chr2 : chr9 ),AML203 ( chr2 : chr9 )
2 3 AMLM12001KP ( chr2 : chr20 ),1123801 ( chr2 : chr20 ),AMLM12020M-B ( chr2 : chr20 )
3 2 AMLM12001KP ( chr4 : chr17 ),1031901 ( chr4 : chr17 )
Upvotes: 1
Views: 60
Reputation: 887118
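We can use strsplit to split each string on the commas, keep only the unique elements of each row with unique, and bind the per-row results back into a data frame: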
res <- do.call(rbind,lapply(strsplit(mydf[,2], ','),
function(x) {
x1 <- unique(x)
data.frame(new.freq=length(x1), uniq.customer_sample_id=toString(x1))}))
res
#  new.freq uniq.customer_sample_id
#1 5 AMLM12001KP ( chr2 : chr9 ), 1028701 ( chr2 : chr9 ), 1220901 ( chr2 : chr9 ), AMLM12015WPS ( chr2 : chr9 ), AML203 ( chr2 : chr9 )
#2 3 AMLM12001KP ( chr2 : chr20 ), 1123801 ( chr2 : chr20 ), AMLM12020M-B ( chr2 : chr20 )
#3 2 AMLM12001KP ( chr4 : chr17 ), 1031901 ( chr4 : chr17 )
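Note that toString joins the items with ", " (comma plus space), whereas the expected result in the question uses a bare comma. If the exact format matters, a possible variant is to collapse with paste instead. The following is only a sketch: the small mydf below is a hypothetical, shortened stand-in for the data in the question, and it assumes customer_sample_id is a character column.

```r
# Hypothetical, shortened stand-in for mydf (same column layout as the question)
mydf <- data.frame(
  freq = c(3, 2),
  customer_sample_id = c("A ( chr2 : chr9 ),B ( chr2 : chr9 ),B ( chr2 : chr9 )",
                         "A ( chr4 : chr17 ),A ( chr4 : chr17 )"),
  stringsAsFactors = FALSE
)

# Split each row on literal commas and drop repeated items within the row
parts <- lapply(strsplit(mydf$customer_sample_id, ",", fixed = TRUE), unique)

# Count the unique items and rejoin them with a bare comma (no space)
res <- data.frame(
  new.freq = lengths(parts),
  uniq.customer_sample_id = vapply(parts, paste, character(1), collapse = ","),
  stringsAsFactors = FALSE
)
res
```

Building the two columns as vectors also avoids creating one small data frame per row and then calling rbind on all of them.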
Upvotes: 1