Kryo

Reputation: 933

Frequency of unique character

I'm trying to find, for each unique gene, the number of samples (ids) it appears in, together with its respective p-values.

df1 <-  read.table(text="
        Gene        id           Seg.mean    pValue    CNA
         Nfib       8410          0.3108     1.381913 gain
         Mycl       8410          2.7320     1.182842 gain
         Mycl       8410          2.7320     1.846275 gain
         Nfib       8411          0.5920     1.381913 gain
         Nfib       8411          1.3090     1.381913 gain
         Mycl       8412          1.6150     5.765442 gain
         Mycl       8411          1.6150     1.846275 gain
",header=TRUE)

Expected output:

Gene    ID           Freq. of id   pValue
Nfib    8410,8411        2           1.381913
Mycl    8410,8411,8412   3           1.182842,1.846275,5.765442

Upvotes: 3

Views: 88

Answers (3)

Prasanna Nandakumar

Reputation: 4335

library(plyr)
# One row per Gene: distinct ids, distinct p-values, and the count of distinct ids
ddply(df1, .(Gene), summarise,
      ID     = paste(unique(id), collapse = ","),
      pValue = paste(unique(pValue), collapse = ","),
      Freq   = length(unique(id)))
  Gene             ID                     pValue Freq
1 Mycl 8410,8412,8411 1.182842,1.846275,5.765442    3
2 Nfib      8410,8411                   1.381913    2

Upvotes: 1

mucio

Reputation: 7119

I think you can use data.table to get very close to the result you want to achieve:

library(data.table)

df1<-data.table(df1)
df1[,
list(ID = paste(unique(id), collapse=','),
     "Freq. of id"=length(unique(id)), 
     pValue=paste(unique(pValue), collapse=",")),
keyby=list(Gene)]
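
The remaining difference from the expected output is the order of the ids: unique() keeps values in order of first appearance, so Mycl comes out as 8410,8412,8411. A minimal sketch that sorts before pasting, assuming sorted ids and p-values are what is wanted (the column name Freq and the object names dt and res are my own choices, not from the question):

```r
library(data.table)

# Sample data from the question
df1 <- read.table(text = "
Gene id Seg.mean pValue CNA
Nfib 8410 0.3108 1.381913 gain
Mycl 8410 2.7320 1.182842 gain
Mycl 8410 2.7320 1.846275 gain
Nfib 8411 0.5920 1.381913 gain
Nfib 8411 1.3090 1.381913 gain
Mycl 8412 1.6150 5.765442 gain
Mycl 8411 1.6150 1.846275 gain
", header = TRUE)

dt <- as.data.table(df1)

# sort() before paste() so the collapsed values are in ascending order;
# uniqueN() counts the distinct ids within each Gene
res <- dt[, .(ID     = paste(sort(unique(id)), collapse = ","),
              Freq   = uniqueN(id),
              pValue = paste(sort(unique(pValue)), collapse = ",")),
          keyby = Gene]
print(res)
```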

Upvotes: 1

npjc

Reputation: 4194

solution:

library(dplyr)

df1 %>% 
  group_by(Gene) %>% 
  summarise(ID = paste0(unique(id), collapse=", "),
            pval = paste0(unique(pValue),collapse=", "), 
            n = n_distinct(id))

result:

  Gene               ID                         pval n
1 Mycl 8410, 8412, 8411 1.182842, 1.846275, 5.765442 3
2 Nfib       8410, 8411                     1.381913 2

breakdown:

  1. Gene is the unit of analysis, so we group_by(Gene).
  2. we then create new variables by pasting together the unique values of a column: paste0(unique(var), collapse=", "). This is applied per Gene.
  3. n_distinct(id) counts the number of distinct ids, again applied per Gene.
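
The same three steps can also be sketched in base R, in case no extra package is wanted. This is only an illustration: collapse_unique is a hypothetical helper, and sorting the values before pasting is an assumption made to match the order shown in the question's expected output.

```r
# Sample data from the question
df1 <- read.table(text = "
Gene id Seg.mean pValue CNA
Nfib 8410 0.3108 1.381913 gain
Mycl 8410 2.7320 1.182842 gain
Mycl 8410 2.7320 1.846275 gain
Nfib 8411 0.5920 1.381913 gain
Nfib 8411 1.3090 1.381913 gain
Mycl 8412 1.6150 5.765442 gain
Mycl 8411 1.6150 1.846275 gain
", header = TRUE)

# Hypothetical helper: paste the sorted distinct values into one string
collapse_unique <- function(x) paste(sort(unique(x)), collapse = ",")

# Step 1: split the rows by Gene.
# Steps 2 and 3: build one summary row per group, then stack the rows.
res <- do.call(rbind, lapply(split(df1, df1$Gene), function(g) {
  data.frame(
    Gene   = g$Gene[1],
    ID     = collapse_unique(g$id),
    Freq   = length(unique(g$id)),
    pValue = collapse_unique(g$pValue)
  )
}))
rownames(res) <- NULL
print(res)
```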

Upvotes: 2
