Reputation: 15

Aggregate rows with a common value retaining unique values

I'm trying to aggregate the rows of this data frame by chromosome:

    Chr                Gene.Symbols
2  chr1                       GSTM1
3  chr2                     MIR4432
4  chr2                      BCL11A
5  chr2                      PAPOLG
6  chr2                   LINC01185
7  chr2                   LINC01185
8  chr2              LINC01185, REL
9  chr2                         REL
10 chr2                         REL
11 chr2                         REL
12 chr2                         REL
13 chr2                            
14 chr2                       PUS10
15 chr2             PEX13, KIAA1841

I would like to have this result:

    Chr             Gene.Symbols
2  chr1             GSTM1
3  chr2             MIR4432,BCL11A,PAPOLG,LINC01185,REL,PUS10,PEX13,KIAA1841

I've managed to aggregate the gene symbols together using:

aggregate(Gene.Symbols~Chr, data, paste, collapse = ",")

that I learned from other questions like this one, but I wasn't able to remove duplicates.

Can someone help me, please?

UPDATE: I also need a file with only the gene names, one per row (without the "Chr" column). How can I transpose the gene names? I am now starting from a file with one row per Chr, where each row has several genes in the Gene.Symbols column.

Upvotes: 1

Answers (4)

Jaap

Reputation: 83235

Yet another option:

library(splitstackshape) # automatically loads the 'data.table'-package
cSplit(mydf, 'Gene.Symbols', sep = ','
       , direction = 'long')[, .(Gene.Symbols = toString(unique(Gene.Symbols)))
                             , by = Chr]

which gives:

    Chr                                                    Gene.Symbols
1: chr1                                                           GSTM1
2: chr2 MIR4432, BCL11A, PAPOLG, LINC01185, REL, PUS10, PEX13, KIAA1841

Upvotes: 3

Sotos

Reputation: 51592

An idea via base R in two steps,

dd <- aggregate(Gene.Symbols ~ Chr, df, paste, collapse = ', ')

dd$Gene.Symbols <- sapply(strsplit(dd$Gene.Symbols, ", "), function(i) 
                                                    paste(unique(i), collapse = ","))

which gives,

#   Chr                                              Gene.Symbols
#1 chr1                                                     GSTM1
#2 chr2 MIR4432,BCL11A,PAPOLG,LINC01185,REL,,PUS10,PEX13,KIAA1841
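The doubled comma in the chr2 row comes from the empty Gene.Symbols cell (row 13 of the question's data), which survives the split as an empty string. Below is a minimal sketch of one way to drop it. It assumes the column is character rather than factor and that the blank cell is "" rather than NA; `df` is rebuilt here from the question's data, and `drop_empty_unique` is a hypothetical helper name.

```r
# Sample data rebuilt from the question (assumption: the blank cell is "")
df <- data.frame(
  Chr = c("chr1", rep("chr2", 13)),
  Gene.Symbols = c("GSTM1", "MIR4432", "BCL11A", "PAPOLG", "LINC01185",
                   "LINC01185", "LINC01185, REL", "REL", "REL", "REL", "REL",
                   "", "PUS10", "PEX13, KIAA1841"),
  stringsAsFactors = FALSE
)

# Step 1: paste all symbols of a chromosome into one string
dd <- aggregate(Gene.Symbols ~ Chr, df, paste, collapse = ", ")

# Step 2: split again, drop empty strings, keep unique symbols
# (drop_empty_unique is a hypothetical helper, not part of the answer's code)
drop_empty_unique <- function(s) {
  parts <- strsplit(s, ", ", fixed = TRUE)[[1]]
  paste(unique(parts[nzchar(parts)]), collapse = ",")
}
dd$Gene.Symbols <- vapply(dd$Gene.Symbols, drop_empty_unique, character(1),
                          USE.NAMES = FALSE)
dd
```

With this filter the chr2 row no longer contains the empty entry between REL and PUS10.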

A one-liner (compliments of @Cath) would be,

aggregate(Gene.Symbols ~ Chr, df, function(gene) 
                              paste(unique(unlist(strsplit(gene, ", "))), collapse = ','))
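The same `unlist(strsplit(...))` idiom may also cover the UPDATE in the question (one gene name per row, without the Chr column). The sketch below assumes the starting point is the aggregated data frame with comma-separated symbols in a character column; the names `dd`, `gene_rows`, and `out_file`, and the file name `genes.txt`, are placeholders rather than anything from the question.

```r
# Aggregated data as described in the UPDATE
# (assumption: one row per Chr, symbols separated by commas)
dd <- data.frame(
  Chr = c("chr1", "chr2"),
  Gene.Symbols = c("GSTM1",
                   "MIR4432,BCL11A,PAPOLG,LINC01185,REL,PUS10,PEX13,KIAA1841"),
  stringsAsFactors = FALSE
)

# Split every row on commas (optionally followed by spaces) and flatten
genes <- unlist(strsplit(dd$Gene.Symbols, ",\\s*"))

# Drop empty strings and duplicates, then build a one-column data frame
genes <- unique(genes[nzchar(genes)])
gene_rows <- data.frame(Gene.Symbols = genes, stringsAsFactors = FALSE)

# Write one gene per line; the path and file name are placeholders
out_file <- file.path(tempdir(), "genes.txt")
write.table(gene_rows, out_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Dropping the Chr column happens implicitly here, because only the Gene.Symbols column is split and written out.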

Upvotes: 3

BENY

Reputation: 323306

By using dplyr and tidyr

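library(dplyr)
library(tidyr)

# 1st: split each string into a list column and unnest it to one row per gene
df <- df %>%
    transform(Gene.Symbols = strsplit(Gene.Symbols, ",")) %>%
    unnest(Gene.Symbols)
# then group by chromosome and collapse the unique symbols
df %>% group_by(Chr) %>% summarise(Gene.Symbols = toString(unique(Gene.Symbols)))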

# A tibble: 2 x 2
    Chr                                                           Gene.Symbols
  <chr>                                                                  <chr>
1  chr1                                                                  GSTM1
2  chr2       MIR4432, BCL11A, PAPOLG, LINC01185, REL, PUS10, PEX13,  KIAA1841

Upvotes: 2

Nathan Werth

Reputation: 5263

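# Split each cell on commas first, so that entries like "LINC01185, REL"
# are deduplicated symbol by symbol; empty cells split to nothing and drop out
collapse_unique <- function(x) {
    symbols <- unlist(strsplit(as.character(x), ",\\s*"))
    paste(unique(symbols), collapse = ",")
}

aggregate(Gene.Symbols ~ Chr, data, collapse_unique)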

Upvotes: 2
