Reputation: 15
I'm trying to do the following, starting from this data frame:
Chr Gene.Symbols
2 chr1 GSTM1
3 chr2 MIR4432
4 chr2 BCL11A
5 chr2 PAPOLG
6 chr2 LINC01185
7 chr2 LINC01185
8 chr2 LINC01185, REL
9 chr2 REL
10 chr2 REL
11 chr2 REL
12 chr2 REL
13 chr2
14 chr2 PUS10
15 chr2 PEX13, KIAA1841
I would like to have this result:
Chr Gene.Symbols
2 chr1 GSTM1
3 chr2 MIR4432,BCL11A,PAPOLG,LINC01185,REL,PUS10,PEX13,KIAA1841
I've managed to aggregate the gene symbols together using:
aggregate(Gene.Symbols~Chr, data, paste, collapse = ",")
that I learned from other questions like this one, but I wasn't able to remove duplicates.
Can someone help me, please?
UPDATE: I also need a file with only the gene names, one per row (without the "Chr" column). How can I transpose the gene names? I am now starting from a file with one row per Chr, where each row has several genes in the Gene.Symbols column.
Upvotes: 1
Views: 899
Reputation: 83235
Yet another option:
library(splitstackshape) # automatically loads the 'data.table'-package
cSplit(mydf, 'Gene.Symbols', sep = ',', direction = 'long')[
  , .(Gene.Symbols = toString(unique(Gene.Symbols))), by = Chr]
which gives:
    Chr Gene.Symbols
1: chr1 GSTM1
2: chr2 MIR4432, BCL11A, PAPOLG, LINC01185, REL, PUS10, PEX13, KIAA1841
Upvotes: 3
Reputation: 51592
An idea via base R in two steps,
dd <- aggregate(Gene.Symbols ~ Chr, df, paste, collapse = ', ')
dd$Gene.Symbols <- sapply(strsplit(dd$Gene.Symbols, ", "), function(i)
paste(unique(i), collapse = ","))
which gives,
# Chr Gene.Symbols
#1 chr1 GSTM1
#2 chr2 MIR4432,BCL11A,PAPOLG,LINC01185,REL,,PUS10,PEX13,KIAA1841
A one-liner (compliments of @Cath) would be,
aggregate(Gene.Symbols ~ Chr, df, function(gene)
paste(unique(unlist(strsplit(gene, ", "))), collapse = ','))
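Note that the empty Gene.Symbols field in row 13 of the question's data survives as an empty element (the `,,` in the output above). Below is a minimal base R sketch, not a definitive solution, that drops such empty entries and also produces the one-gene-per-row file asked for in the UPDATE. The sample data is reconstructed from the question; the helper name `split_genes` and the output path are hypothetical.

```r
# Sample data reconstructed from the question (row 13 has an empty gene field).
df <- data.frame(
  Chr = c("chr1", rep("chr2", 13)),
  Gene.Symbols = c("GSTM1", "MIR4432", "BCL11A", "PAPOLG", "LINC01185",
                   "LINC01185", "LINC01185, REL", "REL", "REL", "REL", "REL",
                   "", "PUS10", "PEX13, KIAA1841"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, trim whitespace,
# drop empty strings, and keep each symbol once.
split_genes <- function(x) {
  genes <- trimws(unlist(strsplit(x, ",")))
  unique(genes[nzchar(genes)])
}

# One row per chromosome, with unique gene symbols collapsed into one string.
dd <- aggregate(Gene.Symbols ~ Chr, df,
                function(gene) paste(split_genes(gene), collapse = ","))

# One gene per row, without the Chr column.
genes <- data.frame(Gene.Symbols = split_genes(dd$Gene.Symbols),
                    stringsAsFactors = FALSE)

# Hypothetical output path; a temporary file is used here for illustration.
out_file <- tempfile(fileext = ".txt")
writeLines(genes$Gene.Symbols, out_file)
```

Because `split_genes` also de-duplicates across rows, a symbol that appeared on two chromosomes would be written only once to the file; apply `strsplit` per row instead if that is not wanted.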
Upvotes: 3
Reputation: 323306
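By using dplyr and tidyr:
library(dplyr)
library(tidyr)
# First, unnest the string (split on commas plus any following whitespace,
# so that " REL" and "REL" are not treated as different symbols)
df <- df %>%
  transform(Gene.Symbols = strsplit(Gene.Symbols, ",\\s*")) %>%
  unnest(Gene.Symbols)
# Then group by chromosome and collapse the unique symbols
df %>%
  group_by(Chr) %>%
  summarise(Gene.Symbols = toString(unique(Gene.Symbols)))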
# A tibble: 2 x 2
Chr Gene.Symbols
<chr> <chr>
1 chr1 GSTM1
2 chr2 MIR4432, BCL11A, PAPOLG, LINC01185, REL, PUS10, PEX13, KIAA1841
Upvotes: 2
Reputation: 5263
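collapse_unique <- function(x) {
  # Split multi-gene entries such as "LINC01185, REL" before de-duplicating
  paste(unique(unlist(strsplit(as.character(x), ",\\s*"))), collapse = ",")
}
aggregate(Gene.Symbols ~ Chr, data, collapse_unique)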
Upvotes: 2