Reputation: 251
I'm still learning how to use tidyr. I'd like to use "gather()" to make columns into multiple rows, and preserve the "gene_ID" column by copying it where applicable. Example input data:
gene_ID | path1 | path2 | path3 | path4 | path5 | path6 | path7 | path8
CAMNT_0043146643 | RNA transport
CAMNT_0029561721 | Ribosome
CAMNT_0024703307 | Sphingolipid signaling pathway | Lysosome
CAMNT_0020981363 | mRNA surveillance pathway | Hippo signaling pathway | cAMP signaling pathway | cGMP - PKG signaling pathway | Regulation of actin cytoskeleton | Meiosis - yeast | Oocyte meiosis | Focal adhesion
CAMNT_0020021387 | Spliceosome | Protein processing in endoplasmic reticulum | MAPK signaling pathway | Endocytosis
CAMNT_0003293445 | Spliceosome | Protein processing in endoplasmic reticulum | MAPK signaling pathway | Endocytosis
Example of desired output data:
gene_ID Pathway
CAMNT_0043146643 RNA transport
CAMNT_0029561721 Ribosome
CAMNT_0024703307 Lysosome
CAMNT_0024703307 Sphingolipid signaling pathway
CAMNT_0020981363 mRNA surveillance pathway
CAMNT_0020981363 Hippo signaling pathway
CAMNT_0020981363 cAMP signaling pathway
CAMNT_0020981363 cGMP - PKG signaling pathway
CAMNT_0020981363 Regulation of actin cytoskeleton
CAMNT_0020981363 Meiosis - yeast
CAMNT_0020981363 Oocyte meiosis
CAMNT_0020981363 Focal adhesion
CAMNT_0020021387 Spliceosome
CAMNT_0020021387 Protein processing in endoplasmic reticulum
CAMNT_0020021387 MAPK signaling pathway
CAMNT_0020021387 Endocytosis
CAMNT_0003293445 Spliceosome
CAMNT_0003293445 Protein processing in endoplasmic reticulum
CAMNT_0003293445 MAPK signaling pathway
CAMNT_0003293445 Endocytosis
Currently, I'm trying to do:
temp<-gather(extract,"gene_ID",path1:path8)
but I get an error message: "Error: Invalid column specification". I've tried this with and without headers in my input df, but the same error occurs. I'm open to an alternative approach, but I've had trouble with NAs because not every gene_ID row has the same number of filled columns.
Suggestions on how to proceed?
Upvotes: 1
Views: 3853
Reputation: 9123
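Here is a tidyr solution (df is a small example data frame with columns x, path1 and path2):
library(tidyr)
library(dplyr)  # provides filter() and select(), and re-exports %>%
df %>%
  gather(path, Pathway, path1, path2) %>%  # stack path1 and path2 into one Pathway column
  filter(Pathway != "") %>%                # drop the empty cells
  select(-path)                            # drop the column holding the old column names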
x Pathway
1 a test1
2 b test1
3 c test2
4 d test2
5 e test3
6 a testa
7 c testg
8 d testd
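Applied to the question's layout, the same pattern might look like the sketch below. In gather(), the second and third arguments name the new key and value columns, and the columns to gather come after them; the call in the question passes "gene_ID" as the key and path1:path8 as the value name, which is probably what triggers the error. The `extract` data frame here is a hypothetical two-gene stand-in with only two path columns, not the asker's real data.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the question's `extract` data frame
extract <- data.frame(
  gene_ID = c("CAMNT_0043146643", "CAMNT_0024703307"),
  path1   = c("RNA transport", "Sphingolipid signaling pathway"),
  path2   = c("", "Lysosome"),
  stringsAsFactors = FALSE
)

temp <- extract %>%
  gather(path, Pathway, path1:path2) %>%  # key = "path", value = "Pathway"
  filter(Pathway != "") %>%               # drops empty cells (and NA cells)
  select(-path) %>%                       # keep only gene_ID and Pathway
  arrange(gene_ID)                        # group each gene's rows together

print(temp)
```

With the full data, path1:path2 would become path1:path8. Because filter() also discards rows where the condition is NA, the same call should work whether the blank cells were read in as "" or as NA.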
Upvotes: 2
Reputation: 3710
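# Small example data frame; empty strings mark missing pathways
df <- data.frame(x = c("a", "b", "c", "d", "e"),
                 path1 = c("test1", "test1", "test2", "test2", "test3"),
                 path2 = c("testa", "", "testg", "testd", ""))
library(reshape2)
df[df == ""] <- NA                        # turn empty cells into NA
melt(df, id.vars = "x", na.rm = TRUE)     # wide to long, dropping the NA rows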
# x variable value
# 1 a path1 test1
# 2 b path1 test1
# 3 c path1 test2
# 4 d path1 test2
# 5 e path1 test3
# 6 a path2 testa
# 8 c path2 testg
# 9 d path2 testd
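For readers on tidyr 1.0.0 or later, where gather() is superseded, a roughly equivalent sketch uses pivot_longer(). This is an alternative to the melt() call above, not part of the original answer; the names `long` and `Pathway` are illustrative choices.

```r
library(tidyr)

df <- data.frame(x = c("a", "b", "c", "d", "e"),
                 path1 = c("test1", "test1", "test2", "test2", "test3"),
                 path2 = c("testa", "", "testg", "testd", ""),
                 stringsAsFactors = FALSE)

df[df == ""] <- NA                      # turn empty cells into NA

long <- pivot_longer(df,
                     cols = -x,                 # every column except the id column
                     names_to = "path",         # old column names go here
                     values_to = "Pathway",     # cell values go here
                     values_drop_na = TRUE)     # skip the NA cells
long$path <- NULL                       # the old column names are not needed

print(long)
```

Unlike melt(), pivot_longer() returns the rows grouped by id (all rows for "a", then "b", and so on) rather than by source column.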
Upvotes: 1