shu251

Reputation: 251

tidyr package in R, using gather() "Invalid column specification"

I'm still learning how to use tidyr. I'd like to use "gather()" to make columns into multiple rows, and preserve the "gene_ID" column by copying it where applicable. Example input data:

    gene_ID    path1    path2    path3    path4    path5    path6    path7    path8
    CAMNT_0043146643    RNA transport
    CAMNT_0029561721    Ribosome
    CAMNT_0024703307    Sphingolipid signaling pathway    Lysosome
    CAMNT_0020981363    mRNA surveillance pathway    Hippo signaling pathway    cAMP signaling pathway    cGMP - PKG signaling pathway    Regulation of actin cytoskeleton    Meiosis - yeast    Oocyte meiosis    Focal adhesion
    CAMNT_0020021387    Spliceosome    Protein processing in endoplasmic reticulum    MAPK signaling pathway    Endocytosis
    CAMNT_0003293445    Spliceosome    Protein processing in endoplasmic reticulum    MAPK signaling pathway    Endocytosis

Example of desired output data:

gene_ID Pathway
CAMNT_0043146643    RNA transport
CAMNT_0029561721    Ribosome
CAMNT_0024703307    Lysosome
CAMNT_0024703307    Sphingolipid signaling pathway
CAMNT_0020981363    mRNA surveillance pathway
CAMNT_0020981363    Hippo signaling pathway
CAMNT_0020981363    cAMP signaling pathway
CAMNT_0020981363    cGMP - PKG signaling pathway
CAMNT_0020981363    Regulation of actin cytoskeleton
CAMNT_0020981363    Meiosis - yeast
CAMNT_0020981363    Oocyte meiosis
CAMNT_0020981363    Focal adhesion
CAMNT_0020021387    Spliceosome
CAMNT_0020021387    Protein processing in endoplasmic reticulum
CAMNT_0020021387    MAPK signaling pathway
CAMNT_0020021387    Endocytosis
CAMNT_0003293445    Spliceosome
CAMNT_0003293445    Protein processing in endoplasmic reticulum
CAMNT_0003293445    MAPK signaling pathway
CAMNT_0003293445    Endocytosis

Currently, I'm trying to do:

temp<-gather(extract,"gene_ID",path1:path8)

but I get the error message "Error: Invalid column specification". I've tried this with and without headers for my input df, but the same error occurs. I'm open to an alternative approach, but I've had trouble with NAs because not every gene_ID row has the same number of filled path columns.

Suggestions on how to proceed?

Upvotes: 1

Views: 3853

Answers (2)

davechilders

Reputation: 9123

Here is a tidyr solution:

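library(tidyr)
library(dplyr)

# df has an id column x plus path1 and path2; blank strings mark missing pathways
df %>%
  gather(path, Pathway, path1, path2) %>%  # new key column "path", new value column "Pathway"
  filter(Pathway != "") %>%                # drop the blank cells
  select(-path)                            # the path1/path2 label is no longer needed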

  x Pathway
1 a   test1
2 b   test1
3 c   test2
4 d   test2
5 e   test3
6 a   testa
7 c   testg
8 d   testd
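
Applied to the question's data, the same idea might look like the sketch below. It assumes the data frame is called `extract` (as in the question) and that unused path cells hold empty strings; the three-row `extract` built here is a cut-down stand-in with only path1 to path3, so the full data would use `path1:path8` instead. The original call most likely failed because the second and third arguments of gather() are the names of the new key and value columns, so `path1:path8` was being read as the value column name rather than as the columns to gather.

```r
library(tidyr)
library(dplyr)

# Cut-down stand-in for the question's data (assumption: blanks are empty strings)
extract <- data.frame(
  gene_ID = c("CAMNT_0043146643", "CAMNT_0024703307", "CAMNT_0020021387"),
  path1 = c("RNA transport", "Sphingolipid signaling pathway", "Spliceosome"),
  path2 = c("", "Lysosome", "Protein processing in endoplasmic reticulum"),
  path3 = c("", "", "MAPK signaling pathway"),
  stringsAsFactors = FALSE
)

long <- extract %>%
  gather(path, Pathway, path1:path3) %>%  # key = "path", value = "Pathway"
  filter(Pathway != "") %>%               # drop the blank cells
  select(-path) %>%                       # discard the path1..path3 label
  arrange(gene_ID)                        # keep each gene's rows together

print(long)
```

Every gene_ID is repeated once per non-blank pathway, which is the layout asked for in the question.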

Upvotes: 2

Ven Yao

Reputation: 3710

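# Toy data: an id column plus two path columns, with blanks for missing pathways
df <- data.frame(x = c("a", "b", "c", "d", "e"),
                 path1 = c("test1", "test1", "test2", "test2", "test3"),
                 path2 = c("testa", "", "testg", "testd", ""))
library(reshape2)
df[df == ""] <- NA                      # turn blank cells into NA
melt(df, id.vars = "x", na.rm = TRUE)   # reshape to long format, dropping the NA rows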
#   x variable value
# 1 a    path1 test1
# 2 b    path1 test1
# 3 c    path1 test2
# 4 d    path1 test2
# 5 e    path1 test3
# 6 a    path2 testa
# 8 c    path2 testg
# 9 d    path2 testd
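
For readers on tidyr 1.0.0 or later, gather() is superseded by pivot_longer(), which is a different function from the ones used in the answers here. A minimal sketch on the same toy `df`, assuming blank cells are first converted to NA; the column names `path` and `Pathway` are chosen only for illustration:

```r
library(tidyr)

df <- data.frame(x = c("a", "b", "c", "d", "e"),
                 path1 = c("test1", "test1", "test2", "test2", "test3"),
                 path2 = c("testa", "", "testg", "testd", ""),
                 stringsAsFactors = FALSE)

df[df == ""] <- NA  # mark blank cells as missing

long <- pivot_longer(df,
                     cols = path1:path2,      # columns to stack
                     names_to = "path",       # holds "path1" / "path2"
                     values_to = "Pathway",   # holds the pathway names
                     values_drop_na = TRUE)   # skip the missing cells
long$path <- NULL  # the source column label is not needed

print(long)
```

Unlike gather(), pivot_longer() keeps each id's rows together in the result, so no extra sorting step is needed to group them.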

Upvotes: 1
