Oars

Reputation: 21

R: Matching Common Rows (ID) in two dataframes

I'm trying to find common gene IDs in two data frames. Both have the same unique identifier in the first column (column A). Ideally I'd create a new data frame that retains the identifier and simply places the gene expression data from both in columns. Below is a sample of my data (the columns of interest are col 1, which is the identifier, and cols 4:9, which I'll need to compare):

RefSeq. ID       C1      C2      C3      C4      C5      C6      
NP_000005   8.57345 8.45938 8.68941 8.35913 8.48177 8.44560 
NP_000010   8.32595 8.19273 8.10708 8.48156 7.99014 8.24859 

What I'd like to perform is a match on the RefSeq. ID column, pairing up rows that share the same unique identifier. I'd then be comparing C1-C6 across both data frames.

I was able to at least view the matches with the following line of code:

> x008[, 1] %in% x007[, 1]

But that just returned a vector of TRUE/FALSE values, one per row. Then I tried the following two lines of code, but neither worked:

> mydata <- merge(x008, x007, by=c("RefSeq. ID"))
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

and

> match(x008$RefSeq. ID, x007$RefSeq. ID)
Error: unexpected symbol in "match(x008$RefSeq. ID"

Upvotes: 0

Views: 1259

Answers (1)

Maurits Evers

Reputation: 50738

I can't quite reproduce your issue. The following works:

merge(df1, df2, by = "RefSeq. ID")
#  RefSeq. ID UniProt.x  Protein.Name.x    C1.x    C2.x    C3.x UniProt.y
#1  NP_000005    P01023 Alpha-2-macrogl 8.57345 8.45938 8.68941    P01023
#2  NP_000021    P21549  Serine--pyruva 9.67506 9.04974 8.92981    P21549
# Protein.Name.y     C1.y     C2.y     C3.y
#1 Alpha-2-macrogl 18.57345 18.45938 18.68941
#2  Serine--pyruva 19.67506 19.04974 18.92981

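"RefSeq. ID" must match exactly one column name in each of your data.frames. Check names(x008) and names(x007): with the default check.names = TRUE, read.table() and read.csv() turn the header "RefSeq. ID" into "RefSeq..ID", which would explain the fix.by error.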
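
A minimal sketch of how to check for this, assuming the data were imported with read.table() or read.csv() and default settings. The small text table here is a hypothetical stand-in for x008, since the actual file isn't shown:

```r
# Hypothetical stand-in for x008; the asker's real file is not shown.
txt <- "'RefSeq. ID' C1 C2
NP_000005 8.57345 8.45938
NP_000010 8.32595 8.19273"

# Default import: check.names = TRUE rewrites non-syntactic headers.
x008 <- read.table(text = txt, header = TRUE)
names(x008)[1]                   # "RefSeq..ID"
"RefSeq. ID" %in% names(x008)    # FALSE, so by = "RefSeq. ID" cannot match

# Fix 1: keep the headers exactly as written in the file.
x008_raw <- read.table(text = txt, header = TRUE, check.names = FALSE)
names(x008_raw)[1]               # "RefSeq. ID"

# Fix 2: rename the column after import.
names(x008)[1] <- "RefSeq. ID"
```

Either fix should let merge(x008, x007, by = "RefSeq. ID") find the column, provided the same is done for x007.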


Sample data

df1 <- read.table(text =
    "'RefSeq. ID'  UniProt 'Protein Name'    C1      C2      C3
NP_000005   P01023  Alpha-2-macrogl 8.57345 8.45938 8.68941
NP_000010   P24752  Acetyl-CoA      8.32595 8.19273 8.10708
NP_000021   P21549  Serine--pyruva  9.67506 9.04974 8.92981", header = TRUE)
# read.table() sanitises the header to "RefSeq..ID", so restore the original name
names(df1)[1] <- "RefSeq. ID"

df2 <- read.table(text =
    "'RefSeq. ID'  UniProt 'Protein Name'    C1      C2      C3
NP_000005   P01023  Alpha-2-macrogl 18.57345 18.45938 18.68941
NP_000021   P21549  Serine--pyruva  19.67506 19.04974 18.92981", header = TRUE)
# Same renaming as for df1, so both data frames share the key column name
names(df2)[1] <- "RefSeq. ID"
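
As for the match() attempt in the question: a column name containing a space can't follow $ unquoted, which is what triggers the "unexpected symbol" error. Below is a sketch of the %in% and match() approaches using [[ ]] (or backticks) instead. The two small data frames are hypothetical miniatures of x008 and x007, not your real data:

```r
# Hypothetical miniature versions of the two data frames.
x008 <- data.frame("RefSeq. ID" = c("NP_000005", "NP_000010", "NP_000021"),
                   C1 = c(8.57345, 8.32595, 9.67506),
                   check.names = FALSE, stringsAsFactors = FALSE)
x007 <- data.frame("RefSeq. ID" = c("NP_000005", "NP_000021"),
                   C1 = c(18.57345, 19.67506),
                   check.names = FALSE, stringsAsFactors = FALSE)

# [[ ]] accepts any column name; x008$`RefSeq. ID` is equivalent.
ids_008 <- x008[["RefSeq. ID"]]
ids_007 <- x007[["RefSeq. ID"]]

# Use the logical vector from %in% to keep only the rows of x008
# whose ID also occurs in x007.
common <- x008[ids_008 %in% ids_007, ]

# match() gives, for each ID in x008, its row position in x007 (NA if absent).
idx <- match(ids_008, ids_007)   # 1 NA 2
```

merge() is usually the more convenient route, since it lines up the rows and combines the columns in one step, but the logical vector from %in% is handy when you only want to filter one data frame.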

Upvotes: 2
