Reputation: 21
I'm trying to find the gene IDs common to two data frames. Both use the same unique identifier for each row (column A). Ideally I'd create a new data frame that retains the identifier and places the gene expression data from both sources in columns. Below is a sample of my data (the columns of interest are col 1, which is the identifier, and cols 4:9, which I need to compare):
RefSeq. ID C1 C2 C3 C4 C5 C6
NP_000005 8.57345 8.45938 8.68941 8.35913 8.48177 8.44560
NP_000010 8.32595 8.19273 8.10708 8.48156 7.99014 8.24859
What I'd like to perform is a match on the RefSeq. ID column, pairing up rows that share the same identifier. I'd then compare C1-C6 across both data frames.
I was able to at least view the matches with the following line of code:
> x008[, 1] %in% x007[, 1]
But that just returned a vector of TRUE/FALSE values, one per row of x008. I then tried the following two lines of code, but neither worked:
> mydata <- merge(x008, x007, by=c("RefSeq. ID"))
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
and
> match(x008$RefSeq. ID, x007$RefSeq. ID)
Error: unexpected symbol in "match(x008$RefSeq. ID"
Upvotes: 0
Views: 1259
Reputation: 50738
I can't quite reproduce your issue. The following works with the sample data given at the end of this answer:
merge(df1, df2, by = "RefSeq. ID")
# RefSeq. ID UniProt.x Protein.Name.x C1.x C2.x C3.x UniProt.y
#1 NP_000005 P01023 Alpha-2-macrogl 8.57345 8.45938 8.68941 P01023
#2 NP_000021 P21549 Serine--pyruva 9.67506 9.04974 8.92981 P21549
# Protein.Name.y C1.y C2.y C3.y
#1 Alpha-2-macrogl 18.57345 18.45938 18.68941
#2 Serine--pyruva 19.67506 19.04974 18.92981
"RefSeq. ID" must be a unique column in both of your data.frames.
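One likely cause of the 'by' error, offered here as an assumption since the real x008 and x007 aren't shown: when a file is read with the default check.names = TRUE, R rewrites the header "RefSeq. ID" to the syntactic name "RefSeq..ID", so a column literally called "RefSeq. ID" no longer exists. The sketch below uses small hypothetical stand-ins for x008 and x007, inspects the real column names, and merges on the first column of each data frame without hard-coding its name.

```r
# Hypothetical stand-ins for the asker's x008 and x007 (not the real data)
x008 <- read.table(text = "'RefSeq. ID' C1 C2
NP_000005 8.57345 8.45938
NP_000010 8.32595 8.19273", header = TRUE)

x007 <- read.table(text = "'RefSeq. ID' C1 C2
NP_000005 18.57345 18.45938
NP_000021 19.67506 19.04974", header = TRUE)

# Check what the identifier column is actually called after import
names(x008)

# Merge on the first column of each data frame, whatever its name is;
# suffixes label which source each expression column came from
merged <- merge(x008, x007,
                by.x = names(x008)[1], by.y = names(x007)[1],
                suffixes = c(".x008", ".x007"))
merged
```

If names() shows something other than what you passed to by, either use the name it reports or re-read the file with check.names = FALSE to keep the original header.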
# Sample data
df1 <- read.table(text =
"'RefSeq. ID' UniProt 'Protein Name' C1 C2 C3
NP_000005 P01023 Alpha-2-macrogl 8.57345 8.45938 8.68941
NP_000010 P24752 Acetyl-CoA 8.32595 8.19273 8.10708
NP_000021 P21549 Serine--pyruva 9.67506 9.04974 8.92981", header = TRUE)
# read.table() sanitizes the header to "RefSeq..ID"; restore the original name
names(df1)[1] <- "RefSeq. ID"
df2 <- read.table(text =
"'RefSeq. ID' UniProt 'Protein Name' C1 C2 C3
NP_000005 P01023 Alpha-2-macrogl 18.57345 18.45938 18.68941
NP_000021 P21549 Serine--pyruva 19.67506 19.04974 18.92981", header = TRUE)
names(df2)[1] <- "RefSeq. ID"
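As for the other two attempts in the question: the "unexpected symbol" error comes from the space in the column name, since a name that isn't syntactically valid has to be wrapped in backticks after $. And the TRUE/FALSE vector from %in% can be used directly as a row index. A minimal sketch, again with hypothetical stand-ins for x008 and x007 and assuming the column really is named "RefSeq. ID":

```r
# Hypothetical stand-ins; check.names = FALSE keeps the space in the name
x008 <- data.frame("RefSeq. ID" = c("NP_000005", "NP_000010", "NP_000021"),
                   C1 = c(8.57345, 8.32595, 9.67506),
                   check.names = FALSE, stringsAsFactors = FALSE)
x007 <- data.frame("RefSeq. ID" = c("NP_000005", "NP_000021"),
                   C1 = c(18.57345, 19.67506),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Backticks let $ refer to a column whose name contains a space
idx <- match(x008$`RefSeq. ID`, x007$`RefSeq. ID`)
idx  # position of each x008 ID in x007, NA where there is no match

# Use the logical vector from %in% to keep only the rows with a shared ID
common <- x008[x008$`RefSeq. ID` %in% x007$`RefSeq. ID`, ]
common
```

x008[["RefSeq. ID"]] is an equivalent way to get at the column without backticks.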
Upvotes: 2