Reputation: 135
I have two data frames, one of which has a large list of two identifiers:
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
And one which contains one of the two identifiers:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
What I want is to replace the ID in the second dataframe with the corresponding second ID from the first dataframe (I also couldn't think of a succinct way of phrasing that for the title of this question hence why it's phrased so awkwardly and why I might not have been able to find duplicates). I.e.:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 rs6682375 0 14907 G A
My solution at present is to use a for ... if
loop as follows:
for (x in 1:nrow(df2)){
if (df2$V2[x] %in% df1$uniq_id){
df2$V2[x] = df1$rsid[x]
}
}
However, because both files are extremely large, I believe that this is likely a very inefficient way of doing this and am wondering if there is a faster method.
Someone suggested that using the match()
function might be quicker, but given that the R documentation for this suggests that %in%
is actually more intuitive and my inexperience with it, I'm not sure how to apply it in a different way.
Any help appreciated.
Upvotes: 1
Views: 53
Reputation: 2011
A method using base r. It would also be easy to perform this using dplyr
and it's left_join
function if desired.
df <- data.table::fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- data.table::fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
df2 <- merge(df2,df,by.x = c("V2"),by.y = c("uniq_id"),all.x = TRUE)
df2$V2 <- ifelse(!is.na(df2$rsid),df2$rsid,df2$V2)
df2$rsid <- NULL
df2
# V2 V1 V3 V4 V5 V6
# 1: 1_10439_A_AC 1 0 10439 A AC
# 2: 1_13417_CGAGA_C 1 0 13417 C CGAGA
# 3: rs6682375 1 0 14907 G A
Upvotes: 1
Reputation: 28675
This is an update-join, in data.table terminology. Assuming the first table is called df
and the second is called df2
:
library(data.table)
setDT(df)
setDT(df2)
df2[df, on = .(V2 = uniq_id), V2 := rsid]
df2
# V1 V2 V3 V4 V5 V6
# 1: 1 1_10439_A_AC 0 10439 A AC
# 2: 1 1_13417_CGAGA_C 0 13417 C CGAGA
# 3: 1 rs6682375 0 14907 G A
Data used
df <- fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
Upvotes: 2