Sabor117
Sabor117

Reputation: 135

Fast method in R of obtaining a value from one column based on searched value in another?

I have two data frames, one of which has a large list of two identifiers:

rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A

And one which contains one of the two identifiers:

V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    1_14907_G_A  0 14907  G     A

What I want is to replace the ID in the second dataframe with the corresponding second ID from the first dataframe (I also couldn't think of a succinct way of phrasing that for the title of this question hence why it's phrased so awkwardly and why I might not have been able to find duplicates). I.e.:

V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    rs6682375  0 14907  G     A

My solution at present is to use a for ... if loop as follows:

for (x in 1:nrow(df2)){

    if (df2$V2[x] %in% df1$uniq_id){

        df2$V2[x] = df1$rsid[x]
    }
}

However, because both files are extremely large, I believe that this is likely a very inefficient way of doing this and am wondering if there is a faster method.

Someone suggested that using the match() function might be quicker, but given that the R documentation for this suggests that %in% is actually more intuitive and my inexperience with it, I'm not sure how to apply it in a different way.

Any help appreciated.

Upvotes: 1

Views: 53

Answers (2)

Adam Sampson
Adam Sampson

Reputation: 2011

A method using base r. It would also be easy to perform this using dplyr and it's left_join function if desired.

df <- data.table::fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- data.table::fread('
V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    1_14907_G_A  0 14907  G     A
')

df2 <- merge(df2,df,by.x = c("V2"),by.y = c("uniq_id"),all.x = TRUE)
df2$V2 <- ifelse(!is.na(df2$rsid),df2$rsid,df2$V2)
df2$rsid <- NULL

df2

#                 V2 V1 V3    V4 V5    V6
# 1:    1_10439_A_AC  1  0 10439  A    AC
# 2: 1_13417_CGAGA_C  1  0 13417  C CGAGA
# 3:       rs6682375  1  0 14907  G     A

Upvotes: 1

IceCreamToucan
IceCreamToucan

Reputation: 28675

This is an update-join, in data.table terminology. Assuming the first table is called df and the second is called df2:

library(data.table)
setDT(df)
setDT(df2)

df2[df, on = .(V2 = uniq_id), V2 := rsid]

df2
#    V1              V2 V3    V4 V5    V6
# 1:  1    1_10439_A_AC  0 10439  A    AC
# 2:  1 1_13417_CGAGA_C  0 13417  C CGAGA
# 3:  1       rs6682375  0 14907  G     A

Data used

df <- fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- fread('
V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    1_14907_G_A  0 14907  G     A
')

Upvotes: 2

Related Questions