D. Fowler
D. Fowler

Reputation: 635

merging and filling the NA values of another column based on another dataframe

I have 2 dfs, a subset of which looks like this. Where available, I want the "NA" values to be replaced by the rsid values in the other df.

df1:
SNP          A1  A2   rsid
1:100000012   A   G   rs1234
1:1000066     T   C   <NA>
1:2032101     C   T   rs5678
df2:
SNP           A1  A2   rsid
2:107877      A   G   rs1112023
3:1000066     T   C   rs8213723
1:1000066     T   C   rs7778899

This is what I want where the NA is replaced by the rsid values of the other df. In this example, the rsid of row 3 for df2 replaces the NA value of the rsid for row 2 for df1. I only want the new df to include rows in df1, like so.

df3
SNP          A1  A2   rsid
1:100000012   A   G   rs1234
1:1000066     T   C   rs7778899
1:2032101     C   T   rs5678

I tried this, but am getting some error messages. Can someone help?

library(dplyr)
bind_rows(df1, df2) %>% 
   group_by(SNP, A1, A2) %>%
   summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
Error: Column `rsid` must be length 1 (a summary value), not 2
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector

Upvotes: 1

Views: 79

Answers (1)

akrun
akrun

Reputation: 887911

We can bind the datasets together with bind_rows and then do a group by summarise while removing the NA with complete.cases (dplyr version >= 1.0)

library(dplyr)
bind_rows(df1, df2) %>% 
   group_by(SNP, A1, A2) %>%
   summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')

-output

# A tibble: 5 x 4
#  SNP         A1    A2    rsid     
#  <chr>       <chr> <chr> <chr>    
#1 1:100000012 A     G     rs1234   
#2 1:1000066   T     C     rs7778899
#3 1:2032101   C     T     rs5678   
#4 2:107877    A     G     rs1112023
#5 3:1000066   T     C     rs8213723

If the version of dplyr is < 1.0, summarise expects the output to be of length 1 per group. We can wrap it in a list and then unnest

bind_rows(df1, df2) %>% 
   group_by(SNP, A1, A2) %>%
   summarise(rsid = list(rsid[complete.cases(rsid)])) %>%
   ungroup %>%
   unnest(c(rsid))

Update

Based on the updated post, if we need to update the column 'rsid' based on the second data, an option is to do a join and then assign (:=) after coalescing the 'rsid' columns

library(data.table)
setDT(df1)[df2, rsid := fcoalesce(rsid, i.rsid), on = .(SNP, A1, A2)]

-output

df1
#           SNP A1 A2      rsid
#1: 1:100000012  A  G    rs1234
#2:   1:1000066  T  C rs7778899
#3:   1:2032101  C  T    rs5678

A similar option is also possible with dplyr

left_join(df1, df2, by = c('SNP', 'A1', 'A2')) %>%       
  transmute(SNP, A1, A2, rsid = coalesce(rsid.x, rsid))
 

data

df1 <- structure(list(SNP = c("1:100000012", "1:1000066", "1:2032101"
), A1 = c("A", "T", "C"), A2 = c("G", "C", "T"), rsid = c("rs1234", 
NA, "rs5678")), class = "data.frame", row.names = c(NA, -3L))

df2 <- structure(list(SNP = c("2:107877", "3:1000066", "1:1000066"), 
    A1 = c("A", "T", "T"), A2 = c("G", "C", "C"), rsid = c("rs1112023", 
    "rs8213723", "rs7778899")), class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 1

Related Questions