Reputation: 635
I have 2 dfs, a subset of which looks like this. Where available, I want the "NA" values to be replaced by the rsid values in the other df.
df1:
SNP A1 A2 rsid
1:100000012 A G rs1234
1:1000066 T C <NA>
1:2032101 C T rs5678
df2:
SNP A1 A2 rsid
2:107877 A G rs1112023
3:1000066 T C rs8213723
1:1000066 T C rs7778899
This is what I want where the NA is replaced by the rsid values of the other df. In this example, the rsid of row 3 for df2 replaces the NA value of the rsid for row 2 for df1. I only want the new df to include rows in df1, like so.
df3
SNP A1 A2 rsid
1:100000012 A G rs1234
1:1000066 T C rs7778899
1:2032101 C T rs5678
I tried this, but am getting some error messages. Can someone help?
library(dplyr)
bind_rows(df1, df2) %>%
group_by(SNP, A1, A2) %>%
summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
Error: Column `rsid` must be length 1 (a summary value), not 2
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
Upvotes: 1
Views: 79
Reputation: 887911
We can bind the datasets together with bind_rows
and then do a group by summarise while removing the NA
with complete.cases
(dplyr
version >= 1.0
)
library(dplyr)
bind_rows(df1, df2) %>%
group_by(SNP, A1, A2) %>%
summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
-output
# A tibble: 5 x 4
# SNP A1 A2 rsid
# <chr> <chr> <chr> <chr>
#1 1:100000012 A G rs1234
#2 1:1000066 T C rs7778899
#3 1:2032101 C T rs5678
#4 2:107877 A G rs1112023
#5 3:1000066 T C rs8213723
If the version of dplyr
is < 1.0
, summarise
expects the output to be of length
1 per group. We can wrap it in a list
and then unnest
bind_rows(df1, df2) %>%
group_by(SNP, A1, A2) %>%
summarise(rsid = list(rsid[complete.cases(rsid)])) %>%
ungroup %>%
unnest(c(rsid))
Based on the updated post, if we need to update the column 'rsid' based on the second data, an option is to do a join and then assign (:=
) after coalescing the 'rsid' columns
library(data.table)
setDT(df1)[df2, rsid := fcoalesce(rsid, i.rsid), on = .(SNP, A1, A2)]
-output
df1
# SNP A1 A2 rsid
#1: 1:100000012 A G rs1234
#2: 1:1000066 T C rs7778899
#3: 1:2032101 C T rs5678
A similar option is also possible with dplyr
left_join(df1, df2, by = c('SNP', 'A1', 'A2')) %>%
transmute(SNP, A1, A2, rsid = coalesce(rsid.x, rsid))
df1 <- structure(list(SNP = c("1:100000012", "1:1000066", "1:2032101"
), A1 = c("A", "T", "C"), A2 = c("G", "C", "T"), rsid = c("rs1234",
NA, "rs5678")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(SNP = c("2:107877", "3:1000066", "1:1000066"),
A1 = c("A", "T", "T"), A2 = c("G", "C", "C"), rsid = c("rs1112023",
"rs8213723", "rs7778899")), class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 1