Reputation: 6874
I have a dataframe as follows:
TIMEdbMerge CopyNumber Study Sample HRE
TC015II NA TC015 II neg
TC015III 0 NA NA NA
TC015III NA TC015 III neg
TC015Quadrantic NA TC015 Quadrantic 24
TC016I NA TC016 I NA
TC016II 1 NA NA NA
TC016II NA TC016 II neg
TC016Quadrantic NA TC016 Quadrantic 6
TC017I NA TC017 I NA
TC017II 3 NA NA NA
TC017II NA TC017 II +
It is from a complex merge that I don't quite have the time to figure out. As a workaround I just want to merge the duplicate rows so that the actual values in the rows replace the NAs for the pair of duplicates so that the result should look like:
TIMEdbMerge CopyNumber Study Sample HRE
TC015II NA TC015 II neg
TC015III 0 TC015 III neg
TC015 NA TC015 Q 24
TC016I NA TC016 I NA
TC016II 1 TC016 II neg
TC016Quadrantic NA TC016 Quadrantic 6
TC017I NA TC017 I NA
TC017II 3 TC017 II +
I know how to remove duplicate rows but I dont know how to tell r to combine a duplicate row but only use the value when it is not NA from either row of the duplicate. Should I use aggregate?
Upvotes: 2
Views: 754
Reputation: 886948
We can na.locf
to fill the NA elements by the non-NA elements of 'CopyNumber' within each group ('TIMEdbMerge') by ave
. Then remove the rows that have all the elements as NA
for 'Study', 'Sample', 'HRE' columns
library(zoo)
df1$CopyNumber <- with(df1, ave(CopyNumber, TIMEdbMerge,
FUN=function(x) na.locf(x, na.rm=FALSE)))
df1[rowSums(is.na(df1[3:5]))!=3,]
# TIMEdbMerge CopyNumber Study Sample HRE
#1 TC015II NA TC015 II neg
#3 TC015III 0 TC015 III neg
#4 TC015Quadrantic NA TC015 Quadrantic 24
#5 TC016I NA TC016 I <NA>
#7 TC016II 1 TC016 II neg
#8 TC016Quadrantic NA TC016 Quadrantic 6
#9 TC017I NA TC017 I <NA>
#11 TC017II 3 TC017 II +
Or use left_join
(or merge
from base R
) of original dataset with a subset of dataset that includes only the non-NA rows of 'CopyNumber', and then filter
out the rows that are NAs for the 3 columns as mentioned above.
library(dplyr)
left_join(df1, filter(df1, !is.na(CopyNumber)) %>%
select(1:2),
by='TIMEdbMerge') %>%
select(-2) %>%
filter(rowSums(is.na(.[2:4]))!=3)
df1 <- structure(list(TIMEdbMerge = c("TC015II", "TC015III",
"TC015III",
"TC015Quadrantic", "TC016I", "TC016II", "TC016II", "TC016Quadrantic",
"TC017I", "TC017II", "TC017II"), CopyNumber = c(NA, 0L, NA, NA,
NA, 1L, NA, NA, NA, 3L, NA), Study = c("TC015", NA, "TC015",
"TC015", "TC016", NA, "TC016", "TC016", "TC017", NA, "TC017"),
Sample = c("II", NA, "III", "Quadrantic", "I", NA, "II",
"Quadrantic", "I", NA, "II"), HRE = c("neg", NA, "neg", "24",
NA, NA, "neg", "6", NA, NA, "+")), .Names = c("TIMEdbMerge",
"CopyNumber", "Study", "Sample", "HRE"), class = "data.frame",
row.names = c(NA, -11L))
Upvotes: 3