Sebastian Zeki
Sebastian Zeki

Reputation: 6874

Combine rows so that values are combined and NAs ignored

I have a dataframe as follows:

TIMEdbMerge CopyNumber  Study     Sample       HRE
TC015II         NA     TC015        II        neg       
TC015III        0      NA           NA        NA    
TC015III        NA     TC015        III       neg   
TC015Quadrantic NA     TC015    Quadrantic    24    
TC016I          NA     TC016         I        NA    
TC016II         1      NA           NA        NA      
TC016II         NA     TC016        II        neg   
TC016Quadrantic NA     TC016    Quadrantic    6 
TC017I          NA     TC017        I         NA    
TC017II          3      NA          NA        NA
TC017II         NA     TC017        II         +

It is from a complex merge that I don't quite have the time to figure out. As a workaround I just want to merge the duplicate rows so that the actual values in the rows replace the NAs for the pair of duplicates so that the result should look like:

TIMEdbMerge CopyNumber  Study     Sample       HRE
TC015II         NA     TC015        II        neg           
TC015III        0      TC015        III       neg   
TC015           NA     TC015         Q        24    
TC016I          NA     TC016         I        NA    
TC016II         1      TC016        II        neg   
TC016Quadrantic NA     TC016    Quadrantic    6 
TC017I          NA     TC017        I         NA    
TC017II         3      TC017        II         +

I know how to remove duplicate rows but I dont know how to tell r to combine a duplicate row but only use the value when it is not NA from either row of the duplicate. Should I use aggregate?

Upvotes: 2

Views: 754

Answers (1)

akrun
akrun

Reputation: 886948

We can na.locf to fill the NA elements by the non-NA elements of 'CopyNumber' within each group ('TIMEdbMerge') by ave. Then remove the rows that have all the elements as NA for 'Study', 'Sample', 'HRE' columns

library(zoo)
df1$CopyNumber <- with(df1, ave(CopyNumber, TIMEdbMerge,
     FUN=function(x) na.locf(x, na.rm=FALSE)))
df1[rowSums(is.na(df1[3:5]))!=3,]
#       TIMEdbMerge CopyNumber Study     Sample  HRE
#1          TC015II         NA TC015         II  neg
#3         TC015III          0 TC015        III  neg
#4  TC015Quadrantic         NA TC015 Quadrantic   24
#5           TC016I         NA TC016          I <NA>
#7          TC016II          1 TC016         II  neg
#8  TC016Quadrantic         NA TC016 Quadrantic    6
#9           TC017I         NA TC017          I <NA>
#11         TC017II          3 TC017         II    +

Or use left_join (or merge from base R) of original dataset with a subset of dataset that includes only the non-NA rows of 'CopyNumber', and then filter out the rows that are NAs for the 3 columns as mentioned above.

library(dplyr)
 left_join(df1, filter(df1, !is.na(CopyNumber)) %>%
                       select(1:2), 
                 by='TIMEdbMerge') %>% 
                 select(-2) %>% 
                 filter(rowSums(is.na(.[2:4]))!=3)

data

df1 <- structure(list(TIMEdbMerge = c("TC015II", "TC015III", 
"TC015III", 
"TC015Quadrantic", "TC016I", "TC016II", "TC016II", "TC016Quadrantic", 
"TC017I", "TC017II", "TC017II"), CopyNumber = c(NA, 0L, NA, NA, 
NA, 1L, NA, NA, NA, 3L, NA), Study = c("TC015", NA, "TC015", 
"TC015", "TC016", NA, "TC016", "TC016", "TC017", NA, "TC017"), 
Sample = c("II", NA, "III", "Quadrantic", "I", NA, "II", 
"Quadrantic", "I", NA, "II"), HRE = c("neg", NA, "neg", "24", 
NA, NA, "neg", "6", NA, NA, "+")), .Names = c("TIMEdbMerge", 
"CopyNumber", "Study", "Sample", "HRE"), class = "data.frame", 
row.names = c(NA, -11L))

Upvotes: 3

Related Questions