tumultous_rooster
tumultous_rooster

Reputation: 12550

Collapsing rows where some are all NA, others are disjoint with some NAs

I have a simple dataframe as such:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

And I would like to reformat it as such:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

STEP ONE: get rid of rows that have only NA values.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.

STEP TWO: collapse the remaining two rows into one.

Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.

Thanks for any advice.

Upvotes: 17

Views: 9881

Answers (5)

tmfmnk
tmfmnk

Reputation: 39858

Since dplyr 1.0.0, you can also do (using the data provided by @Khashaa):

df %>% 
 group_by(ID) %>%
 summarize(across(everything(), ~ first(na.omit(.))))

     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

Upvotes: 8

Khashaa
Khashaa

Reputation: 7373

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE))) 

Edit: To account for the case in which one column has all NAs for a certain ID, we need sum_NA() function which returns NA if all are NAs

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

Upvotes: 12

Rich Scriven
Rich Scriven

Reputation: 99331

Here's a data table approach that uses na.omit() across the columns, grouped by ID.

library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40

Upvotes: 18

thelatemail
thelatemail

Reputation: 93813

Here's a couple of aggregate attempts:

aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.

Alternatively, don't use the formula interface and specify the columns to aggregate manually:

aggregate(dat[-1], dat[1], FUN=na.omit )
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Upvotes: 9

Jthorpe
Jthorpe

Reputation: 10167

the simple way is:

as.data.frame(lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)[!is.na(x)]))

but if not all columns have the same number of non-NA values then you'll need to trim them like so:

temp  <-  lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)x[!is.na(x)])
len  <-  min(sapply(temp,length))
as.data.frame(lapply(temp,`[`,seq(len)))

Upvotes: 1

Related Questions