griffmer
griffmer

Reputation: 377

Summarize data frame to return non-NA values along subsets

Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.

I need to remove NAs from a data frame along data subsets and compress the remaining NA values into rows for each data subset.

Example:

#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values

dat <- as.data.frame(cbind(a, b, c, d)) 

> desired output
  a b c    d
  1 B A    1
  2 C A <NA>

Rules of thumb: 1) Need to remove NA values from each column 2) Loop along data subsets (column "a" in example above) 3) All columns, for each subset, have a max of 1 non-NA value, but some columns may have all NA values

Ideas:

Any help is appreciated to put the final pieces together! Thank you!

Upvotes: 4

Views: 1437

Answers (3)

sbha
sbha

Reputation: 10422

Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:

library(dplyr)
library(tidyr)

dat %>% 
  filter_at(vars(b:c), any_vars(!is.na(.))) %>% 
  group_by(a) %>% 
  fill(b) %>% 
  fill(c) %>% 
  filter_at(vars(b:c), all_vars(!is.na(.)))

# A tibble: 2 x 4
# Groups:   a [2]
       a      b      c      d
  <fctr> <fctr> <fctr> <fctr>
1      1      B      A      1
2      2      C      A     NA

You could also use just dplyr:

dat %>%
  group_by(a) %>%
  summarise_each(funs(first(.[!is.na(.)])))  

Upvotes: 1

Frostic
Frostic

Reputation: 680

Solution with data.table and na.omit

library(data.table)
merge(setDT(dat)[,a[1],keyby=a], setDT(dat)[,na.omit(.SD),keyby=a],all.x=TRUE)

I think the merge statement can be improved

Upvotes: 1

MKR
MKR

Reputation: 20095

One solution could be achieved using dplyr::summarise_all. The data needs to be group_by on a.

library(dplyr)

dat %>%
  group_by(a) %>%
  summarise_all(funs(.[which.min(is.na(.))]))
# # A tibble: 2 x 4
#    a      b      c      d     
#   <fctr> <fctr> <fctr> <fctr>
# 1   1      B      A      1     
# 2   2      C      A      <NA>  

Upvotes: 3

Related Questions