Reputation: 377

Summarize data frame to return non-NA values along subsets

Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.

I need to remove NAs from a data frame along data subsets and compress the remaining NA values into rows for each data subset.

Example:

#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values

dat <- as.data.frame(cbind(a, b, c, d)) 

> desired output
  a b c    d
  1 B A    1
  2 C A <NA>

Rules of thumb: 1) Need to remove NA values from each column 2) Loop along data subsets (column "a" in example above) 3) All columns, for each subset, have a max of 1 non-NA value, but some columns may have all NA values

Ideas:

lapply or dplyr is probably helpful to loop along all columns
na.omit is likely helpful, if the subsetting column that has entries for all rows can be ignored (something like as.data.frame(lapply(dat.admin, na.omit))). issue in returning lapply output to data frame if some subsets don't return any non-NA values
x[which.min(is.na(x))] effectively accomplishes this if laboriously applied to each individual column

Any help is appreciated to put the final pieces together! Thank you!

Upvotes: 4

Answers (3)

sbha

Reputation: 10422

Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:

library(dplyr)
library(tidyr)

dat %>% 
  filter_at(vars(b:c), any_vars(!is.na(.))) %>% 
  group_by(a) %>% 
  fill(b) %>% 
  fill(c) %>% 
  filter_at(vars(b:c), all_vars(!is.na(.)))

# A tibble: 2 x 4
# Groups:   a [2]
       a      b      c      d
  <fctr> <fctr> <fctr> <fctr>
1      1      B      A      1
2      2      C      A     NA

You could also use just dplyr:

dat %>%
  group_by(a) %>%
  summarise_each(funs(first(.[!is.na(.)])))

Upvotes: 1

Frostic

Reputation: 680

Solution with data.table and na.omit

library(data.table)
merge(setDT(dat)[,a[1],keyby=a], setDT(dat)[,na.omit(.SD),keyby=a],all.x=TRUE)

I think the merge statement can be improved

Upvotes: 1

MKR

Reputation: 20095

One solution could be achieved using dplyr::summarise_all. The data needs to be group_by on a.

library(dplyr)

dat %>%
  group_by(a) %>%
  summarise_all(funs(.[which.min(is.na(.))]))
# # A tibble: 2 x 4
#    a      b      c      d     
#   <fctr> <fctr> <fctr> <fctr>
# 1   1      B      A      1     
# 2   2      C      A      <NA>

Upvotes: 3

Summarize data frame to return non-NA values along subsets

Answers (3)

Related Questions