Reputation: 12550
I have a simple dataframe as such:
ID Col1 Col2 Col3 Col4
 1   NA   NA   NA   NA
 1    5   10   NA   NA
 1   NA   NA   15   20
 2   NA   NA   NA   NA
 2   25   30   NA   NA
 2   NA   NA   35   40
And I would like to reformat it as such:
ID Col1 Col2 Col3 Col4
 1    5   10   15   20
 2   25   30   35   40
(Please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes, there are exactly 3 rows for each ID.)
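For reference, the example data frame above can be rebuilt with something like this (DF is the name assumed by the code below):
DF <- read.table(text = "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 25 30 NA NA
2 NA NA 35 40", header = TRUE)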
STEP ONE: get rid of rows that have only NA values.
On the surface this looked simple, but I've run across some problems. complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF), ]. This is because all rows contain at least one NA.
Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.
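To illustrate on the example data (a quick sketch, with DF as defined above):
# Every row contains at least one NA, so complete.cases() flags nothing
complete.cases(DF)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE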
STEP TWO: collapse the remaining two rows into one.
Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.
Thanks for any advice.
Upvotes: 17
Views: 9881
Reputation: 39858
Since dplyr 1.0.0, you can also do (using the data provided by @Khashaa):
library(dplyr)

df %>%
  group_by(ID) %>%
  summarize(across(everything(), ~ first(na.omit(.))))
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
Upvotes: 8
Reputation: 7373
Try
library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))
Edit: To account for the case in which one column has all NAs for a certain ID, we need a sum_NA() function that returns NA if all values are NA.
txt <- "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)
# original code
DF %>%
  group_by(ID) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40
# Returns NA (of the right type) when all values are NA, otherwise the na.rm sum
sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}
DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
Upvotes: 12
Reputation: 99331
Here's a data.table approach that uses na.omit() across the columns, grouped by ID.
library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40
Upvotes: 18
Reputation: 93813
Here are a couple of aggregate attempts:
aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40
Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat, as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.
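As a quick sketch of that first step (with dat holding the example data from the question), the default na.action leaves nothing to aggregate:
# na.omit() drops every row, because each one contains at least one NA
nrow(na.omit(dat))
# [1] 0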
Alternatively, don't use the formula interface and specify the columns to aggregate manually:
aggregate(dat[-1], dat[1], FUN=na.omit)
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
# ID Col1 Col2 Col3 Col4
#1 1 5 10 15 20
#2 2 25 30 35 40
Upvotes: 9
Reputation: 10167
The simple way is:
as.data.frame(lapply(myData[, c('Col1','Col2','Col3','Col4')], function(x) x[!is.na(x)]))
but if not all columns have the same number of non-NA values then you'll need to trim them like so:
# Drop the NAs from each column, then truncate every column to the shortest length
temp <- lapply(myData[, c('Col1','Col2','Col3','Col4')], function(x) x[!is.na(x)])
len <- min(sapply(temp, length))
as.data.frame(lapply(temp, `[`, seq(len)))
Upvotes: 1