Reputation: 121
I have observed subjects a
-d
during 2-4 years and each year they reported a numerical value. I want to extract the first and last value of each subject, ignoring NAs. How do I create the new variables first_value
and last_value
? In this example, I have included the desired result:
df <- data.frame(subject = c("a","b","c","d"),
year1 = c(1, 2, NA, NA),
year2 = c(3, 4, NA, 5),
year3 = c(6, 7, 8, NA),
year4 = c(9, 10, NA, 11),
first_value <- c(1, 2, 8, 5),
last_value <- c(9, 10, 8, 11))
And what would be the solution if variables year1
-year4
were categorical?
Upvotes: 1
Views: 2086
Reputation: 83215
Using the data.table
package:
library(data.table)
setDT(df)[, `:=` (first_value = na.omit(unlist(.SD))[1],
last_value = tail(na.omit(unlist(.SD)),1)),
by = subject][]
which gives:
subject year1 year2 year3 year4 first_value last_value
1: a 1 3 6 9 1 9
2: b 2 4 7 10 2 10
3: c NA NA 8 NA 8 8
4: d NA 5 NA 11 5 11
Following the suggestion of @alexis_laz, you can use max.col
as follows to get the repective values:
f <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'first')
l <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'last')
df$first_value <- sapply(seq_along(f), function(i) df[,-1][i,f[i]])
df$last_value <- sapply(seq_along(l), function(i) df[,-1][i,l[i]])
which will get you the same result. As suggested by @alexis_laz in the comments, this can be further improved to:
m <- as.matrix(df[c("year1", "year2", "year3", "year4")])
f <- max.col(!is.na(m), 'first')
l <- max.col(!is.na(m), 'last')
df$first_value <- df[-1][cbind(1:nrow(df), f)]
df$last_value <- df[-1][cbind(1:nrow(df), l)]
And using the dplyr
and tidyr
packages:
library(dplyr)
library(tidyr)
df %>%
gather(year, val, 2:5) %>%
filter(!is.na(val)) %>%
group_by(subject) %>%
summarise(first_value = first(val),
last_value = last(val)) %>%
left_join(df, ., by = 'subject')
WARNING: A variation of this without using filter
and using na.omit(val)
(or val[!is.na(val)]
) in summarise
:
df %>%
gather(year, val, 2:5) %>%
group_by(subject) %>%
summarise(first_value = first(na.omit(val)),
last_value = last(na.omit(val))) %>%
left_join(df, ., by = 'subject')
won't work as a result of the bugs reported here and here.
Upvotes: 3
Reputation: 5696
using data.frame
and gather
#Used packages
library(tidyr)
library(dplyr)
subject<-c("a","b","c","d")
year1 <- c(1, 2, NA, NA)
year2 <- c(3, 4, NA, 5)
year3 <- c(6, 7, 8, NA)
year4 <- c(9, 10, NA, 11)
dt = data.frame(subject, year1, year2, year3, year4)
gather()
to collapse the multiple columns into a single column
dt <- dt %>% gather(year, value, year1:year4)
summarise( )
: to perform summary statistics on the selected variables
dt %>% group_by(subject)%>%
summarise(first_value = min(value, na.rm=TRUE),
last_value = max(value, na.rm=TRUE))
output:
# A tibble: 4 × 3
subject first_value last_value
<fctr> <dbl> <dbl>
1 a 1 9
2 b 2 10
3 c 8 8
4 d 5 11
Upvotes: 0