Reputation: 2443
I have a data frame with up to 5 measurements (x) and their corresponding time:
df = structure(list(x1 = c(92.9595722286402, 54.2085219673818,
46.3227062573019,
NA, 65.1501442134141, 49.736451235317), time1 = c(43.2715277777778,
336.625, 483.975694444444, NA, 988.10625, 510.072916666667),
x2 = c(82.8368681534474, 53.7981639701784, 12.9993531230419,
NA, 64.5678816290574, 55.331442940348), time2 = c(47.8166666666667,
732, 506.747222222222, NA, 1455.25486111111, 958.976388888889
), x3 = c(83.5433119686794, 65.723072881366, 19.0147593408309,
NA, 65.1989838202356, 36.7000828457705), time3 = c(86.5888888888889,
1069.02083333333, 510.275, NA, 1644.21527777778, 1154.95694444444
), x4 = c(NA, 66.008102917677, 40.6243513885846, NA, 62.1694420909955,
29.0078249523063), time4 = c(NA, 1379.22986111111, 520.726388888889,
NA, 2057.20833333333, 1179.86805555556), x5 = c(NA, 61.0047472617535,
45.324715258421, NA, 59.862110645527, 45.883161439362), time5 = c(NA,
1825.33055555556, 523.163888888889, NA, 3352.26944444444,
1364.99513888889)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
"NA" means that the person (row) didn't have a measurement.
I would like to calculate the difference between the last existing measurement and the first one.
So for the first row it would be x3 minus x1 (about -9.4), for the second it would be x5 minus x1 (about 6.8), and so on.
I tried something like this, which didn't work:
df$diff = apply(df %>% select(., contains("x")), 1, function(x) head(x,
na.rm = T) - tail(x, na.rm=T))
Any suggestions? Also, is apply/rowwise the most efficient way, or is there a vectorized function to do that?
Upvotes: 1
Views: 160
Reputation: 886948
We can use tidyverse methods on the tbl_df. Create a row names column (rownames_to_column), gather the 'x' columns to 'long' format while removing the NA elements (na.rm = TRUE), group by row name, get the difference (diff) of the first and last 'val' values, and bind the extracted column with the original dataset 'df'.
library(tidyverse)
rownames_to_column(df, 'rn') %>%
select(rn, starts_with('x')) %>%
gather(key, val, -rn, na.rm = TRUE) %>%
group_by(rn) %>%
summarise(Diff = diff(c(first(val), last(val)))) %>%
mutate(rn = as.numeric(rn)) %>%
complete(rn = min(rn):max(rn)) %>%
pull(Diff) %>%
bind_cols(df, new_col = .)
# A tibble: 6 x 11
# x1 time1 x2 time2 x3 time3 x4 time4 x5 time5 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 93.0 43.3 82.8 47.8 83.5 86.6 NA NA NA NA -9.42
#2 54.2 337. 53.8 732 65.7 1069. 66.0 1379. 61.0 1825. 6.80
#3 46.3 484. 13.0 507. 19.0 510. 40.6 521. 45.3 523. -0.998
#4 NA NA NA NA NA NA NA NA NA NA NA
#5 65.2 988. 64.6 1455. 65.2 1644. 62.2 2057. 59.9 3352. -5.29
#6 49.7 510. 55.3 959. 36.7 1155. 29.0 1180. 45.9 1365. -3.85
Upvotes: 1
Reputation: 388817
A vectorized way would be to use max.col, where we get the "first" and "last" non-NA value in each row using the ties.method parameter.
#Get the indices of the "x" columns
x_cols <- grep("^x", names(df))
#Subset the data frame to include only the "x" columns
new_df <- as.data.frame(df[x_cols])
#Get the column number of the first and last non-NA value in each row
not_na <- !is.na(new_df)
first_col <- max.col(not_na, ties.method = "first")
last_col <- max.col(not_na, ties.method = "last")
#Subtract the first non-NA value from the last one
df$new_calc <- new_df[cbind(seq_len(nrow(df)), last_col)] -
new_df[cbind(seq_len(nrow(df)), first_col)]
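To see why the max.col trick works, here is a minimal sketch on a made-up 3 x 3 matrix (the name m and its values are purely illustrative, not taken from the question). Every non-NA cell becomes TRUE, so the row maximum is tied across all non-NA columns, and ties.method decides whether the first or the last of them is returned. A row that is entirely NA is tied across all columns, so it picks two NA cells and the difference comes out as NA.

```r
# Toy matrix (hypothetical values); NA marks a missing measurement
m <- matrix(c(10, NA, 30,
              NA,  5,  7,
              NA, NA, NA),
            nrow = 3, byrow = TRUE)

not_na <- !is.na(m)

# TRUE (1) beats FALSE (0); ties among equal cells are broken by position
first_col <- max.col(not_na, ties.method = "first")  # 1 2 1
last_col  <- max.col(not_na, ties.method = "last")   # 3 3 3

# Indexing with a two-column (row, col) matrix pulls one value per row
idx <- seq_len(nrow(m))
res <- m[cbind(idx, last_col)] - m[cbind(idx, first_col)]
res  # 20 2 NA
```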
Using apply, you could do:
x_cols <- grep("^x", names(df))
df$new_calc <- apply(df[x_cols], 1, function(x) {
new_x <- x[!is.na(x)]
if (length(new_x) > 0)
new_x[length(new_x)] - new_x[1L]
else NA
})
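The per-row logic can also be pulled into a small helper and checked on made-up data. This is only a sketch: the helper name last_minus_first and the toy values are assumptions for illustration, and NA_real_ is used in place of NA so that the result is numeric even when every row is empty.

```r
# Hypothetical helper: last non-NA value minus first non-NA value
last_minus_first <- function(x) {
  new_x <- x[!is.na(x)]
  if (length(new_x) > 0) new_x[length(new_x)] - new_x[1L] else NA_real_
}

# Toy data (made-up values), one row per person
toy <- data.frame(x1 = c(10, NA, NA),
                  x2 = c(NA,  5, NA),
                  x3 = c(30,  7, NA))

# Apply the helper row by row over the "x" columns only
toy$new_calc <- apply(toy[grep("^x", names(toy))], 1, last_minus_first)
toy$new_calc  # 20 2 NA
```

A row with a single measurement gives 0, since that value is both the first and the last one.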
Upvotes: 1