Omry Atia

Reputation: 2443

difference between first non-NA and last non-NA in each row

I have a data frame with up to 5 measurements (x) and their corresponding times:

df = structure(list(x1 = c(92.9595722286402, 54.2085219673818, 
46.3227062573019, 
NA, 65.1501442134141, 49.736451235317), time1 = c(43.2715277777778, 
336.625, 483.975694444444, NA, 988.10625, 510.072916666667), 
x2 = c(82.8368681534474, 53.7981639701784, 12.9993531230419, 
NA, 64.5678816290574, 55.331442940348), time2 = c(47.8166666666667, 
732, 506.747222222222, NA, 1455.25486111111, 958.976388888889
), x3 = c(83.5433119686794, 65.723072881366, 19.0147593408309, 
NA, 65.1989838202356, 36.7000828457705), time3 = c(86.5888888888889, 
1069.02083333333, 510.275, NA, 1644.21527777778, 1154.95694444444
), x4 = c(NA, 66.008102917677, 40.6243513885846, NA, 62.1694420909955, 
29.0078249523063), time4 = c(NA, 1379.22986111111, 520.726388888889, 
NA, 2057.20833333333, 1179.86805555556), x5 = c(NA, 61.0047472617535, 
45.324715258421, NA, 59.862110645527, 45.883161439362), time5 = c(NA, 
1825.33055555556, 523.163888888889, NA, 3352.26944444444, 
1364.99513888889)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

"NA" means that the person (row) didn't have a measurement.

I would like to calculate the difference between the last existing measurement and the first one.

So for the first row it would be x3 minus x1 (-9.42), for the second it would be x5 minus x1 (6.80), and so on.

I tried something like this, which didn't work:

df$diff = apply(df %>% select(., contains("x")), 1, function(x) head(x, 
na.rm = T) - tail(x, na.rm=T))

Any suggestions? Also, is apply/rowwise the most efficient way, or is there a vectorized function to do that?

Upvotes: 1

Views: 160

Answers (2)

akrun

Reputation: 886948

We can use tidyverse methods on the tbl_df: create a row-name column (rownames_to_column), gather the 'x' columns into 'long' format while dropping the NA elements (na.rm = TRUE), group by row name, take the difference between the last and first 'val' values, use complete to restore any row that had no measurements at all, and bind the extracted column to the original dataset 'df'.

library(tidyverse)
rownames_to_column(df, 'rn') %>% 
    select(rn, starts_with('x')) %>% 
    gather(key, val, -rn, na.rm = TRUE) %>%
    group_by(rn) %>%
    summarise(Diff = diff(c(first(val), last(val)))) %>% 
    mutate(rn = as.numeric(rn)) %>%
    complete(rn = min(rn):max(rn)) %>% 
    pull(Diff) %>%
    bind_cols(df, new_col = .)
# A tibble: 6 x 11
#     x1 time1    x2  time2    x3  time3    x4 time4    x5 time5 new_col
#  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#1  93.0  43.3  82.8   47.8  83.5   86.6  NA     NA   NA     NA   -9.42 
#2  54.2 337.   53.8  732    65.7 1069.   66.0 1379.  61.0 1825.   6.80 
#3  46.3 484.   13.0  507.   19.0  510.   40.6  521.  45.3  523.  -0.998
#4  NA    NA    NA     NA    NA     NA    NA     NA   NA     NA   NA    
#5  65.2 988.   64.6 1455.   65.2 1644.   62.2 2057.  59.9 3352.  -5.29 
#6  49.7 510.   55.3  959.   36.7 1155.   29.0 1180.  45.9 1365.  -3.85 

Upvotes: 1

Ronak Shah

Reputation: 388817

A vectorized way would be to use max.col, which returns the column index of the "first" or "last" non-NA value in each row depending on the ties.method parameter:

#Column positions of the "x" columns
x_cols <- grep("^x", names(df))

#Get column number of first and last non-NA value in each row
first_col <- max.col(!is.na(df[x_cols]), ties.method = "first")
last_col <- max.col(!is.na(df[x_cols]), ties.method = "last")

#subset the dataframe to include only `"x"` cols
new_df <- as.data.frame(df[x_cols])

#Subtract the first non-NA value from the last one
df$new_calc <- new_df[cbind(1:nrow(df), last_col)] -
               new_df[cbind(1:nrow(df), first_col)]
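
To see why the ties.method trick picks out the first and last non-NA positions, here is a minimal sketch on a toy matrix. The matrix `m` and the names `present`, `first_idx` and `last_idx` are made up for illustration and are not part of the question's data. Because `!is.na()` yields only TRUE/FALSE, every non-NA cell ties for the row maximum, so "first" and "last" resolve the tie to the leftmost and rightmost non-NA column.

```r
# Toy data (hypothetical): rows with trailing NAs, leading NAs, and all NAs
m <- rbind(c(10, 20, NA),
           c(NA,  5,  8),
           c(NA, NA, NA))

present <- !is.na(m)  # logical matrix: TRUE where a value exists

# All TRUE cells tie for the row maximum; ties.method decides which one wins
first_idx <- max.col(present, ties.method = "first")
last_idx  <- max.col(present, ties.method = "last")

# Matrix indexing with (row, column) pairs extracts one value per row
rows <- seq_len(nrow(m))
res <- m[cbind(rows, last_idx)] - m[cbind(rows, first_idx)]
res
```

For an all-NA row every cell ties at FALSE, so the two indices point at the first and last columns, both of which hold NA, and the result for that row is NA.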

Using apply you could do:

x_cols <- grep("^x", names(df))

df$new_calc <- apply(df[x_cols], 1, function(x) {
    new_x <- x[!is.na(x)]
    if (length(new_x) > 0)
      new_x[length(new_x)] - new_x[1L]
    else NA
})
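
The same row-wise logic can be pulled out into a named function so it is easy to check on a small input first. This is only a sketch: `last_minus_first` and the `toy` data frame are hypothetical names introduced here, and returning `NA_real_` (rather than a plain `NA`) is an assumption made so the result is always numeric.

```r
# Hypothetical helper: last non-NA value minus first non-NA value
last_minus_first <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0L) return(NA_real_)  # no measurements in this row
  vals[length(vals)] - vals[1L]
}

# Toy data (made up): trailing NA, leading NA, and an all-NA row
toy <- data.frame(x1 = c(5, NA, NA),
                  x2 = c(7, 2, NA),
                  x3 = c(NA, 9, NA))

toy$diff <- apply(toy[c("x1", "x2", "x3")], 1, last_minus_first)
toy$diff
```

Note that a row with exactly one measurement yields 0, since the first and last non-NA values are the same element.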

Upvotes: 1
