Syed Ahmed
Syed Ahmed

Reputation: 209

Calculating column differences while handling NA's

I'm fairly new to R and have encountered an interesting problem. My dataset has a lot of NA's and handling them has become a bit problematic.

What I have:

  S1 S2 S3 S4
  NA NA 3  4
  1  3  NA NA
  3  NA 5  NA

What I want to do: Calculate Diff = S4 - S3 - S2 - S1. But if S4 is NA then it should do Diff = S3 - S2 - S1. And so on... But if lets say, S4 and S2 are NA's then the Diff should be S3 - S1.

For example in row 2 the answer should be 2. (3-1)

What I need:

 S1 S2 S3 S4 Diff
 NA NA 3  4  1
 1  3  NA NA 2
 3  NA 5  NA 2

What I have written up till now: (but doesn't work for handling NA's)

data <- cbind(data, Diff = data$S4 - data$S3 -data$S2 -data$S1) 

Additional Info: The values under each column in the original problem are timestamps.

What the data looks like:

head(select(data,S4,S3,S2,S1))

  S4                  S3                  S2                  S1                    
1 2020-07-08 07:46:36 NA                  2020-07-08 07:08:31 NA                 
2 2020-07-08 07:11:57 NA                  2020-07-08 07:08:53 NA                 
3 2020-07-08 07:12:50 NA                  2020-07-08 07:09:40 NA                 
4 2020-07-08 07:13:14 NA                  2020-07-08 07:09:42 NA                 
5 2020-07-08 07:13:46 NA                  2020-07-08 07:10:36 NA                 
6 2020-07-08 07:44:45 NA                  2020-07-08 07:11:42 NA

Updated Version of dataset: (Contains the most accurate representation of the dataset - timestamps are converted to seconds.) What I need? The "Diff" column.

  S4  S3  S2  S1 Diff                    
1 NA  30  40  NA 10                
1 NA  30  NA  40 10                
1 10  30  40  50 40                
1 30  NA  40  NA 10                
1 10  30  40  NA 30                
1 NA  30  40  50 20       

Found the answer here: Take difference between first and last observations in a row, where each row is different

Upvotes: 0

Views: 75

Answers (1)

ThomasIsCoding
ThomasIsCoding

Reputation: 101099

Here is a base R option via apply

df$Diff <- apply(df,1,function(x) 2*tail(na.omit(x),1)-sum(x,na.rm = TRUE))

or

df$Diff <- apply(df,1,function(x) Reduce("-",rev(na.omit(x))))

such that

> df
  S1 S2 S3 S4 Diff
1 NA NA  3  4    1
2  1  3 NA NA    2
3  3 NA  5 NA    2

Data

> dput(df)
structure(list(S1 = c(NA, 1L, 3L), S2 = c(NA, 3L, NA), S3 = c(3L, 
NA, 5L), S4 = c(4L, NA, NA)), class = "data.frame", row.names = c(NA, 
-3L))

Edit

When with timestamp values, maybe you can try the code below

df$Diff <- apply(df,1,function(x) Reduce("-",as.POSIXlt(rev(na.omit(x)))))

such that

> df
  S1                  S2 S3                  S4      Diff
1 NA 2020-07-08 07:08:31 NA 2020-07-08 07:46:36 38.083333
2 NA 2020-07-08 07:08:53 NA 2020-07-08 07:11:57  3.066667
3 NA 2020-07-08 07:09:40 NA 2020-07-08 07:12:50  3.166667
4 NA 2020-07-08 07:09:42 NA 2020-07-08 07:13:14  3.533333
5 NA 2020-07-08 07:10:36 NA 2020-07-08 07:13:46  3.166667
6 NA 2020-07-08 07:11:42 NA 2020-07-08 07:44:45 33.050000

Data

> dput(df)
structure(list(S1 = c(NA, NA, NA, NA, NA, NA), S2 = c("2020-07-08 07:08:31",
"2020-07-08 07:08:53", "2020-07-08 07:09:40", "2020-07-08 07:09:42",
"2020-07-08 07:10:36", "2020-07-08 07:11:42"), S3 = c(NA, NA, 
NA, NA, NA, NA), S4 = c("2020-07-08 07:46:36", "2020-07-08 07:11:57",
"2020-07-08 07:12:50", "2020-07-08 07:13:14", "2020-07-08 07:13:46",
"2020-07-08 07:44:45")), row.names = c("1", "2", "3", "4", "5",
"6"), class = "data.frame")

Upvotes: 2

Related Questions