Sang won kim
Sang won kim

Reputation: 522

Impute na using average on both sides in row

When I visualized my data, it had a series of periodicities in row. But Randomforest imputing and PCA imputing create outlier.

Main problem :

So i want to average on both sides in row. When the NA value of the nth column occurs, the average of n-1 column and n+1 column impute corresponding row.

Sub problem

  1. But first column and last column doesn't have n-1 or n+1, therefore i will take n+1 or n-1.(don't worry, deviation of row is very small.)

  2. It occur na value in row continuosly. It also take n-1 or n+1.

EX:

tr <- structure(list(A_1 = c(NA,2,3,4,5), A_2 =c(4,5,6,NA,8), A_3 = c(7,9,NA,10,11), 
                     A_4 = c(10,12,NA,13,NA), A_5 =c(12,NA,14,15,16), A_6 = c(13,15,15,16,17)), 
                row.names = c(NA, -5L),class = "data.frame")
> tr
  A_1 A_2 A_3 A_4 A_5 A_6
1  NA   4   7  10  12  13
2   2   5   9  12  NA  15
3   3   6  NA  NA  14  15
4   4  NA  10  13  15  16
5   5   8  11  NA  16  17

Desired output

> tr
      A_1 A_2 A_3   A_4   A_5  A_6
    1   4   4   7    10    12   13
    2   2   5   9    12  13.5   15
    3   3   6   6    14    14   15
    4   4   7  10    13    15   16
    5   5   8  11  13.5    16   17

Upvotes: 3

Views: 97

Answers (3)

Sotos
Sotos

Reputation: 51592

One way via dplyr is to convert to long format, take the lag() and lead() of your value column, compute the row means, replace NA and convert back to wide. i.e.

library(dplyr)
library(tidyr)

tr %>% 
 pivot_longer(everything()) %>% 
 mutate(n1 = lag(value), n2 = lead(value)) %>% 
 mutate(res = rowMeans(select(., c(n1, n2)), na.rm = TRUE), 
        value = replace(value, is.na(value), res[is.na(value)])) %>% 
 select(name, value) %>% 
 pivot_wider(names_from = name, values_from = value) %>% 
 unnest()

which gives,

    A_1   A_2   A_3   A_4   A_5   A_6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4     4     7  10    12      13
2     2     5     9  12    13.5    15
3     3     6     6  14    14      15
4     4     7    10  13    15      16
5     5     8    11  13.5  16      17

Upvotes: 2

Darren Tsai
Darren Tsai

Reputation: 35624

A base R solution:

t(apply(tr, 1, function(x){
  ifelse(!is.na(x), x,
         rowMeans(cbind(c(NA, x[-length(x)]), c(x[-1], NA)), na.rm = T))
}))

#      A_1 A_2 A_3  A_4  A_5 A_6
# [1,]   4   4   7 10.0 12.0  13
# [2,]   2   5   9 12.0 13.5  15
# [3,]   3   6   6 14.0 14.0  15
# [4,]   4   7  10 13.0 15.0  16
# [5,]   5   8  11 13.5 16.0  17

Upvotes: 1

s_baldur
s_baldur

Reputation: 33613

Using data.table:

# Define helper function
foo <- function(x) sapply(transpose(shift(x, -1:1)), mean, na.rm = TRUE)

setDT(tr)[, melt(.SD, measure.vars = patterns("^A"))
          ][, value := fcoalesce(value, foo(value)), by = rowid(variable)
            ][, dcast(.SD, rowid(variable) ~ variable)
              ][, !"variable"]

   A_1 A_2 A_3  A_4  A_5 A_6
1:   4   4   7 10.0 12.0  13
2:   2   5   9 12.0 13.5  15
3:   3   6   6 14.0 14.0  15
4:   4   7  10 13.0 15.0  16
5:   5   8  11 13.5 16.0  17

Or avoid the melting/dcasting with apply() from base R:

tr[] <- t(apply(tr, 1, function(x) fcoalesce(x, foo(x))))

  A_1 A_2 A_3  A_4  A_5 A_6
1   4   4   7 10.0 12.0  13
2   2   5   9 12.0 13.5  15
3   3   6   6 14.0 14.0  15
4   4   7  10 13.0 15.0  16
5   5   8  11 13.5 16.0  17

Upvotes: 1

Related Questions