Phil
Phil

Reputation: 85

Fill missing values with linear regression

I have a dataframe that contains 7 columns.

 str(df)

'data.frame':   8760 obs. of  7 variables:
 $ G1_d20_2014.SE1_ : num  25.1 25.1 25 25 25.1 ...
 $ G1_d20_2014.SE4_ : num  42.4 42.3 42.3 42.3 42.3 ...
 $ G1_d20_2014.SE7_ : num  34.4 34.4 34.4 34.4 34.4 ...
 $ G1_d20_2014.SE22_: num  42.5 42.4 42.3 42.4 42.3 ...
 $ G1_d20_2014.SE14_: num  52.5 52.5 52.5 52.5 52.4 ...
 $ G1_d20_2014.SE26 : num  40.8 40.8 40.8 40.8 40.8 ...

Each column represents a unique sensor and the columns contain measurement data from sensors. Some of the columns contain missing values. I want to fill the data gaps in each column by linear regression. I already did this manually but there is one condition that is very important and I'm looking for a function that does this on its own, as it'd take too much time to do this for all the columns. Here's the condition: Lets say G1_d20_2014_SE1 contains missing data. Then I want to fill the data gaps from that sensor with a complete dataset from another sensor where the correlation coefficient is highest.

Here is how I did that manually:

I created a function that creates an indicator variable. Indicator variable turns to 1 if value is not NA and to 0 if it is NA. Then I added this variable as a column to the dataset:

Indvar <- function(t) {

  x <- dim(length(t))
  x[which(!is.na(t))] = 1
  x[which(is.na(t))] = 0 
  return(x)
}

df$I <- Indvar(df$G1_d20_2014.SE1_)

Next I looked between which sensor and sensor 1 the correlation coefficient is highest (in that case correlation coefficient highest between SE1 and SE14). Then I computed the linear regression, took the equation from it and put it into a for loop that fills up the NA values according to the equation whenever the indicator variable is 0:

lm(df$G1_d20_2014.SE1_ ~ df$G1_d20_2014.SE14_, data = df)

for (i in 1:nrow(df)) {

  if (df$I[i] == 0)

  {

    df$G1_d20_2014.SE1_[i] = 8.037 + 0.315*df$G1_d20_2014.SE14_[i]
  }
}

This works perfectly fine but it takes too much time doing this because I have a lot of dataframes that looks like the one up in the post.

I already tried using impute_lm from the simputation package but unfortunately it does not seem to care about where the correlation is highest before filling the data gaps. Here is what I wrote:

impute_fun <- impute_lm(df, 
    formula = SE1_ + SE4_ ~ SE14_ + SE26)

As I wrote SE14_ + SE26_ I checked if he uses the values from SE14 for imputing the values in SE1 but he doesn't, as the result is different from my manual result.

Is there any function that does what I want? I'm really frustrated because I've been looking for this for over 2 weeks now. I'd really really appreciate some help!

EDIT/Answer to @jay.sf

So I tried to make a function (s. below) out of it but there's something I struggle with:

I don't know how to specify in the function that I want to do this for for every column and that it removes the name of that sensor that I want to fill from the sapply(c("SE1_", "SE2_", ...) Because obviously, if I do this for SE1_ and SE1_ is still in the code the correlation will be 1 and nothing happens. Now as you can see this is also problematic for the rest of the code, e.g. in the line cor(df$SE1_, df[, x], use = "complete.obs")) as it says df$SE1_ here. Same for the df$SE1_imp <- ... line. Of course I could just delete the sensor from the sapply(...) code so the first problem does not occur. I'm just wondering if there's a nicer way to do this. Same for the df$SE1_ parts, if I wanna impute the values for SE2_ then I'd have to change df$SE1_ to df$SE2_ and so on.

I tried to run the code like this (but without the SE1_ in the sapply(...) of course) and I got the error: Error in df[, x] : incorrect number of dimensions. Any ideas how to solve these issues?

      impFUN <- function(df) {

      corr <- sapply(c("SE1_", "SE2_", "SE4_", "SE5_","SE6_",                      
                      "SE7_", "SE12_", "SE13_","SE14_", "SE15_",
                      "SE16_", "SE22_","SE23", "SE24", "SE25",
                      "SE26",  "SE33", "SE34", "SE35", "SE36",
                      "SE37", "SE46", "SE51", "SE52", "SE53",
                      "SE54", "SE59", "SE60", "SE61", "SE62", 
                      "SE68", "SE69", "SE70", "SE71", "SE72", 
                      "SE73","SE74", "SE82", "SE83", "SE84", 
                      "SE85", "SE86", "SE87", "SE99","SE100", 
                      "SE101", "SE102", "SE103","SE104", 
                      "SE106", "SE107","SE121"),  function(x)
                  cor(df$SE1_, df[, x], use = "complete.obs")) 

      imp.use <- names(which.max(corr)) 

      regr.model <- lm(reformulate(imp.use, "SE1_"))

      df$SE1_imp <- 
          ifelse(is.na(df$SE1_), lm.cf[1] + df[[imp.use]]*lm.cf[2], df$SE1_)

    }

Upvotes: 1

Views: 1706

Answers (1)

jay.sf
jay.sf

Reputation: 72683

What about this? First check which sensor correlates most with sensor 1.

corr <- sapply(c("sensor.2", "sensor.3", "sensor.4"), function(x) 
  cor(dat$sensor.1, dat[,x], use="complete.obs"))
#   sensor.2    sensor.3    sensor.4 
# 0.04397132  0.26880412 -0.06487781 

imp.use <- names(which.max(corr))
# [1] "sensor.3"

Calculate the regression model,

lm.cf <- lm(reformulate(imp.use, "sensor.1"), dat)$coef

and to impute sensor 1 use the coefficients in an ifelse like this:

dat$sensor.1.imp <- 
  ifelse(is.na(dat$sensor.1), lm.cf[1] + dat[[imp.use]]*lm.cf[2], dat$sensor.1)

Result

head(dat)
#     sensor.1   sensor.2   sensor.3    sensor.4 sensor.1.imp
# 1  2.0348728 -0.6374294  2.0005714  0.03403394    2.0348728
# 2 -0.8830567 -0.8779942  0.7914632 -0.66143678   -0.8830567
# 3         NA  1.2481243 -0.9897785 -0.36361831   -0.1943438
# 4         NA -0.1162450  0.6672969 -2.84821295    0.2312968
# 5  1.0407590  0.1906306  0.3327787  1.16064011    1.0407590
# 6  0.5817020 -0.6133034  0.5689318  0.71543751    0.5817020

Toy data:

library('MASS')
set.seed(42)
M <- mvrnorm(n=1e2, mu=c(0, 0, 0, 0), 
             Sigma=matrix(c(1, .2, .3, .1,
                            .2, 1, 0, 0, 
                            .3, 0, 1, 0,
                            .1, 0, 0, 1), nrow=4),
             empirical=TRUE)
dat <- as.data.frame(`colnames<-`(M, paste0("sensor.", 1:4)))
dat[sample(1:nrow(dat), 30), "sensor.1"] <- NA  ## generate 30% missings

Upvotes: 1

Related Questions