didimichael

Reputation: 71

How can I ignore the NA data when I call the lm function?

My question is rather simple, but I could not get it resolved after trying a lot of things.

I have two data frames.

> a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1

> b
  col1 col2 col3 col4
1    5    2    1    4    
2    2   NA    2    3    
3    3   NA    3    2    
4    4    3    4    1

Can I call lm(a ~ b) to fit the data in a and b?

If I do, how do I ignore the NA data?

Thanks, Dan

Upvotes: 2

Views: 16576

Answers (2)

Spacedman

Reputation: 94182

If a and b are data frames, and you want to regress the individual values in a on the values in b, then you need to convert them to vectors, e.g.:

> lm(as.vector(as.matrix(a))~as.vector(as.matrix(b)))

Call:
lm(formula = as.vector(as.matrix(a)) ~ as.vector(as.matrix(b)))

Coefficients:
            (Intercept)  as.vector(as.matrix(b))  
               8.418239                -0.005241  

Missing data is dropped by default: see help(lm) and the na.action argument. The summary method on an lm object reports how many observations were dropped.
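As a rough sketch of what that looks like in practice, the snippet below rebuilds the two data frames from the question (typed in by hand here, so treat the construction as an assumption) and fits the flattened vectors. The names y, x, fit and fit_ex are just illustrative.

```r
# Data frames retyped from the question (assumed to be all numeric)
a <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, NA, 2, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))
b <- data.frame(col1 = c(5, 2, 3, 4), col2 = c(2, NA, NA, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))

y <- unlist(a)
x <- unlist(b)

# The session-wide default that lm() picks up; normally "na.omit"
getOption("na.action")

# Pairs where either y or x is NA are dropped before fitting
fit <- lm(y ~ x, na.action = na.omit)
nobs(fit)               # observations actually used in the fit
length(fit$na.action)   # observations dropped because of NAs

# na.exclude fits the same model, but residuals() and fitted()
# are padded with NA so they line up with the original input
fit_ex <- lm(y ~ x, na.action = na.exclude)
length(residuals(fit_ex))
```

If you want residuals that line up with the original cells (for plotting or mapping), na.exclude is probably the more convenient choice; the coefficients are the same either way.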

Of course ignoring the spatial correlation likely to be present in spatial data will mean your inferences from the parameter estimates will be quite wrong. Map the residuals. And read a good book on spatial stats...

[Edit: oh, and the data frames have to be all numbers or the whole lot gets converted to characters and then... well, who knows...]

Edit:

Another way of getting vectors from data frames is just to use 'unlist':

> a=data.frame(matrix(runif(16),4,4))
> b=data.frame(matrix(runif(16),4,4))
> lm(a~b)
Error in model.frame.default(formula = a ~ b, drop.unused.levels = TRUE) : 
  invalid type (list) for variable 'a'
> lm(unlist(a)~unlist(b))

Call:
lm(formula = unlist(a) ~ unlist(b))

Coefficients:
(Intercept)    unlist(b)  
     0.6488      -0.3137  

I've not seen data.matrix before, thx Gavin.
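For what it's worth, here is a small sketch comparing the three conversions on an all-numeric data frame, plus what happens when a non-numeric column sneaks in. The data frames num_df and mixed_df are made up for illustration.

```r
# Hypothetical all-numeric data frame with one NA
num_df <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, NA, 2, 3))

v1 <- as.vector(as.matrix(num_df))
v2 <- unname(unlist(num_df))
v3 <- as.vector(data.matrix(num_df))

# All three flatten the data frame column by column
all.equal(v1, v2)
all.equal(v1, v3)

# Hypothetical data frame with a character column
mixed_df <- data.frame(x = c(1, 2), g = c("u", "v"))

is.character(as.matrix(mixed_df))   # as.matrix falls back to character
is.numeric(data.matrix(mixed_df))   # data.matrix always returns numbers
```

Note that a numeric result from data.matrix on a character column is not necessarily a meaningful one, so it is still worth checking the column types before regressing.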

Upvotes: 2

IRTFM

Reputation: 263332

Generally the regression functions in R will only report the results from complete cases, so you do not usually need to do anything special to hold out cases. Your question seems a bit vague, and it is not clear why you are putting an entire matrix (or is that a data.frame?) on the left-hand side of a formula. There is the capability of doing multivariate analyses with the lm() function, but people who want to do so will generally ask more specific questions.

> lm(a$col1 ~ b$col1+b$col2 +b$col3+b$col4)

Call:
lm(formula = a$col1 ~ b$col1 + b$col2 + b$col3 + b$col4)

Coefficients:
(Intercept)       b$col1       b$col2       b$col3       b$col4  
         16           -3           NA           NA           NA  

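The tiny amount of data is the problem here: two of the four cases are dropped because of the NAs in b$col2, and the two that remain are only enough to estimate the intercept and one slope, so the other coefficients come back as NA.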
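A quick way to check this for yourself, sketched under the assumption that the data frames are exactly as typed in the question (the combined data frame d and the name fit are just for illustration):

```r
# Data frames retyped from the question
a <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, NA, 2, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))
b <- data.frame(col1 = c(5, 2, 3, 4), col2 = c(2, NA, NA, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))

# Response from a, predictors from b, in a single data frame
d <- data.frame(y = a$col1, b)

# Rows with no NA in any variable: the only rows lm() can use
sum(complete.cases(d))

fit <- lm(y ~ col1 + col2 + col3 + col4, data = d)
nobs(fit)    # observations used in the fit
fit$rank     # number of coefficients that could be estimated
coef(fit)    # the rest are reported as NA
```

Comparing sum(complete.cases(d)) with the number of coefficients you are asking for is a cheap sanity check before reading anything into the output.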

Upvotes: 4
