Reputation: 1589
I have a dataframe data, and after several computations on it the final dataframe df.final has some missing values in it. Before going ahead with further calculations on df.final, am I better off making all the missing values zeros with

df.final[is.na(df.final)] <- 0

as suggested at How do I replace NA values with zeros in R?, or would

df.final <- df.final[complete.cases(df.final), ]  # keep only rows without NA

be more beneficial?
How are the two different?
Upvotes: 2
Views: 521
Reputation: 460
If you set NA to zero, the effect on your calculations is as if you measured it and got zero. So if you're measuring temperatures in July, you'll get results as if you had a few frosty days in there: your average temperature will be lower.
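To make that concrete, here is a minimal sketch with made-up July temperatures (the vector temps is invented purely for illustration):

temps <- c(28, 30, NA, 29, 31)           # one missing reading

temps_zeroed <- temps
temps_zeroed[is.na(temps_zeroed)] <- 0   # pretend the missing day measured 0 degrees
mean(temps_zeroed)                       # 23.6 -- dragged down by the fake frosty day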
If you set na.rm=TRUE or use complete.cases, the effect is as if that measurement never happened (which is the case, really). So your average temperature in July would be the average over only the days you did measure.
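Continuing the same made-up numbers, dropping the NA instead looks like this (df here is a small hypothetical data frame standing in for your df.final):

temps <- c(28, 30, NA, 29, 31)
mean(temps, na.rm = TRUE)      # 29.5 -- average over the days actually measured

df <- data.frame(day = 1:5, temp = temps)
df[complete.cases(df), ]       # keeps only the 4 rows with a measured temperature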
If you only have a few isolated NA values (check with sum(is.na(df.final))), then you might want to set them all to 0, or to some other sensible value; in this example the average temperature in July might be a good choice.
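A quick sketch of that kind of imputation, again on the invented temperatures:

temps <- c(28, 30, NA, 29, 31)
temps[is.na(temps)] <- mean(temps, na.rm = TRUE)   # fill the gap with the observed mean
mean(temps)                                        # 29.5 -- the overall average is unchanged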
I would only set them to zero if there were vanishingly few (so I don't really care that they skew my measurements) or if zero was a sensible value (for example, if the variable is work experience in months, NA might well mean "no experience").
Software is soft: if your dataset is small enough, you can try both and observe how much it affects your data.
Upvotes: 2