kRazzy R

Reputation: 1589

How are results of complete.cases() and data[is.na(data)] <- 0 different?

I have a dataframe data and after several computations on it, the final dataframe df.final has some missing values in it.

Before going ahead with further calculations on df.final, am I better off setting all missing values to zero with

data[is.na(data)] <- 0

as suggested in "How do I replace NA values with zeros in R?", or would doing

df.final <- df.final[complete.cases(df.final), ] # keep only the rows without any NA

be more beneficial?

How are the two different?

Upvotes: 2

Views: 521

Answers (1)

Derwin McGeary

Reputation: 460

If you set NA to zero, then the effect on your calculations is as if you measured it and got zero. So if you're measuring temperatures in July, you'll get results as if you had a few frosty days in there. Your average temperature will be lower.

If you set na.rm = TRUE or use complete.cases(), the effect is as if that measurement never happened (which is the case, really). So our average temperature in July would be the average only for the days we did measure. Note that complete.cases() works on whole rows: a row is dropped if any of its columns is NA.
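A minimal sketch of the temperature example, using made-up July readings (the vector july_temp and its values are hypothetical, not from the question). It shows how far apart the two averages can end up:

```r
# Hypothetical July temperatures in degrees Celsius; two days were not measured.
july_temp <- c(24, 26, NA, 25, NA, 27)

# Option 1: replace NA with 0, as if 0 degrees had actually been measured.
temp_zero <- july_temp
temp_zero[is.na(temp_zero)] <- 0
mean_zero <- mean(temp_zero)

# Option 2: ignore the missing days entirely.
mean_dropped <- mean(july_temp, na.rm = TRUE)

print(mean_zero)     # 17
print(mean_dropped)  # 25.5
```

The zero-filled mean is pulled down by two "frosty days" that never happened.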

If you only have a few isolated NA values (you can count them with sum(is.na(df.final))), then you might want to fill them in with 0 or, better, with some other sensible value. In this example the average temperature in July might be a good choice.
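As a rough illustration of counting the NA values and filling them with the mean instead of zero (again with a hypothetical july_temp vector):

```r
# Hypothetical July temperatures; two days were not measured.
july_temp <- c(24, 26, NA, 25, NA, 27)

# How many values are missing?
n_missing <- sum(is.na(july_temp))

# Fill the gaps with the mean of the days that were measured.
temp_filled <- july_temp
temp_filled[is.na(temp_filled)] <- mean(july_temp, na.rm = TRUE)
mean_filled <- mean(temp_filled)

print(n_missing)    # 2
print(mean_filled)  # 25.5
```

Filling with the mean leaves the average unchanged, but it does make the data look less variable than it really is, so this is a judgment call rather than a free fix.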

I would only set them to zero if there were vanishingly few (so I don't really care that it's skewing my measurements) or if zero is a sensible value (for example, if we want work experience in months, NA might well mean "no experience").

Software is soft: if your dataset is small enough, you can try both and observe how much it affects your results.
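Trying both on a small data frame might look like the sketch below. The columns and values of df.final here are invented for illustration; only the two NA-handling lines come from the question.

```r
# Hypothetical stand-in for df.final with one NA in each measurement column.
df.final <- data.frame(
  day  = 1:5,
  temp = c(24, NA, 25, 27, 26),
  rain = c(0, 3, NA, 1, 0)
)

# Approach 1: replace every NA with 0 (all rows are kept).
df.zero <- df.final
df.zero[is.na(df.zero)] <- 0

# Approach 2: keep only the rows that have no NA in any column.
df.complete <- df.final[complete.cases(df.final), ]

rows_zero     <- nrow(df.zero)      # 5
rows_complete <- nrow(df.complete)  # 3

# Compare the column means under each approach.
means_zero     <- colMeans(df.zero[, c("temp", "rain")])
means_complete <- colMeans(df.complete[, c("temp", "rain")])

print(means_zero)
print(means_complete)
```

Note that complete.cases() throws away day 2 even though its rain value was measured, while the zero fill keeps every row but drags the temp mean down. Which cost matters more depends on your data.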

Upvotes: 2
