Rair

Reputation: 1

Will this R code count how many survey submissions had values for all 4 of these variables?

I'm teaching myself R and I think this code counts the number of times a survey has a value (not NA) for all 4 of these variables in ()?

Can someone confirm or correct me? Thanks for helping out a nervous newbie. I need this number for a denominator (surveys without missing data). Thanks!

sum(!is.na(Both_hwstations) & 
    !is.na(Both_latrines) & 
    !is.na(rapid_unique$hf_ipcfocal) & 
    !is.na(rapid_unique$water_avail)
)

Upvotes: 0

Views: 41

Answers (1)

dario

Reputation: 6485

First of all, I think the answer to your question

this code counts the number of times a survey has a value (not NA) for all 4 of these variables in ()?

is yes... or rather: yes, but...

Just to illustrate what everyone is commenting on:

This is a simplified version of the problem:

varA <- c(7:10)
varB <- c(1:3, NA)
df <- data.frame(v1 = 1:4,
                 v2 = 11:14)

varA and varB each have length 4, and df has 4 rows.

varA

[1]  7  8  9 10

varB

[1]  1  2  3 NA

df

  v1 v2
1  1 11
2  2 12
3  3 13
4  4 14

Your sum code

sum(!is.na(varA) & 
    !is.na(varB) & 
    !is.na(df$v1) & 
    !is.na(df$v2))

Returns:

[1] 3

Because R treats TRUE as 1 and FALSE as 0 when doing arithmetic on logical values, sum() counts the rows where all four conditions hold. So far so good...
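To see that coercion on its own, here is a minimal sketch with a made-up logical vector (flags is a hypothetical name, not something from the question):

```r
# Hypothetical logical vector, only for illustration
flags <- c(TRUE, FALSE, TRUE, TRUE)

# sum() coerces TRUE to 1 and FALSE to 0, so it counts the TRUEs
n_true <- sum(flags)
n_true
# [1] 3

# mean() relies on the same coercion and gives the share of TRUEs
share_true <- mean(flags)
share_true
# [1] 0.75
```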

But if we change the vectors to

varA <- c(NA, 0)
varB <- c(1)

varA

[1] NA  0

varB

[1] 1

What would you expect the sum code to return in this case? One row without NA? The result now depends on the somewhat idiosyncratic way R handles vectors of different lengths:

sum(!is.na(varA) & 
      !is.na(varB) & 
      !is.na(df$v1) & 
      !is.na(df$v2))

Returns:

[1] 2

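Because R (in the background) recycled the shorter vectors: in effect varA was repeated twice to match the length of df$v1, and varB was repeated four times. R gives no warning here, because each longer length is a multiple of the shorter one.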
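The recycling can be made visible by spelling it out with rep(). This is only a sketch of what happens in effect, not of R's internals, and recycled_A, recycled_B, ok and lengths_match are names I made up for the illustration:

```r
# Same toy data as in the example
varA <- c(NA, 0)
varB <- c(1)
df <- data.frame(v1 = 1:4, v2 = 11:14)

# What R effectively compares once everything is stretched to 4 elements
recycled_A <- rep(varA, length.out = nrow(df))   # NA 0 NA 0
recycled_B <- rep(varB, length.out = nrow(df))   # 1 1 1 1

ok <- !is.na(recycled_A) & !is.na(recycled_B) & !is.na(df$v1) & !is.na(df$v2)
ok
# [1] FALSE  TRUE FALSE  TRUE
n_ok <- sum(ok)
n_ok
# [1] 2

# A simple guard: TRUE only if every input has exactly one value per row
lengths_match <- all(lengths(list(varA, varB, df$v1, df$v2)) == nrow(df))
lengths_match
# [1] FALSE
```

Running a check like lengths_match before counting is one way to catch mismatched inputs that R would otherwise recycle silently.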

For that and other reasons it's often safer to explicitly bind the columns together first, and then use something like complete.cases to test for these things. That way we can be sure(r) that we don't accidentally mistake artefacts of the programming language for data ;)

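Going back to the original four-element vectors, we could instead do:

# restore the vectors from the first example
varA <- c(7:10)
varB <- c(1:3, NA)
df$varA <- varA
df$varB <- varB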
df

  v1 v2 varA varB
1  1 11    7    1
2  2 12    8    2
3  3 13    9    3
4  4 14   10   NA

And use base R's complete.cases:

df[complete.cases(df), ]


  v1 v2 varA varB
1  1 11    7    1
2  2 12    8    2
3  3 13    9    3
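Since the question asks for a count rather than the rows themselves, the logical vector from complete.cases can be summed directly. A minimal sketch on the toy data (n_complete and n_complete_subset are names chosen just for this example):

```r
# Rebuild the combined data frame from the example
df <- data.frame(v1 = 1:4,
                 v2 = 11:14,
                 varA = 7:10,
                 varB = c(1:3, NA))

# complete.cases() returns one TRUE/FALSE per row; sum() counts the TRUEs
n_complete <- sum(complete.cases(df))
n_complete
# [1] 3

# Restrict the check to selected columns if the data frame has other ones too
n_complete_subset <- sum(complete.cases(df[, c("varA", "varB")]))
n_complete_subset
# [1] 3
```

For the survey data this would presumably look like sum(complete.cases(rapid_unique[, c("Both_hwstations", "Both_latrines", "hf_ipcfocal", "water_avail")])), but that assumes all four variables are columns of rapid_unique, which the question does not confirm for Both_hwstations and Both_latrines.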

Upvotes: 1
