Unexpected row(s) of NAs when selecting subset of dataframe

Question

When selecting a subset of data from a dataframe, I get row(s) entirely made up of NA values that were not present in the original dataframe. For example:

example.df[example.df$census_tract == 27702, ]

returns:

      census_tract number_households_est
NA              NA                    NA
23611        27702                  2864

Where did that first row of NAs come from? And why is it returned even though example.df$census_tract != 27702 for that row?

Luca Braglia · Accepted Answer

That is because census_tract contains a missing observation:

> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
   census_tract number_households_est
64           NA                    NA

When == evaluates the 64th row it returns NA, because we cannot know whether 27702 is equal to a missing value. The result of the comparison is therefore itself missing (NA), so an NA ends up in the logical vector used for indexing. Indexing rows with an NA returns a row filled with NAs, because we are asking for a row but "we don't know which one".
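To see the mechanism in isolation, here is a minimal sketch on a made-up three-row data frame. The name toy.df and every value other than 27702 and 2864 are assumptions for illustration, not the asker's real data.

```r
# Hypothetical toy data standing in for example.df (values are illustrative).
toy.df <- data.frame(
  census_tract          = c(27701, NA, 27702),
  number_households_est = c(1500, NA, 2864)
)

# == propagates the missing value into the logical index.
idx <- toy.df$census_tract == 27702
print(idx)

# An NA in a logical row index produces a row filled with NA.
res <- toy.df[idx, ]
print(res)
```

The second element of idx is NA rather than FALSE, and that single NA is what produces the extra row in res.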

One way to avoid this is to use %in%, which returns FALSE rather than NA for a missing value:

> example.df[example.df$census_tract %in% 27702, ]
      census_tract number_households_est
23611        27702                  2864
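If you would rather keep ==, there are a few other common ways to drop the NA from the index. This is a sketch on the same kind of made-up data (toy.df and its values are assumptions, not the original data frame), and which one reads best is largely a matter of taste.

```r
# Hypothetical toy data standing in for example.df (values are illustrative).
toy.df <- data.frame(
  census_tract          = c(27701, NA, 27702),
  number_households_est = c(1500, NA, 2864)
)

# which() keeps only the positions that are TRUE, discarding FALSE and NA.
by_which <- toy.df[which(toy.df$census_tract == 27702), ]

# An explicit guard turns the NA comparison into FALSE (FALSE & NA is FALSE).
by_guard <- toy.df[!is.na(toy.df$census_tract) & toy.df$census_tract == 27702, ]

# subset() treats NA in the condition as FALSE.
by_subset <- subset(toy.df, census_tract == 27702)

print(by_which)
print(by_guard)
print(by_subset)
```

All three return only the matching row, just like the %in% version.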

HTH, Luca
