Reputation: 3893
When selecting a subset of data from a dataframe, I get row(s) entirely made up of NA values that were not present in the original dataframe. For example:
example.df[example.df$census_tract == 27702, ]
returns:
      census_tract number_households_est
NA              NA                    NA
23611        27702                  2864
Where did that first row of NAs come from? And why is it returned even though example.df$census_tract != 27702 for that row?
Upvotes: 1
Views: 628
Reputation: 3243
That is because there is a missing observation in census_tract:
> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
   census_tract number_households_est
64           NA                    NA
When == evaluates the 64th row it gives NA, because we can't know whether 27702 is equal to the missing value. Therefore the result is itself missing (aka NA). So an NA is put in the logical vector used for indexing. And this gives, by default, a row full of NA, because we are asking for a row but "we don't know which one".
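To see this on a small scale, here is a minimal sketch with a made-up three-row data frame. The name toy.df and its values are assumptions for illustration only, not the original data.

```r
# Hypothetical toy data: the second row has a missing census_tract.
toy.df <- data.frame(
  census_tract = c(27701, NA, 27702),
  number_households_est = c(1500, NA, 2864)
)

# Comparing with == propagates the missing value into the logical index.
idx <- toy.df$census_tract == 27702
print(idx)

# An NA in a logical row index produces a row filled with NA,
# alongside the row that really matches.
res <- toy.df[idx, ]
print(res)
```

The NA-filled row in the result does not correspond to any single row of toy.df; it is what the NA in the index produces.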
A proper way is to use %in%, which returns FALSE instead of NA when the value is missing:
> example.df[example.df$census_tract %in% 27702, ]
      census_tract number_households_est
23611        27702                  2864
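For comparison, a few other base R idioms should also leave out the NA row. This is only a sketch on a made-up data frame (toy.df and its values are hypothetical), not a claim about which form is best for the original data.

```r
# Hypothetical toy data: the second row has a missing census_tract.
toy.df <- data.frame(
  census_tract = c(27701, NA, 27702),
  number_households_est = c(1500, NA, 2864)
)

# which() keeps only the positions that are TRUE, so NA is dropped.
by_which <- toy.df[which(toy.df$census_tract == 27702), ]

# subset() treats an NA condition as FALSE.
by_subset <- subset(toy.df, census_tract == 27702)

# Explicitly excluding missing values before comparing.
by_notna <- toy.df[!is.na(toy.df$census_tract) & toy.df$census_tract == 27702, ]

print(by_which)
print(by_subset)
print(by_notna)
```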
HTH, Luca
Upvotes: 3