Mariah Yelenick
Mariah Yelenick

Reputation: 13

Off by one difference between count distinct in sqldf() and tidy code

I am trying to count the number of distinct names from a dataset in R using the sqldf package, and wanted to check my answer using tidy. I got a slightly different answer, and couldn't figure out what caused it. Here's my code:

mayors <- read_csv(file="https://raw.githubusercontent.com/jmontgomery/jmontgomery.github.io/master/PDS/Datasets/Mayors.csv")
mayorsDF <- as.data.frame(mayors)

library(sqldf)
sqldf("select count(distinct FullName) from mayorsDF") # gives me 1406

allNamesDF <- sqldf("select distinct FullName from mayorsDF")
length(allNamesDF$FullName) # gives me 1407

library(tidyverse)
mayors %>% 
    select("FullName") %>%
    unique() %>%
    count() # gives me 1407

What am I missing? I'm new to the sqldf package, but not new to SQL.

Upvotes: 1

Views: 186

Answers (1)

r2evans
r2evans

Reputation: 160437

SQL does not count nulls as unique values in count distinct, and you have nulls in your data.

sqldf::sqldf("select count(*) as n from mayorsDF where FullName is null")
#    n
# 1 36
allNamesDF <- sqldf::sqldf("select distinct FullName from mayorsDF")
anyNA(allNamesDF)
# [1] TRUE
sum(is.na(allNamesDF))
# [1] 1
length(unique(allNamesDF[!is.na(allNamesDF)]))
# [1] 1406

Related: Counting null values as unique value

Upvotes: 3

Related Questions