Chuck

Reputation: 3852

Difference between SQLite and PROC SQL when evaluating blank space

I am comparing two datasets to look for duplicate entries on certain columns.

I first did this in SAS with PROC SQL (which I consider to give the true outcome), using the following query:

proc sql;
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2;
quit;

I output this result to CSV, giving output_sas.csv.

I have also done this in Python with SQLite3, using the same query:

import sqlite3

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2
""")

I output this to CSV, giving output_python.csv.
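(For reference, the export of t1 to output_python.csv is just a matter of reading the joined table back out and writing it; the snippet below is a sketch of one way to do that with the conn from above, not my exact code.)

import pandas as pd

# Sketch: pull the joined table back out of SQLite and write it to CSV
pd.read_sql("SELECT * FROM t1", conn).to_csv("output_python.csv", index=False)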

The problem:

The outputs should be the same but they are not:

output_sas.csv contains 123 more records than output_python.csv.

Within the SAS output file (output_sas.csv), there are 123 records that contain a blank string "" in the yob1 and yob2 columns. As an example, those 123 records look like this sample:

yob1 yob2 cob1 cob2 surname1 surname2
""     ""     1    1    xx       xx
""     ""     2    2    yy       yy
.
.
.
# Continues for 123 records

I find that this difference is due to the yob1 and yob2 columns, which in the above 123 records contain a blank string. These 123 record pairs are missing from the output_python.csv file.

[Note: In this work, a string of length zero corresponds to a missing value]

In short:

The PROC SQL routine in SAS is evaluating blank space as equal, i.e. "" == "" -> TRUE.

The Python SQLite code appears to be doing the opposite, i.e. "" == "" -> FALSE.

This is happening even though "" == "" -> True in Python.
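(As a sanity check on the comparison semantics themselves, a quick throwaway in-memory SQLite session shows what the engine does with empty strings versus NULLs; this is just an illustration, not part of my pipeline.)

import sqlite3

demo_conn = sqlite3.connect(":memory:")
print(demo_conn.execute("SELECT '' = ''").fetchone())      # (1,)    -> true
print(demo_conn.execute("SELECT NULL = NULL").fetchone())  # (None,) -> unknown, so a join drops the row
print(demo_conn.execute("SELECT NULL IS NULL").fetchone()) # (1,)    -> true
demo_conn.close()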

The question:

Why is this the case and what do I need to change to match up the SQLite output to the PROC SQL output?

Note: Both routines use the same input datasets. They are entirely equal, and I even manually amended the data in Python to ensure that the columns yob1 and yob2 contain "" for missing values.

Update 1:

At the moment my SAS PROC SQL code uses data1.sas7bdat, named local, and data2.sas7bdat, named neighbor.

To use the same datasets in Python, I export them from SAS to CSV and read them into Python.

If I do:

import pandas as pd
# read in
dflocal     = pd.read_csv(csv_path_local, index_col=False)
dfneighbor  = pd.read_csv(csv_path_neighbor, index_col=False)

Pandas converts missing values to nan. We can use isnull() to find the number of nan values in each of the columns:

# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
78
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
184

To solve the null value problem, I then explicitly convert nan to a string of length zero "" by running:

dflocal['yob1'].fillna(value="", axis=0, inplace=True)
dfneighbor['yob2'].fillna(value="", axis=0, inplace=True)

We can test if the values got updated by testing a known nan:

dflocal.iloc[393].yob1
''
type(dflocal.iloc[393].yob1)
str

So they are a string of length 0.

Then I read these into SQLite via:

dflocal.to_sql('local', con=conn, flavor='sqlite', if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, flavor='sqlite', if_exists='replace', index=False)
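(A useful diagnostic at this point, a sketch rather than part of my original workflow, is to ask SQLite directly whether the "missing" yob values actually arrived as empty strings or as NULLs:)

# Sketch: count NULLs vs empty strings in the columns used by the join
for table, col in [("local", "yob1"), ("neighbor", "yob2")]:
    nulls  = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL").fetchone()[0]
    blanks = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {col} = ''").fetchone()[0]
    print(table, col, "NULL:", nulls, "empty string:", blanks)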

Then execute the same SQLite3 code:

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2
""")

Even though I have made this explicit change, I STILL get the same 123 missing records, despite the null values having been changed to a string of length zero "".

Potential Solution:

However, if I instead import the datasets with the na_filter=False argument, the conversion from null to "" is done for me:

dflocal     = pd.read_csv(csv_path_local, index_col=False, na_filter=False)
dfneighbor  = pd.read_csv(csv_path_neighbor, index_col=False, na_filter=False)

# find null / nan values in yob1 and yob2 in each dataset

len(dflocal.loc[dflocal.yob1.isnull()])
0
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
0
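(The difference na_filter makes can be seen on a tiny inline example; this is a sketch with made-up data, not my real files:)

import io
import pandas as pd

csv = "surname1,yob1\nxx,1950\nyy,\n"

# Default behaviour: the empty field is read as NaN
print(pd.read_csv(io.StringIO(csv)).yob1.isnull().sum())   # 1

# With na_filter=False the empty field comes through as the string ''
df = pd.read_csv(io.StringIO(csv), na_filter=False)
print(df.yob1.isnull().sum())    # 0
print(repr(df.yob1.iloc[1]))     # ''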

When I import these datasets into my database and run them through the same SQL code:

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2
""")

HOORAY I GET THE SAME OUTPUT AS THE SAS CODE!

But why does the first solution not work? I'm doing the same thing in both cases (the first doing it manually with fillna, and the second using na_filter=False).

Upvotes: 4

Views: 934

Answers (2)

Tom

Reputation: 51621

Sounds like you are being hit by the way that SQLite3 (and most DBMSs) handle null values. In SAS you can compare null values to actual values, but in most DBMS systems you cannot. Thus in SAS, complementary logical comparisons like (A=B) and (A ne B) will always yield one as true and the other as false. But in a DBMS, when either A or B (or both) is NULL, neither (A=B) nor (A ne B) evaluates to true. A NULL value is neither less than nor greater than another value.

In SAS, if both values are NULL then they are equal, and if one is NULL and the other is not then they are not equal. NULL numeric values are less than any actual number. NULL character values do not exist as such and are just treated as a blank-filled value. Note that SAS also ignores trailing blanks when comparing character variables.

What this means in practice is that you need to add extra code to handle the NULL values when querying a DBMS.

ON (a.surname1 = b.surname2 or (a.surname1 is null and b.surname2 is null))
AND (a.yob1 = b.yob2 or (a.yob1 is null and b.yob2 is null))
AND (a.cob1 = b.cob2 or (a.cob1 is null and b.cob2 is null))
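Dropped into your sqlite3 code, that would look something like this (a sketch reusing your table and column names):

cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON  (a.surname1 = b.surname2 OR (a.surname1 IS NULL AND b.surname2 IS NULL))
    AND (a.yob1     = b.yob2     OR (a.yob1     IS NULL AND b.yob2     IS NULL))
    AND (a.cob1     = b.cob2     OR (a.cob1     IS NULL AND b.cob2     IS NULL))
""")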

Upvotes: 2

user2877959

Reputation: 1792

In SAS, there isn't really a concept of null values for character variables; a missing character value is more of an empty string. However, in most SQL implementations (including SQLite, I assume), a null value and an empty string are different things.

A blank value in SAS is indeed evaluated as "" = "", which is true.

In your average DBMS, however, what you would call 'blank values' are often null values, not empty strings (""). And null = null is not true: you cannot compare null values with anything, including other null values.

What you could do is change your SQLite query to:

CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND coalesce(a.yob1,'') = coalesce(b.yob2,'')
    AND a.cob1 = b.cob2

The coalesce function will replace yob with an empty string when yob is null.
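For example, on a throwaway in-memory connection:

import sqlite3

demo = sqlite3.connect(":memory:")
# coalesce returns its first non-null argument, so a NULL yob becomes ''
print(demo.execute("SELECT coalesce(NULL, ''), coalesce('1950', '')").fetchone())  # ('', '1950')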

Be aware, however, that if yob1 is null and yob2 actually is an empty string, adding those coalesce functions will turn what would have been a null = '' condition (not true) into a '' = '' condition (true). If that is not what you want, you could instead write it like this:

CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND (a.yob1 = b.yob2
        OR (a.yob1 is null AND b.yob2 is null)
        )
    AND a.cob1 = b.cob2

Upvotes: 3
