Reputation: 3852
I am comparing two datasets to look for duplicate entries on certain columns.
I have done this first in SAS using the PROC SQL command as below(what I consider the true outcome) using the following query:
proc sql;
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2;
quit;
I output this result to csv giving output_sas.csv
I have also done this in Python using SQLite3 using the same query:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
I output this to csv giving output_python.csv
.
The outputs should be the same but they are not:
output_sas.csv
contains 123 more records than output_python.csv
.
Within the SAS output file, there are 123 records that contain blank space ""
within the yob1
and yob2
columns i.e. as an example, The 123 records in the sas_data.csv
look like this sample:
yob1 yob2 cob1 cob2 surname1 surname2
"" "" 1 1 xx xx
"" "" 2 2 yy yy
.
.
.
# Continues for 123 records
I find that this difference is due to the yob1
and yob2
columns, which, in the above 123 records contains blank space. These 123 record pairs are missing from the output_python.csv
file.
[Note: In this work, a string of length zero corresponds to a missing value]
The PROC SQL routine in SAS is evaluating blank space as equal i.e. "" == "" -> TRUE
.
The Python SQLite code appears to be doing opposite i.e. "" == "" ->
FALSE
This is happening even though "" == "" -> True
in Python.
Why is this the case and what do I need to change to match up the SQLite output to the PROC SQL output?
Note: Both routines are using the same input datasets. They are entirely equal, and I even manually amend the Python code to ensure that the columns yob1
and yob2
contain ""
for missing values.
At the moment my SAS PROC SQL code works on uses data1.sas7bdat
, named local
and data2.sas7bdat
, named neighbor
.
To use the same dataset in Python, in SAS, I export these to csv and read in to Python.
If I do:
import pandas as pd
# read in
dflocal = pd.read_csv(csv_path_local, index_col=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False)
Pandas converts missing values to nan
. We can use isnull()
to find the number of nan values in each of the columns:
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
78
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
184
To solve the null value problem, I then explicitly convert nan
to a string of length zero ""
by running:
dflocal['yob1'].fillna(value="", axis=0, inplace=True)
dfneighbor['yob2'].fillna(value="", axis=0, inplace=True)
We can test if the values got updated by testing a known nan
:
dflocal.iloc[393].yob1
`""`
type(dflocal.iloc[393].yob1)
str
So they are a string of length 0.
Then read these into SQL via:
dflocal.to_sql('local', con=conn, flavor='sqlite', if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, flavor='sqlite', if_exists='replace', index=False)
Then execute the same SQLite3 code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
Even though I have made this explicit change I STILL get the same missing 123 values, even though, the null values have been changed to a string of length zero ""
.
Potential Solution:
However, if I instead import the dataset with the na_filter=False
argument , this does the conversion from null
to ""
for me.
dflocal = pd.read_csv(csv_path_local, index_col=False, na_filter=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False, na_filter=False")
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
0
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
0
When I import these datasets to my database and run this through the same SQL code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
HOORAY I GET THE SAME OUTPUT AS THE SAS CODE!
But Why does the first solution not work? I'm doing the same thing in both cases (the first doing it manually with fill_na
, and the second using na_filter=False
).
Upvotes: 4
Views: 934
Reputation: 51621
Sounds like you are being hit by the way that SQLite3 (and most DBMS) handle null values. In SAS you can compare null values to actual values but in most DBMS systems you cannot. Thus in SAS complementary logical comparisons like (A=B) and (A ne B) will always yield one as true and the other as false. But in a DBMS when either A or B or both is NULL then both (A=B) and (A ne B) will be FALSE. A NULL value is neither less than nor greater than another value.
In SAS if both values are NULL then they are equal and it one is NULL and the other is not then they are not equal. NULL numeric values are less than any actual number. NULL character variables do not exist and so are just treated as a blank filled value. Note that SAS also ignores trailing blanks when comparing character variables.
What this means in practice is that you need to add extra code to handle the NULL values when querying a DBMS.
ON (a.surname1 = b.surname2 or (a.surname1 is null and b.surname1 is null))
AND (a.yob1 = b.yob2 or (a.yob1 is null and b.yob2 is null))
AND (a.cob1 = b.cob2 or (a.cob1 is null and b.cob2 is null))
Upvotes: 2
Reputation: 1792
In SAS, there isn't really a concept of null values for characters. It is more of an empty string. However, in most SQL implementations (including SQlite, I assume), a null value and an empty string will be different.
A blank value in SAS is indeed evaluated as "" = ""
which is true
In your average DBMS however, what you would call 'blank values' are often null
values, not empty strings (""
). And null=null
is not true. You cannot compare null values with anything, including null values.
What you could do is change your SQlite to
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND coalesce(a.yob1,'') = coalesce(b.yob2,'')
AND a.cob1 = b.cob2
The coalesce function will replace yob
with an empty string when yob
is null.
Be aware however that, if yob1
is null and yob2
actually is an empty string, adding those coalesce functions will change what would have been a null=''
condition, which is not true, to a ''=''
which is true.
If that is not what you'd want, you could also just write it like this:
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND (a.yob1 = b.yob2
OR (a.yob1 is null AND a.yob2 is null)
)
AND a.cob1 = b.cob2
Upvotes: 3