Mahmoud Al-Haroon
Mahmoud Al-Haroon

Reputation: 2449

How to figure the missing values by comparing between two data frames

I just want to figure the missing values between two data frames so.... Here's the code I tried and works fine

import pandas as pd


df1 = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["my_column"])
df2 = pd.DataFrame([1, 2, 3], columns=["my_column"])

result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()


print(result)

Output:

   my_column
3          4
4          5
5          6

So it works fine on a static dataframe....

But I figure a problem when I use this code based on sql: So here's my full code:

import pyodbc
import pandas as pd
import os
import sqlalchemy as db
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Date, Float
import datetime as dt

# connect db
engine = create_engine('mssql+pyodbc://xxxxxxxxxx\SMARTRNO_EXPRESS/myDB?driver=SQL+Server+Native+Client+11.0')
connection = engine.connect()


esn_datafeed_query = 'SELECT * FROM [myDB].[dbo].[esn_datafeed]'
esn_inter_intra_query = 'SELECT * FROM [esn_inter_intra_merge]'

esn_datafeed_df = pd.read_sql(esn_datafeed_query ,engine)
esn_inter_intra_merge_df = (esn_inter_intra_query, engine)

df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])

result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()


print(result)

So the previous code Is shows all the values, I don't need this... I just want to show missing values only.... I tried it with different way as the below code:

esn_datafeed_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_datafeed]', engine)
esn_inter_intra_merge_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_inter_intra_merge]', engine)

df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])

merged = df1.merge(df2 , how="left", indicator=True)
result = merged.query("_merge == 'left_only'")[["st_umts_df_relation_key"]]

print(result)

but I got this error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 18, in <module>
    merged = df1.merge(df2 , how="left", indicator=True)
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
    op = _MergeOperation(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
    self._validate_specification()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Edited

I tried also this code:

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

but I find this error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 23, in <module>
    df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
    op = _MergeOperation(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
    self._validate_specification()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

So to explain in brief related to my database I have two tables

esn_datafeed esn_datafeed

and this is the second table esn_inter_intra_merge

st_umts_esn_inter_intra_relation_key

So now I want to figure the difference between the two tables which I need to figure the values in esn_datafeed.st_umts_df_relation_key which is not in esn_inter_intra_merge.st_umts_esn_inter_intra_relation_key

So anyone have any idea how to solve this..... May be due to the large data in the database?

Is there's a way to do with a query so it will be okay...

Upvotes: 2

Views: 5097

Answers (1)

Alexander
Alexander

Reputation: 109686

I think the issue is that your new dataframes use different names for the columns. However, it sounds like you should be using sets anyway. Here is how to get the symmetric difference between values between two columns.

missing_values = set(df1.iloc[:, 0]).symmetric_difference(set(df2.iloc[:, 0]))
>>> missing_values
{4, 5, 6}

Then you can check if the dataframe values are in these missing values.

>>> df1[df1.iloc[:, 0].isin(missing_values)]
   my_column
3          4
4          5
5          6

EDIT

Upon further reflection, isn't this simply a SQL question that has nothing to do with pandas?

Does something like this work? This SQL query selects all records from t1 (esn_datafeed) where there are no corresponding values of st_umts_df_relation_key in the st_umts_esn_inter_intra_relation_key column of t2 (esn_inter_intra_merge).

SELECT * 
FROM esn_datafeed AS t1
LEFT JOIN esn_inter_intra_merge AS t2
ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL

Upvotes: 1

Related Questions