Mahmoud Al-Haroon
Mahmoud Al-Haroon

Reputation: 2449

How to figure the missing values in dataframe based on other data frame

I'd like to find values that are in a dataframe, and not in another one. For example I have a dataframe like this:

|---------------------|
|      Col1           |
|---------------------|
|         1111        |
|---------------------|
|         2222        |
|---------------------|
|         3333        |
|---------------------|
|         4444        |
|---------------------|
|         5555        |
|---------------------|
|         6666        |
|---------------------|
|         7777        |
|---------------------|

and another dataframe like this:

|---------------------|
|      Col2           |
|---------------------|
|         1111        |
|---------------------|
|         2222        |
|---------------------|
|         3333        |
|---------------------|
|         4444        |
|---------------------|

I want a result something like this

|---------------------|
|      missing values |
|---------------------|
|         5555        |
|---------------------|
|         6666        |
|---------------------|
|         7777        |
|---------------------|

Edited

esn_datafeed_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_datafeed]', engine)
esn_inter_intra_merge_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_inter_intra_merge]', engine)


merged = esn_datafeed_df.merge(esn_inter_intra_merge_df, how='left', indicator=True)
merged.query("_merge == 'left_only'")[["st_umts_df_relation_key"]]

Error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 16, in <module>
    merged = esn_datafeed_df.merge(esn_inter_intra_merge_df, how='left', indicator=True)
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 83, in merge
    return op.get_result()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 642, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 859, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 837, in _get_join_indexers
    return _get_join_indexers(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1320, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/_libs/join.pyx", line 97, in pandas._libs.join.left_outer_join
MemoryError: Unable to allocate 298. MiB for an array with shape (39003243,) and data type int64

How can I solve that? Hope someone to help me to solve this

Upvotes: 0

Views: 96

Answers (1)

arnaud
arnaud

Reputation: 3483

You're looking for a LEFT JOIN. You can do that using function merge in Pandas. Using indicator=True lets you see which values are only in one dataframe (as identified by indicator variable 'left_only') which is what you were looking for.

>>> df1 = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["my_column"])
>>> df2 = pd.DataFrame([1, 2, 3], columns=["my_column"])
>>> merged = df1.merge(df2, how="left", indicator=True)
>>> merged.query("_merge == 'left_only'")[["my_column"]]

    my_column
3   4
4   5
5   6 

Upvotes: 2

Related Questions