Reputation: 2449
I'd like to find values that are in a dataframe, and not in another one. For example I have a dataframe like this:
|---------------------|
| Col1 |
|---------------------|
| 1111 |
|---------------------|
| 2222 |
|---------------------|
| 3333 |
|---------------------|
| 4444 |
|---------------------|
| 5555 |
|---------------------|
| 6666 |
|---------------------|
| 7777 |
|---------------------|
and another dataframe like this:
|---------------------|
| Col2 |
|---------------------|
| 1111 |
|---------------------|
| 2222 |
|---------------------|
| 3333 |
|---------------------|
| 4444 |
|---------------------|
I want a result something like this
|---------------------|
| missing values |
|---------------------|
| 5555 |
|---------------------|
| 6666 |
|---------------------|
| 7777 |
|---------------------|
esn_datafeed_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_datafeed]', engine)
esn_inter_intra_merge_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_inter_intra_merge]', engine)
merged = esn_datafeed_df.merge(esn_inter_intra_merge_df, how='left', indicator=True)
merged.query("_merge == 'left_only'")[["st_umts_df_relation_key"]]
Error:
Traceback (most recent call last):
File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 16, in <module>
merged = esn_datafeed_df.merge(esn_inter_intra_merge_df, how='left', indicator=True)
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
return merge(
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 83, in merge
return op.get_result()
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 642, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 859, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 837, in _get_join_indexers
return _get_join_indexers(
File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1320, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/_libs/join.pyx", line 97, in pandas._libs.join.left_outer_join
MemoryError: Unable to allocate 298. MiB for an array with shape (39003243,) and data type int64
How can I solve that? Hope someone to help me to solve this
Upvotes: 0
Views: 96
Reputation: 3483
You're looking for a LEFT JOIN. You can do that using function merge
in Pandas. Using indicator=True
lets you see which values are only in one dataframe (as identified by indicator variable 'left_only'
) which is what you were looking for.
>>> df1 = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["my_column"])
>>> df2 = pd.DataFrame([1, 2, 3], columns=["my_column"])
>>> merged = df1.merge(df2, how="left", indicator=True)
>>> merged.query("_merge == 'left_only'")[["my_column"]]
my_column
3 4
4 5
5 6
Upvotes: 2