Divyansh Raj
Divyansh Raj

Reputation: 37

Merge two dataframes based on column containing set

i have two dataframes and one of the columns in both contains a set . now if both sets are equal i want to merge the dataframes based on that column.

i tried using df.merge as

les=df_report_2.merge(df_report_1,how='inner',on='saltids')

df_report_1 and df_report_2 are the dataframe and saltids is thecolumn which consists of a set datatype.

i get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-c8ca93f2c1e8> in <module>()
----> 1 les=df_report_2.merge(df_report_1,how='inner',on='saltids')

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   6866                      right_on=right_on, left_index=left_index,
   6867                      right_index=right_index, sort=sort, suffixes=suffixes,
-> 6868                      copy=copy, indicator=indicator, validate=validate)
   6869 
   6870     def round(self, decimals=0, *args, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     46                          copy=copy, indicator=indicator,
     47                          validate=validate)
---> 48     return op.get_result()
     49 
     50 

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in get_result(self)
    544                 self.left, self.right)
    545 
--> 546         join_index, left_indexer, right_indexer = self._get_join_info()
    547 
    548         ldata, rdata = self.left._data, self.right._data

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in _get_join_info(self)
    754         else:
    755             (left_indexer,
--> 756              right_indexer) = self._get_join_indexers()
    757 
    758             if self.right_index:

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in _get_join_indexers(self)
    733                                   self.right_join_keys,
    734                                   sort=self.sort,
--> 735                                   how=self.how)
    736 
    737     def _get_join_info(self):

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
   1128 
   1129     # get left & right join labels and num. of levels at each location
-> 1130     llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
   1131 
   1132     # get flat i8 keys from label lists

TypeError: type object argument after * must be an iterable, not itertools.imap

Upvotes: 1

Views: 284

Answers (1)

mukulgarg94
mukulgarg94

Reputation: 51

As a workaround you can probably join on hashed values of the columns.

Since sets cannot be hashed, you need to convert to frozen set and then calculate hash.

Code would look like

df_report_1['saltids_hashed'] = df_report_1['saltids'].map(lambda x: hash(frozenset(x)))

df_report_2['saltids_hashed'] = df_report_2['saltids'].map(lambda x: hash(frozenset(x)))

les=df_report_2.merge(df_report_1,how='inner',on='saltids_hashed')

You can obviously drop the hashed column in the final output.

Upvotes: 1

Related Questions