Pilik
Pilik

Reputation: 709

Combining two datasets to form a boolean column (pandas)

I have two DataFrames in pandas:

dfm_one

    data    group_a group_b
0   3   a   z
1   1   a   z
2   2   b   x
3   0   b   x
4   0   b   x
5   1   b   z
6   0   c   x
7   0   c   y
8   3   c   z
9   3   c   z

dfm_two

    data    group_a group_b
0   4   a   x
1   4   a   y
2   4   b   x
3   4   b   x
4   4   b   y
5   1   b   y
6   1   b   z
7   1   c   x
8   4   c   y
9   3   c   z
10  2   c   z

As output I want a boolean column that indicates for dfm_one whether there is a matching data entry (i.e. has the same vale) in dfm_two for each group_a group_b combination.

So my expected output is:

0   False
1   False
2   False
3   False
4   False
5   True
6   False
7   False
8   True
9   True

I'm guessing the code should look something like:

dfm_one.groupby(['group_a','group_b']).apply(lambda x: ??)

and that the function inside apply should make use of the isin method.

Another solution might be to merge the two datasets but I think this is not trivial since there is no unique identifier in the DataFrame.

Upvotes: 1

Views: 83

Answers (1)

EdChum
EdChum

Reputation: 394099

OK this is a slight hack, if we cast the df to str dtype then we can call sum to concatenate the rows into a string, we can use the resultant string as a kind of unique identifier and then call isin on the other df, again converting to a str:

In [91]:
dfm_one.astype(str).sum(axis=1).isin(dfm_two.astype(str).sum(axis=1))

Out[91]:
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8     True
9     True
dtype: bool

Output from the conversions:

In [92]:
dfm_one.astype(str).sum(axis=1)

Out[92]:
0    3az
1    1az
2    2bx
3    0bx
4    0bx
5    1bz
6    0cx
7    0cy
8    3cz
9    3cz
dtype: object

In [93]:   
dfm_two.astype(str).sum(axis=1)

Out[93]:
0     4ax
1     4ay
2     4bx
3     4bx
4     4by
5     1by
6     1bz
7     1cx
8     4cy
9     3cz
10    2cz
dtype: object

Upvotes: 1

Related Questions