Reputation: 6836
Assume I have two dataframes of this format (call them df1
and df2
):
+------------------------+------------------------+--------+
| user_id | business_id | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA | 4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA | 5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA | 3 |
+------------------------+------------------------+--------+
I'm looking to get a dataframe of all the rows that have a common user_id
in df1
and df2
. (ie. if a user_id
is in both df1
and df2
, include the two rows in the output dataframe)
I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_id
s in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge
but I don't think that's what I need.
Upvotes: 98
Views: 206320
Reputation: 502
You can do this for n
DataFrames and k
colums by using pd.Index.intersection
:
import pandas as pd
from functools import reduce
from typing import Union
def dataframe_intersection(
dataframes: list[pd.DataFrame], by: Union[list, str]
) -> list[pd.DataFrame]:
set_index = [d.set_index(by) for d in dataframes]
index_intersection = reduce(pd.Index.intersection, [d.index for d in set_index])
intersected = [df.loc[index_intersection].reset_index() for df in set_index]
return intersected
df1 = pd.DataFrame({"user_id":[1,2,3], "business_id": ['a', 'b', 'c'], "rating":[10, 15, 20]})
df2 = pd.DataFrame({"user_id":[3,4,5], "business_id": ['c', 'd', 'e'], "rating":[30, 35, 40]})
df3 = pd.DataFrame({"user_id":[3,3,3], "business_id": ['f', 'c', 'f'], "rating":[50, 70, 80]})
df_list = [df1, df2, df3]
This gives
>>> pd.concat(dataframe_intersection(df_list, by='user_id'))
user_id business_id rating
0 3 c 20
0 3 c 30
0 3 f 50
1 3 c 70
2 3 f 80
And
>>> pd.concat(dataframe_intersection(df_list, by=['user_id', 'business_id']))
user_id business_id rating
0 3 c 20
0 3 c 30
0 3 c 70
Upvotes: 2
Reputation: 1549
My understanding is that this question is better answered over in this post.
But briefly, the answer to the OP with this method is simply:
s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
Which gives s1
with 5 columns: user_id
and the other two columns from each of df1
and df2
.
Upvotes: 137
Reputation: 25662
If I understand you correctly, you can use a combination of Series.isin()
and DataFrame.append()
:
In [80]: df1
Out[80]:
rating user_id
0 2 0x21abL
1 1 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
5 2 0x21abL
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
9 1 0x21abL
In [81]: df2
Out[81]:
rating user_id
0 2 0x1d14L
1 1 0xdbdcad7
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
5 1 0x5734a81e2
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
9 4 0x5734a81e2
In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)
In [83]: ind
Out[83]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: user_id, dtype: bool
In [84]: df1[ind].append(df2[ind])
Out[84]:
rating user_id
0 2 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
0 2 0x1d14L
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
This is essentially the algorithm you described as "clunky", using idiomatic pandas
methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1
and df2
have no overlapping row indices, i.e., if
In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')
In fact, it won't give the expected output if their row indices are not equal.
Upvotes: 19
Reputation: 117380
In SQL, this problem could be solved by several methods:
select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)
or join and then unpivot (possible in SQL server)
select
df1.user_id,
c.rating
from df1
inner join df2 on df2.user_i = df1.user_id
outer apply (
select df1.rating union all
select df2.rating
) as c
Second one could be written in pandas with something like:
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df4 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
user_id rating
0 3 20
0 3 30
Upvotes: 5