Reputation: 4132
I have two pandas dataframes:
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})
What is the best practice to get their cartesian product (without writing it out explicitly, as I do below)?
#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})
Upvotes: 184
Views: 210901
Reputation: 10881
Use pd.MultiIndex.from_product as an index in an otherwise empty dataframe, then reset its index, and you're done.
import pandas as pd
a = [1, 2, 3]
b = ["a", "b", "c"]
index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])
pd.DataFrame(index = index).reset_index()
out:
a b
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
6 3 a
7 3 b
8 3 c
Upvotes: 112
Reputation: 2112
If you want to take the cross product of two Series or DataFrames in such a way that the result is indexed correctly by the cross product of their two respective indices, that can be done like this:
def indexed_cross_product(df1, df2):
    assert df1.index.name is not None
    assert df2.index.name is not None
    assert df1.index.name != df2.index.name
    vals = df1.reset_index().merge(df2.reset_index(), how="cross")
    return vals.set_index([df1.index.name, df2.index.name], drop=True)
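A small usage sketch of the function above may make the index requirements clearer. The frames and the index names left_id and right_id are hypothetical, and how="cross" assumes pandas >= 1.2:

```python
import pandas as pd

def indexed_cross_product(df1, df2):
    # Same logic as the function above; both indices must be named
    # and the two names must differ.
    assert df1.index.name is not None
    assert df2.index.name is not None
    assert df1.index.name != df2.index.name
    vals = df1.reset_index().merge(df2.reset_index(), how="cross")
    return vals.set_index([df1.index.name, df2.index.name], drop=True)

# Hypothetical inputs with named indices.
left = pd.DataFrame({"col1": [1, 2]}, index=pd.Index(["x", "y"], name="left_id"))
right = pd.DataFrame({"col3": [5, 6]}, index=pd.Index(["p", "q"], name="right_id"))

result = indexed_cross_product(left, right)

# Rows can now be looked up by a (left_id, right_id) pair.
row = result.loc[("y", "p")]
```

The result carries a two-level MultiIndex, so each row is addressable by the pair of original index labels.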
Upvotes: 0
Reputation: 28649
You could use expand_grid from pyjanitor to replicate a cross join; it offers some speed gains for larger datasets (it uses np.meshgrid underneath):
pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
jn.expand_grid(others = {"df1":df1, "df2":df2})
df1 df2
col1 col2 col3
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6
Upvotes: 0
Reputation: 15211
Yet another workaround for the current version of Pandas (1.1.5): this one is particularly useful if you're starting off with a non-dataframe sequence. I haven't timed it. It does not require any artificial index manipulation, but does require that you repeat the second sequence. It relies on a special property of explode, namely that the right-hand index is repeated.
from pandas import DataFrame, Series
df1 = DataFrame({'col1': [1,2], 'col2': [3,4]})
series2 = Series(
[[5, 6]]*len(df1),
name='col3',
index=df1.index,
)
df_cartesian = df1.join(series2.explode())
This outputs
col1 col2 col3
0 1 3 5
0 1 3 6
1 2 4 5
1 2 4 6
Upvotes: 0
Reputation: 20467
In recent versions of Pandas (>= 1.2) this is built into merge, so you can do:
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})
df1.merge(df2, how='cross')
This is equivalent to the previous pandas < 1.2 answer but is easier to read.
For pandas < 1.2:
If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).
from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})
merge(df1, df2,on='key')[['col1', 'col2', 'col3']]
Output:
col1 col2 col3
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6
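As a quick check of the equivalence claimed above, the two approaches can be compared directly. This is a sketch assuming pandas >= 1.2; it builds the constant key with assign instead of hard-coding it into the frames:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df2 = pd.DataFrame({"col3": [5, 6]})

# pandas >= 1.2: built-in cross join.
cross = df1.merge(df2, how="cross")

# Older approach: merge on a temporary constant key, then drop it.
keyed = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop(columns="key")
)

# True when both frames hold the same rows and columns in the same order.
same = cross.equals(keyed)
```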
See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
Upvotes: 197
Reputation: 402253
Presenting to you left.merge(right, how='cross'):
import pandas as pd
pd.__version__
# '1.2.0'
left = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
right = pd.DataFrame({'col3': [5, 6]})
left.merge(right, how='cross')
col1 col2 col3
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6
Indexes are ignored in the result.
Implementation-wise, this uses the join-on-a-common-key-column method described in the accepted answer. The upside of using the API is that it saves you a lot of typing and handles some corner cases pretty well. I'd almost always recommend this syntax as my first choice for a cartesian product in pandas, unless you're looking for something more performant.
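For the "something more performant" case, one option is to build the row positions with NumPy and skip the merge machinery entirely. The following is only a sketch: cross_join_numpy is a hypothetical helper name, it assumes the two frames share no column names, and I have not benchmarked it against merge.

```python
import numpy as np
import pandas as pd

def cross_join_numpy(left, right):
    """Cross join via repeated row positions (hypothetical helper, not a pandas API)."""
    # Each left row position is repeated len(right) times: 0, 0, 1, 1, ...
    left_pos = np.repeat(np.arange(len(left)), len(right))
    # The right row positions are cycled len(left) times: 0, 1, 0, 1, ...
    right_pos = np.tile(np.arange(len(right)), len(left))
    return pd.concat(
        [
            left.iloc[left_pos].reset_index(drop=True),
            right.iloc[right_pos].reset_index(drop=True),
        ],
        axis=1,
    )

left = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
right = pd.DataFrame({"col3": [5, 6]})
result = cross_join_numpy(left, right)
```

As with how='cross', the original indexes are discarded and the rows come out in left-major order.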
Upvotes: 19
Reputation: 43612
Here is a helper function to perform a simple Cartesian product with two data frames. The internal logic handles using an internal key, and avoids mangling any columns that happen to be named "key" from either side.
import pandas as pd
def cartesian(df1, df2):
    """Determine Cartesian product of two data frames."""
    key = 'key'
    while key in df1.columns or key in df2.columns:
        key = '_' + key
    key_d = {key: 0}
    return pd.merge(
        df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)
# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)
shows:
number key digit
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6
Upvotes: 2
Reputation: 15773
You could start by taking the Cartesian product of df1.col1 and df2.col3, then merge back to df1 to get col2.
Here's a general Cartesian product function which takes a dictionary of lists:
def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()
Apply as:
res = cartesian_product({'col1': df1.col1, 'col3': df2.col3})
pd.merge(res, df1, on='col1')
# col1 col3 col2
# 0 1 5 3
# 1 1 6 3
# 2 2 5 4
# 3 2 6 4
Upvotes: 1
Reputation: 23980
With method chaining:
product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)
Upvotes: 25
Reputation: 7893
Minimal code needed for this one. Create a common 'key' to cartesian merge the two:
df1['key'] = 0
df2['key'] = 0
df_cartesian = df1.merge(df2, how='outer')
Upvotes: 46
Reputation: 1057
I find pandas MultiIndex to be the best tool for the job. If you have a list of lists, lists_list, call pd.MultiIndex.from_product(lists_list) and iterate over the result (or use it as a DataFrame index).
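A minimal sketch of what that looks like; the contents of lists_list and the level names num and letter are made up for illustration:

```python
import pandas as pd

# Hypothetical input: one inner list per dimension of the product.
lists_list = [[1, 2], ["a", "b", "c"]]
names = ["num", "letter"]  # hypothetical level names

index = pd.MultiIndex.from_product(lists_list, names=names)

# Iterating over a MultiIndex yields one tuple per combination.
tuples = list(index)

# Or turn the combinations into ordinary columns.
frame = index.to_frame(index=False)
```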
Upvotes: -1
Reputation: 528
This won't win a code golf competition, and borrows from the previous answers - but clearly shows how the key is added, and how the join works. This creates 2 new data frames from lists, then adds the key to do the cartesian product on.
My use case was that I needed a list of all store IDs for each week in my list. So, I created a list of all the weeks I wanted to have, then a list of all the store IDs I wanted to map them against.
I chose a left merge, but it would be semantically the same as an inner merge in this setup. You can see this in the documentation on merging, which states that it does a Cartesian product if a key combination appears more than once in both tables, which is what we set up.
days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
days_and_stores.drop(columns='key', inplace=True)
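To make the use case concrete, here is a self-contained sketch with made-up values for list_of_days and list_of_stores (both are assumptions, since the original lists are not shown). It uses assign so the input frames are not modified:

```python
import pandas as pd

# Hypothetical inputs; substitute your own dates and store IDs.
list_of_days = pd.date_range("2024-01-01", periods=3, freq="D")
list_of_stores = [101, 102]

days = pd.DataFrame({"date": list_of_days})
stores = pd.DataFrame({"store_id": list_of_stores})

# Add the constant key on the fly, merge on it, then drop it again.
days_and_stores = (
    days.assign(key=0)
    .merge(stores.assign(key=0), how="left", on="key")
    .drop(columns="key")
)
```

Every date ends up paired with every store ID, so three days and two stores give six rows.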
Upvotes: 38
Reputation: 7170
As an alternative, one can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:
import numpy as np
import pandas as pd
import itertools
def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())
    # Series.append was removed in pandas 2.0; pd.concat joins the two rows instead
    df = pd.DataFrame(pd.concat([left, right]) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)
Quick test:
In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])
In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])
In [48]: cartesian(a,b)
Out[48]:
a b c d e f
0 0.436480 0.068491 0.260292 0.991311 0.064167 0.715142
1 0.436480 0.068491 0.260292 0.101777 0.840464 0.760616
2 0.436480 0.068491 0.260292 0.655391 0.289537 0.391893
3 0.436480 0.068491 0.260292 0.383729 0.061811 0.773627
4 0.436480 0.068491 0.260292 0.575711 0.995151 0.804567
5 0.469578 0.052932 0.633394 0.991311 0.064167 0.715142
6 0.469578 0.052932 0.633394 0.101777 0.840464 0.760616
7 0.469578 0.052932 0.633394 0.655391 0.289537 0.391893
8 0.469578 0.052932 0.633394 0.383729 0.061811 0.773627
9 0.469578 0.052932 0.633394 0.575711 0.995151 0.804567
10 0.466813 0.224062 0.218994 0.991311 0.064167 0.715142
11 0.466813 0.224062 0.218994 0.101777 0.840464 0.760616
12 0.466813 0.224062 0.218994 0.655391 0.289537 0.391893
13 0.466813 0.224062 0.218994 0.383729 0.061811 0.773627
14 0.466813 0.224062 0.218994 0.575711 0.995151 0.804567
15 0.831365 0.273890 0.130410 0.991311 0.064167 0.715142
16 0.831365 0.273890 0.130410 0.101777 0.840464 0.760616
17 0.831365 0.273890 0.130410 0.655391 0.289537 0.391893
18 0.831365 0.273890 0.130410 0.383729 0.061811 0.773627
19 0.831365 0.273890 0.130410 0.575711 0.995151 0.804567
20 0.447640 0.848283 0.627224 0.991311 0.064167 0.715142
21 0.447640 0.848283 0.627224 0.101777 0.840464 0.760616
22 0.447640 0.848283 0.627224 0.655391 0.289537 0.391893
23 0.447640 0.848283 0.627224 0.383729 0.061811 0.773627
24 0.447640 0.848283 0.627224 0.575711 0.995151 0.804567
Upvotes: 16
Reputation: 37
If you have no overlapping columns, don't want to add one, and the indices of the data frames can be discarded, this may be easier:
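# Index objects are immutable, so assign a whole new index rather than using slice assignment
df1.index = [0] * len(df1)
df2.index = [0] * len(df2)
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index = range(len(df_cartesian))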
Upvotes: 2