wkamer

Reputation: 75

How to merge dataframes to every row in a second dataframe?

The arrays of data will come from another source, so in this example I have declared them as lists with minimal data. The real lists will produce far more combinations, well over a million: the total is len(arr1) * len(arr2) * len(arr3), and so on.

import pandas as pd

arr1 = [1, 2]
arr2 = [10, 20]
arr3 = [0.1, 0.2]

df1 = pd.DataFrame(arr1, columns=["col1"])
df2 = pd.DataFrame(arr2, columns=["col2"])
df3 = pd.DataFrame(arr3, columns=["col3"])

The final DataFrame should contain all possible combinations of the given arrays:

col1 col2 col3
1    10   0.1
1    10   0.2
1    20   0.1
1    20   0.2
2    10   0.1
2    10   0.2
2    20   0.1
2    20   0.2

Upvotes: 2

Views: 493

Answers (4)

Soudipta Dutta

Reputation: 2122

import pandas as pd
from sklearn.model_selection import ParameterGrid


arr1 = [1, 2]
arr2 = [10, 20]
arr3 = [0.1, 0.2]

# Create a dictionary for the parameter grid
param_grid = {
    'col1': arr1,
    'col2': arr2,
    'col3': arr3
}

# Generate all combinations of the given arrays using ParameterGrid
combinations = list(ParameterGrid(param_grid))

# Convert the combinations to a DataFrame
df_combinations = pd.DataFrame(combinations)

print(df_combinations)
   col1  col2  col3
0     1    10   0.1
1     1    10   0.2
2     1    20   0.1
3     1    20   0.2
4     2    10   0.1
5     2    10   0.2
6     2    20   0.1
7     2    20   0.2
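
One thing to watch with this approach: as far as I know, ParameterGrid iterates over its keys in sorted order, so the columns come out alphabetically rather than in the order you wrote them. If that matters, or you would rather not depend on scikit-learn, a minimal sketch with itertools.product on the same lists could look like this (the names `columns` and `df_product` are only for illustration):

```python
import itertools

import pandas as pd

arr1 = [1, 2]
arr2 = [10, 20]
arr3 = [0.1, 0.2]

# Column names, in the same order the lists are passed to product()
columns = ["col1", "col2", "col3"]

# product() yields one tuple per combination, with the last list varying fastest
df_product = pd.DataFrame(
    list(itertools.product(arr1, arr2, arr3)),
    columns=columns,
)

print(df_product)
```

Building the full list first keeps the sketch simple, but with millions of combinations it holds every tuple in memory before pandas copies them.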

Upvotes: 0

sammywemmy

Reputation: 28644

expand_grid from pyjanitor is a fast implementation of the Cartesian product; it uses np.meshgrid under the hood.

# pip install pyjanitor
import pandas as pd
import janitor as jn

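# expand_grid requires a dictionary
# (df1, df2 and df3 are the DataFrames from the question)
others = {"df1": df1, "df2": df2, "df3": df3}

# drop the second column level so only col1, col2, col3 remain
jn.expand_grid(others=others).droplevel(1, axis=1)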

   col1  col2  col3
0     1    10   0.1
1     1    10   0.2
2     1    20   0.1
3     1    20   0.2
4     2    10   0.1
5     2    10   0.2
6     2    20   0.1
7     2    20   0.2

expand_grid can also compute the Cartesian product of a DataFrame and a Series, and even of non-pandas objects. Its end product, though, is always a DataFrame.
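
To give a feel for the np.meshgrid idea without installing pyjanitor, here is a rough sketch on the question's lists. This is not pyjanitor's actual implementation, only the underlying technique; `names` and `df_grid` are made-up names for the example.

```python
import numpy as np
import pandas as pd

arr1 = [1, 2]
arr2 = [10, 20]
arr3 = [0.1, 0.2]
names = ["col1", "col2", "col3"]

# indexing="ij" keeps the first list as the slowest-varying axis,
# so flattening each grid gives the row order shown in the question
grids = np.meshgrid(arr1, arr2, arr3, indexing="ij")

# each grid holds one column of the product; ravel() flattens it to 1-D
df_grid = pd.DataFrame({name: grid.ravel() for name, grid in zip(names, grids)})

print(df_grid)
```

Each input is converted to its own array, so col1 and col2 stay integers while col3 is float.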

Upvotes: 0

piRSquared

Reputation: 294258

functools.reduce and pd.merge

Kind of slow.

import pandas as pd
from functools import reduce

reduce(
    pd.merge,
    [d.assign(dummy=1)
     for d in [df1, df2, df3]
    ]).drop('dummy', axis=1)

   col1  col2  col3
0     1    10   0.1
1     1    10   0.2
2     1    20   0.1
3     1    20   0.2
4     2    10   0.1
5     2    10   0.2
6     2    20   0.1
7     2    20   0.2
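
If your pandas is 1.2 or newer, the dummy column should not be needed, because merge accepts how="cross". A sketch under that assumption (I have not timed it against the version above, and `df_cross` is just an illustrative name):

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame([1, 2], columns=["col1"])
df2 = pd.DataFrame([10, 20], columns=["col2"])
df3 = pd.DataFrame([0.1, 0.2], columns=["col3"])

# how="cross" pairs every row of the left frame with every row of the right
df_cross = reduce(
    lambda left, right: left.merge(right, how="cross"),
    [df1, df2, df3],
)

print(df_cross)
```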

itertools.product and pd.DataFrame.itertuples

Definitely faster

import pandas as pd
from itertools import product

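def tupify(d):
    # iterate over the rows as plain tuples, without the index
    return d.itertuples(index=False, name=None)

def sumtup(t):
    # concatenate a tuple of row tuples into one flat tuple
    return sum(t, start=())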

pd.DataFrame(
    list(map(sumtup, product(*map(tupify, [df1, df2, df3])))),
    columns = sum(map(list, [df1, df2, df3]), start=[])
)

   col1  col2  col3
0     1    10   0.1
1     1    10   0.2
2     1    20   0.1
3     1    20   0.2
4     2    10   0.1
5     2    10   0.2
6     2    20   0.1
7     2    20   0.2

Upvotes: 2

Rob Raymond

Reputation: 31166

pandas.MultiIndex.from_product (https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_product.html) effectively does what you want. It is then simple to convert the result to a DataFrame.

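import pandas as pd

arr1 = [1, 2]
arr2 = [10, 20]
arr3 = [0.1, 0.2]

# build the Cartesian product as an index, then move its levels into columns
idx = pd.MultiIndex.from_product([arr1, arr2, arr3], names=["col1", "col2", "col3"])
pd.DataFrame(index=idx).reset_index()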
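
A possible variation, in case it reads more clearly: MultiIndex.to_frame(index=False) turns the levels into columns in one step. The sketch below also checks that the row count matches len(arr1) * len(arr2) * len(arr3), which is the size the question is concerned about. `product_index` and `df_frame` are names made up for the example.

```python
import pandas as pd

arr1 = [1, 2]
arr2 = [10, 20]
arr3 = [0.1, 0.2]

product_index = pd.MultiIndex.from_product(
    [arr1, arr2, arr3], names=["col1", "col2", "col3"]
)

# index=False gives a fresh 0..n-1 index instead of reusing the MultiIndex
df_frame = product_index.to_frame(index=False)

# one row per combination
expected_rows = len(arr1) * len(arr2) * len(arr3)
print(len(df_frame) == expected_rows)
print(df_frame)
```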

Upvotes: 3
