K. K.

Reputation: 572

Cartesian product of columns of a DataFrame and setting newly created rows to 0 in Python

Suppose a DataFrame has the form:

   column1  column2  is_duplicate
0   xyz      XYZ         1
1   xyz      XyZ         1
2   abc      ABC         1
3   abc      aBc         1

How can I take the Cartesian product of column1 and column2 so that the newly created rows get 0 in the is_duplicate column, while the original rows keep 1?

Expected output DataFrame:

   column1  column2  is_duplicate
0   xyz      XYZ         1
1   xyz      XyZ         1
2   xyz      ABC         0
3   xyz      aBc         0
4   abc      XYZ         0
5   abc      XyZ         0
6   abc      ABC         1
7   abc      aBc         1

Upvotes: 5

Views: 299

Answers (1)

unutbu

Reputation: 879661

You could use pd.MultiIndex.from_product to form the Cartesian product of the unique values in each column. Since the result is an index, you can pass it to df.reindex to expand the DataFrame to one row per pair; fill_value=0 sets is_duplicate to 0 in the rows that did not exist before:

import pandas as pd

df = pd.DataFrame({'column1': ['xyz', 'xyz', 'abc', 'abc'],
                   'column2': ['XYZ', 'XyZ', 'ABC', 'aBc'],
                   'is_duplicate': [1, 1, 1, 1]})

# Build every (column1, column2) pair from the unique values of each column.
cols = ['column1', 'column2']
index = pd.MultiIndex.from_product([df[col].unique() for col in cols],
                                   names=cols)
# Reindex on the full product; pairs missing from df get is_duplicate = 0.
result = df.set_index(cols).reindex(index, fill_value=0).reset_index()
print(result)

yields

  column1 column2  is_duplicate
0     xyz     XYZ             1
1     xyz     XyZ             1
2     xyz     ABC             0
3     xyz     aBc             0
4     abc     XYZ             0
5     abc     XyZ             0
6     abc     ABC             1
7     abc     aBc             1
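
If it helps to see what reindex is working with, here is a small sketch that inspects the intermediate index for the same sample data. The names `indexed` and `missing` are mine, added only for illustration. It shows that the product holds 2 x 4 = 8 pairs and that 4 of them are absent from the original frame, which are exactly the rows that end up filled with 0.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['xyz', 'xyz', 'abc', 'abc'],
                   'column2': ['XYZ', 'XyZ', 'ABC', 'aBc'],
                   'is_duplicate': [1, 1, 1, 1]})

cols = ['column1', 'column2']
index = pd.MultiIndex.from_product([df[col].unique() for col in cols],
                                   names=cols)

# 2 unique values in column1 times 4 unique values in column2.
print(len(index))  # 8

# Pairs in the product that are not in the original DataFrame
# (`indexed` and `missing` are illustrative names, not part of the answer above).
indexed = df.set_index(cols)
missing = index.difference(indexed.index)
print(len(missing))  # 4

# After reindexing, only the original pairs keep is_duplicate == 1.
result = indexed.reindex(index, fill_value=0).reset_index()
print(result['is_duplicate'].sum())  # 4
```

Note that this approach assumes each (column1, column2) pair appears at most once in the original DataFrame, as it does in the sample data.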

Upvotes: 4
