Reputation: 572
Suppose a DataFrame has the form:
  column1 column2  is_duplicate
0     xyz     XYZ             1
1     xyz     XyZ             1
2     abc     ABC             1
3     abc     aBc             1
How can I take the Cartesian product of column1 and column2 so that the newly created rows get 0 in the is_duplicate column while the original rows keep 1?
Expected output DataFrame:
  column1 column2  is_duplicate
0     xyz     XYZ             1
1     xyz     XyZ             1
2     xyz     ABC             0
3     xyz     aBc             0
4     abc     XYZ             0
5     abc     XyZ             0
6     abc     ABC             1
7     abc     aBc             1
Upvotes: 5
Views: 299
Reputation: 879661
You could use pd.MultiIndex.from_product to form the Cartesian product of the unique values in the two columns. Since the result is an index, you could pass it to df.reindex to expand the DataFrame so that it has one row for each value in the index:
import pandas as pd

df = pd.DataFrame({'column1': ['xyz', 'xyz', 'abc', 'abc'],
                   'column2': ['XYZ', 'XyZ', 'ABC', 'aBc'],
                   'is_duplicate': [1, 1, 1, 1]})
cols = ['column1', 'column2']
# Every combination of the unique values in column1 and column2
index = pd.MultiIndex.from_product([df[col].unique() for col in cols],
                                   names=cols)
# Pairs that are not already in df become new rows with is_duplicate = 0
result = df.set_index(cols).reindex(index, fill_value=0).reset_index()
print(result)
yields
  column1 column2  is_duplicate
0     xyz     XYZ             1
1     xyz     XyZ             1
2     xyz     ABC             0
3     xyz     aBc             0
4     abc     XYZ             0
5     abc     XyZ             0
6     abc     ABC             1
7     abc     aBc             1
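For comparison, here is a rough sketch of another way to get the same table, using a cross join instead of reindex. It is not part of the approach above; it assumes pandas 1.2 or later (for how='cross'), and the names left, right and pairs are made up for illustration. It builds all combinations of the unique values, then left-joins the original flags back on and fills the missing ones with 0.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['xyz', 'xyz', 'abc', 'abc'],
                   'column2': ['XYZ', 'XyZ', 'ABC', 'aBc'],
                   'is_duplicate': [1, 1, 1, 1]})

# Unique values of each column, kept as one-column DataFrames
left = df[['column1']].drop_duplicates()
right = df[['column2']].drop_duplicates()

# Cross join: one row per (column1, column2) combination
# (how='cross' is assumed to be available, i.e. pandas >= 1.2)
pairs = left.merge(right, how='cross')

# Bring back the original flags; pairs not present in df get NaN
result = pairs.merge(df, on=['column1', 'column2'], how='left')

# New rows get 0; cast back to int because NaN forced a float column
result['is_duplicate'] = result['is_duplicate'].fillna(0).astype(int)
print(result)
```

One possible reason to prefer this variant: as far as I know, reindex expects the (column1, column2) pairs in the original DataFrame to be unique, whereas the merge does not rely on an index at all. For data like the example, both should give the same rows.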
Upvotes: 4