Reputation: 8704
I have the following pd.DataFrame representing constraints of an optimisation problem.
FEATURE COLOR CLASS CONSTRAINTS
0 1.0 NaN NaN 0.20
1 3.0 NaN NaN 0.20
2 1.0 1.0 NaN 0.15
3 1.0 NaN b -0.05
4 1.0 1.0 a -0.07
5 1.0 1.0 b -0.10
6 3.0 1.0 NaN 0.10
7 NaN NaN NaN 0.20
Here FEATURE
represents a categorical variable with possible values [1,2,3]
, COLOR
represents a categorical variable with possible values [1,2]
and CLASS
is another categorical variable with possible values [a,b,c,d]
.
Missing values here have the meaning "all other values". In this sense the dataframe is a compressed version of a larger dataframe encompassing all or some of the combinations of the columns categories.
What I would like to do here is to "expand" the NaN values to all possible values each individual column can represent.
For example row 0 would expand to 8 total rows, being the product of "free" features, namely COLOR
with possible values [1,2]
and CLASS with possible values `[a,b,c,d]``
new FEATURE COLOR CLASS CONSTRAINTS
0 1 1 a 0.2
1 1 1 b 0.2
2 1 2 a 0.2
3 1 2 b 0.2
4 1 1 a 0.2
5 1 1 b 0.2
6 1 2 a 0.2
7 1 2 b 0.2
How can I efficiently perform this transformation in Pandas?
Upvotes: 1
Views: 487
Reputation: 18306
Here is one way:
mapping = {'FEATURE': [1, 2, 3], 'COLOR': [1, 2], 'CLASS': ['a', 'b', 'c', 'd']}
cols = mapping.keys()
for col in cols:
df[col] = df[col].apply(lambda x: mapping[col] if pd.isna(x) else x)
df = df.explode(col)
For each column of interest, we fill the NaNs with the possible values (note that fillna
doesn't fill with lists, hence the apply
). Then we explode
the so-filled column and repeat this for all three of the columns.
We get the df
at the end as:
FEATURE COLOR CLASS CONSTRAINTS
0 1 1 a 0.20
0 1 1 b 0.20
0 1 1 c 0.20
0 1 1 d 0.20
0 1 2 a 0.20
0 1 2 b 0.20
0 1 2 c 0.20
0 1 2 d 0.20
1 3 1 a 0.20
1 3 1 b 0.20
1 3 1 c 0.20
1 3 1 d 0.20
1 3 2 a 0.20
1 3 2 b 0.20
1 3 2 c 0.20
1 3 2 d 0.20
2 1 1 a 0.15
2 1 1 b 0.15
2 1 1 c 0.15
2 1 1 d 0.15
3 1 1 b -0.05
3 1 2 b -0.05
4 1 1 a -0.07
5 1 1 b -0.10
6 3 1 a 0.10
6 3 1 b 0.10
6 3 1 c 0.10
6 3 1 d 0.10
7 1 1 a 0.20
7 1 1 b 0.20
7 1 1 c 0.20
7 1 1 d 0.20
7 1 2 a 0.20
7 1 2 b 0.20
7 1 2 c 0.20
7 1 2 d 0.20
7 2 1 a 0.20
7 2 1 b 0.20
7 2 1 c 0.20
7 2 1 d 0.20
7 2 2 a 0.20
7 2 2 b 0.20
7 2 2 c 0.20
7 2 2 d 0.20
7 3 1 a 0.20
7 3 1 b 0.20
7 3 1 c 0.20
7 3 1 d 0.20
7 3 2 a 0.20
7 3 2 b 0.20
7 3 2 c 0.20
7 3 2 d 0.20
which has 52 rows as expected.
Upvotes: 1