Reputation: 1622
I have created a pandas dataframe as follows:
import pandas as pd

ds = {'col1': ["A", "B"], 'probability': [0.3, 0.6]}
df = pd.DataFrame(data=ds)
The dataframe looks like this:
print(df)
  col1  probability
0    A          0.3
1    B          0.6
I need to create a new dataframe that duplicates each row and assigns to the duplicated record the complementary probability (1 - probability), so that each pair sums to 1.
From the example above:
The resulting dataframe looks like this:
  col1  probability
0    A          0.3
1    A          0.7
2    B          0.6
3    B          0.4
Can anyone help me do this in pandas, please?
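To be clear about the requirement: after the duplication, the probabilities within each col1 group should sum to 1. A quick sanity check on the result (calling the new dataframe new_df here, purely as a placeholder name) would be something like:
new_df.groupby('col1')['probability'].sum()   # every value should be 1.0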
Upvotes: 0
Views: 56
Reputation: 25438
A possible solution: first create a second dataframe with col1 and 1 - probability, then vertically concatenate it to the original dataframe and sort by col1:
(pd.concat([
    df,
    pd.DataFrame(zip(df['col1'], 1 - df['probability']), columns=df.columns)])
 .sort_values('col1'))
Alternatively (this might be preferred, since building the second dataframe from an explicit dict names the columns instead of relying on positional pairing with df.columns), as @mozway suggests in a comment below:
(pd.concat([
    df,
    pd.DataFrame({'col1': df['col1'], 'probability': 1 - df['probability']})])
 .sort_values('col1'))
Another approach, using a list comprehension and numpy:
import numpy as np

# each original row yields [value, p] and [value, 1 - p];
# note: np.array mixes str and float, so probability may need .astype(float) afterwards
pd.DataFrame(
    np.vstack([np.array([[x, y], [x, 1 - y]])
               for x, y in zip(df['col1'], df['probability'])]),
    columns=df.columns)
Output:
  col1  probability
0    A          0.3
0    A          0.7
1    B          0.6
1    B          0.4
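If a plain 0..n-1 index is preferred over the duplicated index that the concatenation produces, a reset_index can be appended; a minimal sketch reusing the dict-based variant above:
(pd.concat([
    df,
    pd.DataFrame({'col1': df['col1'], 'probability': 1 - df['probability']})])
 .sort_values('col1')
 .reset_index(drop=True))   # index becomes 0, 1, 2, 3 instead of 0, 0, 1, 1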
Upvotes: 3
Reputation: 7530
You can use assign combined with concat:
df = pd.concat([df, df.assign(probability=1 - df["probability"])], ignore_index=True)
  col1  probability
0    A          0.3
1    B          0.6
2    A          0.7
3    B          0.4
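This interleaves the rows (A, B, A, B). If the output should be grouped by col1 as in the question's expected result, a sort can be chained on; a small sketch along the same lines:
df = (pd.concat([df, df.assign(probability=1 - df["probability"])], ignore_index=True)
      .sort_values("col1")
      .reset_index(drop=True))   # rows grouped as A, A, B, B with index 0..3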
Upvotes: 1