Giampaolo Levorato

Reputation: 1622

How to duplicate records in pandas dataframe based on column values

I have created a pandas dataframe as follows:

import pandas as pd

ds = {'col1': ["A", "B"], 'probability': [0.3, 0.6]}
df = pd.DataFrame(data=ds)

The dataframe looks like this:

print(df)
  col1  probability
0    A          0.3
1    B          0.6

I need to create a new dataframe which duplicates each row and assigns to the duplicated record the complementary probability, so that the two probabilities for each original row sum to 1.

For the example above, the resulting dataframe should look like this:

  col1  probability
0    A          0.3
1    A          0.7
2    B          0.6
3    B          0.4

Can anyone help me do this in pandas, please?

Upvotes: 0

Views: 56

Answers (2)

PaulS

Reputation: 25438

One possible solution first creates a second dataframe containing col1 and 1 - probability, then vertically concatenates it to the original one and sorts by col1:

(pd.concat([
    df,
    pd.DataFrame(zip(df['col1'], 1 - df['probability']), columns=df.columns)])
 # a stable sort keeps each original row ahead of its complement
 .sort_values('col1', kind='stable'))

Alternatively (this might be preferred), as @mozway suggests in a comment below:

(pd.concat([
    df,
    pd.DataFrame({'col1': df['col1'], 'probability': 1 - df['probability']})])
 # a stable sort keeps each original row ahead of its complement
 .sort_values('col1', kind='stable'))

Another approach, using list comprehension and numpy:

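import numpy as np

(pd.DataFrame(
    np.vstack([np.array([[x, y], [x, 1 - y]])
               for x, y in zip(df['col1'], df['probability'])]),
    columns=df.columns)
 # the mixed str/float array stores every value as a string, so cast back
 .astype({'probability': float}))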

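Output (from the concat-based versions, which keep the original index labels; the numpy version produces a fresh 0 to 3 index instead):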

  col1  probability
0    A          0.3
0    A          0.7
1    B          0.6
1    B          0.4
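
Stacking labels and numbers in a single NumPy array forces a common dtype, so the probabilities pass through strings on the way into the dataframe. A minimal sketch of a variant that keeps the two columns in separate arrays, so probability stays float64, might look like this (the variable names p and result are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B"], "probability": [0.3, 0.6]})

p = df["probability"].to_numpy()

result = pd.DataFrame({
    # repeat each label twice in place: A, A, B, B
    "col1": np.repeat(df["col1"].to_numpy(), 2),
    # pair each probability with its complement, then flatten row by row:
    # p0, 1 - p0, p1, 1 - p1
    "probability": np.column_stack([p, 1 - p]).ravel(),
})

print(result)
print(result.dtypes)
```

Because the rows are built already interleaved, no sorting step is needed, and the result gets a fresh 0 to 3 index as in the question.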

Upvotes: 3

e-motta

Reputation: 7530

You can concatenate the dataframe with a copy of itself in which probability is replaced by its complement:

df = pd.concat([df, df.assign(probability=1 - df["probability"])], ignore_index=True)

Output:

  col1  probability
0    A          0.3
1    B          0.6
2    A          0.7
3    B          0.4
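
Note that this output lists all the original rows first and the complements after them, whereas the question asks for each complement directly after its original row. A minimal sketch of one way to get that order, assuming the original index is unique (as with the default RangeIndex): keep the index labels during concatenation, stable-sort on the index, then renumber. The names complement and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B"], "probability": [0.3, 0.6]})

# The complementary rows share the index labels of their originals.
complement = df.assign(probability=1 - df["probability"])

result = (
    pd.concat([df, complement])   # index labels are 0, 1, 0, 1
    .sort_index(kind="stable")    # stable sort keeps each original before its complement
    .reset_index(drop=True)       # renumber the rows 0..3
)

print(result)
```

Sorting on the index rather than on col1 also preserves the original row order when col1 is unsorted or contains repeated values.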

Upvotes: 1
