mxdbld
mxdbld

Reputation: 17765

Duplicating & modifying rows in pandas based on columns condition

As part of a classification problem, I work on a DataFrame containing multiple label columns.

My dataframe is of this form :

df = pd.DataFrame([['a', 1, 1],
                   ['b', 1, 0],
                   ['c', 0, 0]] , columns=['col1', 'label1', 'label2'])

>>>   col1  label1  label2
    0    a       1       1
    1    b       1       0
    2    c       0       0

As I do not want to have more than one true label per row, I want to duplicate only those rows and regularize this condition as follows :

>>>   col1  label1  label2
    0    a       1       0 # Modified original row
    1    a       0       1 # Duplicated & modified row
    2    b       1       0
    3    c       0       0

With only the row of value "a" being duplicated / regularized

At the moment I do that in a for loop, replicating the rows in a second DataFrame, appending it and dropping all the "invalid" rows.

Would there be a more clean/efficient way to do that ?

Upvotes: 2

Views: 536

Answers (5)

Dmitriy K.
Dmitriy K.

Reputation: 136

Split into 2 df - unique and duplicates. For duplicates took col1 + label1 columns and concat with col1 + label2 and fill nan with 0. Concat unique and duplicates df into one:

df = pd.DataFrame([['a', 1, 1],
                   ['b', 1, 0],
                   ['c', 0, 0]], columns=['col1', 'label1', 'label2'])

mask = (df['label1'] == 1) & (df['label2'] == 1)
df_dup, df_uq = df[mask], df[~mask]
df_dup = pd.concat([df_dup[['col1', x]] for x in df_dup.columns if x != 'col1']).fillna(0)
df = pd.concat([df_dup, df_uq], ignore_index=True)
print(df)

  col1  label1  label2
0    a     1.0     0.0
1    a     0.0     1.0
2    b     1.0     0.0
3    c     0.0     0.0

Upvotes: 1

Ted Petrou
Ted Petrou

Reputation: 62037

Here is a somewhat intuitive way of thinking about the problem. First, filter for just the rows that have label both equal to 1. Make two new dataframes by replacing each column by zero, once each.

Then concatenate the original dataframe without both rows equal to one to the two new dataframes created.

mask_ones = (df['label1'] == 1) & (df['label2'] == 1)
df_ones = df[mask_ones]
df_not_ones = df[~mask_ones]
df_final = pd.concat([df_not_ones, 
                      df_ones.replace({'label2':{1:0}}),
                      df_ones.replace({'label1':{1:0}})]).sort_values('col1')

Upvotes: 1

MMF
MMF

Reputation: 5921

You can also use df.iterrows() doing as follows :

for index, row in df.iterrows():
    if row[1]+row[2]==2:
        df = pd.concat((df, pd.DataFrame({'col1':[row[0]], 'label1':[0], 'label2':[1]})),ignore_index=True)
        df = pd.concat((df, pd.DataFrame({'col1':[row[0]], 'label1':[1], 'label2':[0]})), ignore_index=True)
        df.drop(index, inplace=True)

Result :

  col1  label1  label2
1    b       1       0
2    c       0       0
3    a       0       1
4    a       1       0

Then you can sort regarding values on col1

Upvotes: 1

roman
roman

Reputation: 117606

>>> cols = [x for x in df.columns is x != 'col1']
>>> res = pd.concat([df[['col1', x]] for x in cols])
>>> res = res.drop_duplicates()
>>> res.fillna(0, inplace=True)
>>> res.sort_values(by='col1', inplace=True)
>>> res.reset_index(drop=True, inplace=True)
>>> res
  col1  label1  label2
0    a       1       0
1    a       0       1
2    b       1       0
3    b       0       0
4    c       0       0

Upvotes: 1

RaJa
RaJa

Reputation: 1567

Something like that:

df = pd.DataFrame([['a', 1, 1],
                   ['b', 1, 0],
                   ['c', 0, 0]] , columns=['col1', 'label1', 'label2'])
df2 = pd.DataFrame()
df2["col1"] = df["col1"]
df2["label2"] = df["label2"]
df.drop(labels="label2", axis=1, inplace=True)
result = df.append(df2, ignore_index=True)
result.fillna(value=0, inplace=True)
result.sort_values(by="col1")

Result:

  col1   label1   label2
0    a 1.000000 0.000000
3    a 0.000000 1.000000
1    b 1.000000 0.000000
4    b 0.000000 0.000000
2    c 0.000000 0.000000
5    c 0.000000 0.000000

Finally, you could drop duplicates

result.drop_duplicates()

Upvotes: 0

Related Questions