Reputation: 17765
As part of a classification problem, I work on a DataFrame containing multiple label columns.
My dataframe is of this form :
df = pd.DataFrame([['a', 1, 1],
['b', 1, 0],
['c', 0, 0]] , columns=['col1', 'label1', 'label2'])
>>> col1 label1 label2
0 a 1 1
1 b 1 0
2 c 0 0
As I do not want to have more than one true label per row, I want to duplicate only those rows and regularize this condition as follows :
>>> col1 label1 label2
0 a 1 0 # Modified original row
1 a 0 1 # Duplicated & modified row
2 b 1 0
3 c 0 0
With only the row of value "a" being duplicated / regularized
At the moment I do that in a for loop, replicating the rows in a second DataFrame, appending it and dropping all the "invalid" rows.
Would there be a more clean/efficient way to do that ?
Upvotes: 2
Views: 536
Reputation: 136
Split into 2 df - unique and duplicates. For duplicates took col1 + label1 columns and concat with col1 + label2 and fill nan with 0. Concat unique and duplicates df into one:
df = pd.DataFrame([['a', 1, 1],
['b', 1, 0],
['c', 0, 0]], columns=['col1', 'label1', 'label2'])
mask = (df['label1'] == 1) & (df['label2'] == 1)
df_dup, df_uq = df[mask], df[~mask]
df_dup = pd.concat([df_dup[['col1', x]] for x in df_dup.columns if x != 'col1']).fillna(0)
df = pd.concat([df_dup, df_uq], ignore_index=True)
print(df)
col1 label1 label2
0 a 1.0 0.0
1 a 0.0 1.0
2 b 1.0 0.0
3 c 0.0 0.0
Upvotes: 1
Reputation: 62037
Here is a somewhat intuitive way of thinking about the problem. First, filter for just the rows that have label both equal to 1. Make two new dataframes by replacing each column by zero, once each.
Then concatenate the original dataframe without both rows equal to one to the two new dataframes created.
mask_ones = (df['label1'] == 1) & (df['label2'] == 1)
df_ones = df[mask_ones]
df_not_ones = df[~mask_ones]
df_final = pd.concat([df_not_ones,
df_ones.replace({'label2':{1:0}}),
df_ones.replace({'label1':{1:0}})]).sort_values('col1')
Upvotes: 1
Reputation: 5921
You can also use df.iterrows()
doing as follows :
for index, row in df.iterrows():
if row[1]+row[2]==2:
df = pd.concat((df, pd.DataFrame({'col1':[row[0]], 'label1':[0], 'label2':[1]})),ignore_index=True)
df = pd.concat((df, pd.DataFrame({'col1':[row[0]], 'label1':[1], 'label2':[0]})), ignore_index=True)
df.drop(index, inplace=True)
Result :
col1 label1 label2
1 b 1 0
2 c 0 0
3 a 0 1
4 a 1 0
Then you can sort regarding values on col1
Upvotes: 1
Reputation: 117606
>>> cols = [x for x in df.columns is x != 'col1']
>>> res = pd.concat([df[['col1', x]] for x in cols])
>>> res = res.drop_duplicates()
>>> res.fillna(0, inplace=True)
>>> res.sort_values(by='col1', inplace=True)
>>> res.reset_index(drop=True, inplace=True)
>>> res
col1 label1 label2
0 a 1 0
1 a 0 1
2 b 1 0
3 b 0 0
4 c 0 0
Upvotes: 1
Reputation: 1567
Something like that:
df = pd.DataFrame([['a', 1, 1],
['b', 1, 0],
['c', 0, 0]] , columns=['col1', 'label1', 'label2'])
df2 = pd.DataFrame()
df2["col1"] = df["col1"]
df2["label2"] = df["label2"]
df.drop(labels="label2", axis=1, inplace=True)
result = df.append(df2, ignore_index=True)
result.fillna(value=0, inplace=True)
result.sort_values(by="col1")
Result:
col1 label1 label2
0 a 1.000000 0.000000
3 a 0.000000 1.000000
1 b 1.000000 0.000000
4 b 0.000000 0.000000
2 c 0.000000 0.000000
5 c 0.000000 0.000000
Finally, you could drop duplicates
result.drop_duplicates()
Upvotes: 0