Reputation: 468
I have the following dummy df:
columns = ['Part', 'Task_Color', 'Task', 'Solution', 'correct_one', 'incorrect_two', 'correct_three']
data = [['16a', 'Red', 'What is on the table?', 'bottle', 'a bottle', 'nothing', 'empty bottle'],
['16b', 'Red', 'Write hi', 'hi', 'hello', 'hey', 'no'],
['16c', 'Red', 'Cats or dogs', 'both', 'cats', 'dogs', 'both'],
['17a', 'Red', 'Nice to meet you', 'bottle', 'a bottle', 'nothing', 'empty bottle'],
['17b', 'Red', 'bla bla', 'nope', 'nope', 'nope', 'nope'],
['18a', 'Blue', 'Do x and write y', 'y', 'y', '...', 'what'],
['18b', 'Blue', 'Help me', 'sure', 'ok', 'yes', 'no']]
df = pd.DataFrame(columns=columns, data=data)
The df looks like this
Part Task_Color Task Solution correct_one incorrect_two correct_three
0 16a Red What is on the table? bottle a bottle nothing emptybottle
1 16b Red Write hi hi hello hey no
2 16c Red Cats or dogs both cats dogs both
3 17a Red Nice to meet you bottle a bottle nothing empty bottle
4 17b Red bla bla nope nope nope nope
5 18a Blue Do x and write y y y ... what
6 18b Blue Help me sure ok yes no
This is the dataframe I want to create out of the df above:
Part Task_Color Task Solution answer label
16a Red What is on the table? bottle a bottle 1
16a Red What is on the table? bottle nothing 0
16a Red What is on the table? bottle empty bottle 1
16b Red Write hi hi hello 1
16b Red Write hi hi hey 0
16b Red Write hi hi no 1
16c Red Cats or dogs both cats 1
16c Red Cats or dogs both dogs 0
16c Red Cats or dogs both both 1
17a Red Nice to meet you bottle a bottle 1
17a Red Nice to meet you bottle nothing 0
17a Red Nice to meet you bottle empty bottle 1
...
I want to split every row in the original df into multiple rows so that every correct_one
, incorrect_two
and correct_three
has its own row as illustrated above. The content that was inside these three columns should be in a new column called answer
.
Moreover, I want to add a new column, label
, that is filled with a 1
if the column header includes correct and elif includes incorrect a 0
.
Upvotes: 0
Views: 45
Reputation: 120409
Another (one-line) solution with melt
>>> df.melt(df.columns[:4].tolist(), var_name="label", value_name="answer") \
.replace({"label": {"correct_one": 1, "incorrect_two": 0, "correct_three": 1}}) \
.sort_values("Part").reset_index(drop=True)
Part Task_Color Task Solution label answer
0 16a Red What is on the table? bottle 1 a bottle
1 16a Red What is on the table? bottle 1 empty bottle
2 16a Red What is on the table? bottle 0 nothing
3 16b Red Write hi hi 1 hello
4 16b Red Write hi hi 1 no
5 16b Red Write hi hi 0 hey
6 16c Red Cats or dogs both 1 cats
7 16c Red Cats or dogs both 1 both
8 16c Red Cats or dogs both 0 dogs
9 17a Red Nice to meet you bottle 1 empty bottle
10 17a Red Nice to meet you bottle 0 nothing
11 17a Red Nice to meet you bottle 1 a bottle
12 17b Red bla bla nope 0 nope
13 17b Red bla bla nope 1 nope
14 17b Red bla bla nope 1 nope
15 18a Blue Do x and write y y 0 ...
16 18a Blue Do x and write y y 1 y
17 18a Blue Do x and write y y 1 what
18 18b Blue Help me sure 0 yes
19 18b Blue Help me sure 1 ok
20 18b Blue Help me sure 1 no
Upvotes: 2
Reputation: 3276
def combine(rows):
return [
(rows.correct_one, 1),
(rows.incorrect_two, 0),
(rows.correct_three, 1)]
df['answers'] = df.apply(combine, axis=1)
df = df.explode('answers')
df[['answer','correct']] = pd.DataFrame(
df['answers'].tolist(),
index= df.index)
df.drop('answers', axis=1)
Upvotes: 1
Reputation: 28243
you can use stack
like below to reshape the data.
output_df = df.set_index(['Part', 'Task_Color', 'Task', 'Solution']).stack().rename('answer').reset_index()
then calculate the label & drop the column level_4
df['label'] = (~df.level_4.str.contains('incorrect')).astype(int)
del df['level_4']
Upvotes: 1