Reputation: 17

How to restructure dataframe to convert column values to row values based on condition

I have a dataframe with 5 columns and want to convert 2 of the columns (Chemo and Surgery) based on their values (greater than 0) to rows (diagnosis series) and add the information like the individual id and diagnosis at age to the rows.

Here is my dataframe

import pandas as pd

data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]

df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
print df

I have tried to get the values where the Chemo/Surgery is greater than 0 but when I tried to add it as a row, it doesn't work.

This is what I want the end result to be.

ID     Diagnosis Age at Diagnosis
0   A-1         Birth                0
1   A-1   Lung cancer               25
2   A-1         Chemo               25
3   A-1       Surgery               25
4   A-1         Death               50
5   A-2         Birth                0
6   A-2  Brain cancer               12
7   A-2         Chemo               12
8   A-2   Skin cancer               20
9   A-2         Chemo               20
10  A-2       Surgery               20
11  A-2   Current age               23
12  A-3         Birth                0
13  A-3  Brain cancer               30
14  A-3       Surgery               30
15  A-3   Lung cancer               33
16  A-3         Chemo               33
17  A-3   Current age               35

This is one of the things I have tried:

chem = "Chemo"
try_df = (df[chem] > 1)
nd = df[try_df]
df["Diagnosis"] = df[chem]
print df

Upvotes: 1

Answers (2)

Quang Hoang

Reputation: 150735

We can melt the two columns Chemo and Surgery, then drop all the zero and concat back:

# melt the two columns
new_df = df[['ID', 'Chemo', 'Surgery']].melt(id_vars='ID', 
                                             value_name='Age at Diagnosis',
                                             var_name='Diagnosis')
# filter out the zeros
new_df = new_df[new_df['Age at Diagnosis'].ne('0')]

# concat with the original dataframe, ignoring the extra columns
new_df = pd.concat((df,new_df), sort=False, join='inner')

# sort values
new_df.sort_values(['ID','Age at Diagnosis'])

Output:

    ID      Diagnosis   Age at Diagnosis
0   A-1     Birth           0
1   A-1     Lung cancer     25
1   A-1     Chemo           25
12  A-1     Surgery         25
2   A-1     Death           50
3   A-2     Birth           0
4   A-2     Brain cancer    12
4   A-2     Chemo           12
5   A-2     Skin cancer     20
5   A-2     Chemo           20
16  A-2     Surgery         20
6   A-2     Current age     23
7   A-3     Birth           0
8   A-3     Brain cancer    30
19  A-3     Surgery         30
9   A-3     Lung cancer     33
9   A-3     Chemo           33
10  A-3     Current age     35

Upvotes: 3

Will

Reputation: 1541

This attempt is pretty verbose and takes a few steps. WE can't do a simple pivot or index/column stacking because we need to modify one column with partial results from another. This requires splitting and appending.

Firstly, convert your dataframe into dtypes we can work with.

data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])

df[["Age at Diagnosis", "Chemo", "Surgery"]] = df[["Age at Diagnosis", "Chemo", "Surgery"]].astype(int)

Now we split the thing up into bits and pieces.

# I like making a copy or resetting an index so that 
# pandas is not operating off a slice
df_chemo = df[df.Chemo > 0].copy()
df_surgery = df[df.Surgery > 0].copy()

# drop columns you don't need
df_chemo.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df_surgery.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df.drop(["Chemo", "Surgery"], axis=1, inplace=True)

# Set Chemo and Surgery Diagnosis
df_chemo.Diagnosis = "Chemo"
df_surgery.Diagnosis = "Surgery"

Then append everything together. You can do this because the column dimensions match.

df_new = df.append(df_chemo).append(df_surgery)
# make it look pretty
df_new.sort_values(["ID", "Age at Diagnosis"]).reset_index(drop=True)

Upvotes: 1

How to restructure dataframe to convert column values to row values based on condition

Answers (2)

Related Questions