Restructure dataframe (maybe pivot or unpivot) to have each column display the label of data based on 0's and 1's

Question

I have survey data. The survey asks a question and the respondents pick one or more given categories for each question. The survey then asks demographic questions such as gender. The output is a dataframe with demographic information as columns and a matrix of 0's and 1's for each category in each question (0 = not selected and 1 = selected).

To help you better understand what this looks like, I have the following data frame:

import pandas as pd

df = pd.DataFrame({'Survey ID': [1,2,3],
                   'Q1_Topic A': [0,1,1],
                   'Q1_Topic B': [1,0,1],
                   'Q1_Topic C': [1,0,0],
                   'Q2_Topic X': [0,0,1],
                   'Q2_Topic Y': [0,1,0],
                   'Q2_Topic Z': [0,0,1],
                   'Gender': ['Male', 'Female', 'Male']
                  })
print(df)

I need to transform this dataframe to show me a column for each question and multiple rows for each survey depending on how many categories were chosen. Each row should have a category under the relevant question column.

Confused yet? It's hard to explain, but the data should look like this:

df2 = pd.DataFrame({'Survey ID': [1,1,2,3,3],
                    'Q1': ['B','C','A','A','B'],
                    'Q2': [float('nan'), float('nan'), 'Y', 'X', 'Z'],
                    'Gender': ['Male', 'Male', 'Female', 'Male', 'Male']
                   })
print(df2)

Basically I need to transform df to df2. Note: There is a common separator of "_" for the question and topic for each column label.

As always, thanks a lot for your help in advance. Without this community I would be seriously stuck sometimes, and I am learning a lot through this platform.

jezrael · Accepted Answer

Use:

import numpy as np

# move all non-question columns into a MultiIndex
df2 = df.set_index(['Survey ID','Gender'])
# split column names on whitespace into a column MultiIndex,
# e.g. 'Q1_Topic A' -> ('Q1_Topic', 'A')
df2.columns = df2.columns.str.split(expand=True)
# reshape: move the category level (A, B, ...) into the row index
df2 = df2.stack()
# keep only rows with at least one 1, replace 0 with NaN,
# then stack again so the NaNs are dropped (the default stack() behaviour assumed here)
df2 = df2[df2.eq(1).any(axis=1)].replace(0, np.nan).stack().reset_index(level=2)

# add a counter as a helper level, because one respondent
# can select several categories for the same question
df2['g'] = df2.groupby(level=[0,1,2]).cumcount()
# final reshape: one column per question
df2 = (df2.set_index('g', append=True)['level_2']
          .unstack(2)
          .reset_index(level=2, drop=True)
          .reset_index())

print(df2)
   Survey ID  Gender Q1_Topic Q2_Topic
0          1    Male        B      NaN
1          1    Male        C      NaN
2          2  Female        A        Y
3          3    Male        A        X
4          3    Male        B        Z
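
Note that the columns above are named Q1_Topic and Q2_Topic because the labels are split on whitespace rather than on "_". As an alternative (not the method used in the answer), here is a minimal sketch based on melt and pivot that splits on the "_" separator mentioned in the question, so the columns come out as Q1 and Q2. It assumes every question column follows the pattern "Q<n>_Topic <letter>" and that pandas 1.1 or later is available (for a list-valued pivot index); the names id_cols, long_df and alt are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Survey ID': [1, 2, 3],
                   'Q1_Topic A': [0, 1, 1],
                   'Q1_Topic B': [1, 0, 1],
                   'Q1_Topic C': [1, 0, 0],
                   'Q2_Topic X': [0, 0, 1],
                   'Q2_Topic Y': [0, 1, 0],
                   'Q2_Topic Z': [0, 0, 1],
                   'Gender': ['Male', 'Female', 'Male']})

# columns that identify a respondent (illustrative name)
id_cols = ['Survey ID', 'Gender']

# long format: one row per (respondent, question column); keep selected cells only
long_df = df.melt(id_vars=id_cols, var_name='label', value_name='selected')
long_df = long_df[long_df['selected'].eq(1)].copy()

# split 'Q1_Topic A' into question 'Q1' and category 'A'
# (assumes the 'Q<n>_Topic <letter>' naming pattern from the question)
parts = long_df['label'].str.split('_', n=1, expand=True)
long_df['question'] = parts[0]
long_df['category'] = parts[1].str.replace('Topic ', '', regex=False)

# counter per respondent and question, so multiple selections get separate rows
long_df['g'] = long_df.groupby(id_cols + ['question']).cumcount()

# wide format: one column per question, one row per (respondent, counter)
alt = (long_df.pivot(index=id_cols + ['g'], columns='question', values='category')
              .reset_index()
              .drop(columns='g')
              .rename_axis(columns=None))

print(alt)
```

For the sample data this should give the same five rows as the desired df2 in the question, with the columns ordered Survey ID, Gender, Q1, Q2.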
