Reputation: 13
I need to prepare my data for modelling and want to create a dataframe with 0-1 values in the columns. I have a list of lists of codes that I want to one-hot encode into a dataframe.
List = [['DRT', 'AFV'], ['CLN', 'DRT', 'AFV'], ['CLN', 'DRT', 'AFV'], ['BLN', 'PCK', 'CAL', 'WBL', 'BCO', 'UPG', 'CLN', 'DRT'], ['BLN', 'AFV', 'CAL', 'WBL', 'UPG', 'CLN', 'DRT'], ['COA', 'BLN', 'PCK', 'CAL', 'WBL', 'UPG', 'CLN', 'DRT'], ['COA', 'BLN', 'PCK', 'CAL', 'WBL', 'UPG', 'CLN', 'DRT']]
I want a dataframe like the one shown below, with 1 for the items that are in a list and 0 for the items that are not, and one row for each inner list. There are 28 different values in total that can appear in the lists.
[![df][1]][1]
I tried "get_dummies", but this creates separate columns like 1_DRT ... 7_DRT because DRT appears at different positions in the dataframe. I also tried some functions from scikit-learn, but without success. I would really appreciate some help with this one.
Edit: Columns of the eventual dataframe with the 0-1 values -->
columns=['CLN', 'AFV', 'DRT', 'CAL', 'WBL', 'BLN', 'UPG', 'BCO', 'PCK', 'COA', 'WPK', 'WCO', '1CL', 'DRY', 'RES', 'WFR', 'FRZ', 'REC', 'CHF', 'STP', 'DFR', 'HOT', 'EXT', 'PIL', 'SPL', 'INS', 'SVT', 'UVP']

[1]: https://i.sstatic.net/nuUp9.png
Upvotes: 1
Views: 125
Reputation: 23227
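You can create a pandas Series from `List` and `.explode()` it so that each item gets its own row, then use `.str.get_dummies()` to build the dummy table for the exploded rows. Finally, aggregate the rows that belong to the same original list with `.groupby(level=0).max()` (the older `.max(level=0)` shorthand was deprecated in pandas 1.3 and removed in 2.0):
import pandas as pd
df = pd.Series(List).explode().str.get_dummies().groupby(level=0).max()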
Result:
print(df)
   AFV  BCO  BLN  CAL  CLN  COA  DRT  PCK  UPG  WBL
0    1    0    0    0    0    0    1    0    0    0
1    1    0    0    0    1    0    1    0    0    0
2    1    0    0    0    1    0    1    0    0    0
3    0    1    1    1    1    0    1    1    1    1
4    1    0    1    1    1    0    1    0    1    1
5    0    0    1    1    1    1    1    1    1    1
6    0    0    1    1    1    1    1    1    1    1
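The result above only has columns for the codes that actually occur in the data, while the question asks for all 28 possible codes. One way to get there, offered as a minimal sketch rather than part of the original answer, is to reindex the columns against the full list from the question's edit. The names `dummies`, `encoded` and `full` are illustrative, the sample `List` is shortened to three rows, and the aggregation is written as `groupby(level=0).max()` on the assumption that a recent pandas version is in use.

```python
import pandas as pd

# Three sample rows taken from the question's list (shortened for brevity).
List = [
    ['DRT', 'AFV'],
    ['CLN', 'DRT', 'AFV'],
    ['BLN', 'PCK', 'CAL', 'WBL', 'BCO', 'UPG', 'CLN', 'DRT'],
]

# The 28 possible codes, copied from the question's edit.
columns = ['CLN', 'AFV', 'DRT', 'CAL', 'WBL', 'BLN', 'UPG', 'BCO', 'PCK', 'COA',
           'WPK', 'WCO', '1CL', 'DRY', 'RES', 'WFR', 'FRZ', 'REC', 'CHF', 'STP',
           'DFR', 'HOT', 'EXT', 'PIL', 'SPL', 'INS', 'SVT', 'UVP']

# One row per item, one dummy column per code seen in the data.
dummies = pd.Series(List).explode().str.get_dummies()

# Collapse the exploded rows back to one row per original inner list.
encoded = dummies.groupby(level=0).max()

# Add the codes that never occur as all-zero columns, in the requested order.
full = encoded.reindex(columns=columns, fill_value=0)

print(full.shape)
```

Reindexing also puts the columns in the order given in the question instead of the alphabetical order that `.str.get_dummies()` produces.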
Upvotes: 3