Reputation: 4723
I have a column in a pandas DataFrame called "label" which contains multiple string values (for example: 'label1', 'label2', 'label3', ...).
label
label1
label1
label23
label3
label11
I output all unique values into a list and then create new variables:
unique_labels = df['label'].unique()
for i in unique_labels:  # create new single label variable holders
    df[str(i)] = 0
Now I have
label    label1  label2  ....  label23
label1        0       0              0
label23       0       0              0
I want to set each new single-label variable to 1 where it matches the row's 'label' value, as follows:
label    label1  label2  ....  label23
label1        1       0              0
label23       0       0              1
Here is my code
def single_label(df):
    for i in range(len(unique_labels)):
        if df['label'] == str(unique_labels[i]):
            df[unique_labels[i]] == 1
df = df.applymap(single_label)
Getting this error
TypeError: ("'int' object is not subscriptable", 'occurred at index Unnamed: 0')
Upvotes: 0
Views: 236
Reputation: 51395
IIUC, you can use pd.get_dummies after you drop duplicates, which will be faster and result in cleaner code than doing it iteratively:
df.drop_duplicates().join(pd.get_dummies(df.drop_duplicates()))
     label  label_label1  label_label11  label_label23  label_label3
0   label1             1              0              0             0
2  label23             0              0              1             0
3   label3             0              0              0             1
4  label11             0              1              0             0
You can get rid of those label prefixes and underscores using the prefix and prefix_sep arguments:
df.drop_duplicates().join(pd.get_dummies(df.drop_duplicates(),
                                         prefix='', prefix_sep=''))
     label  label1  label11  label23  label3
0   label1       1        0        0       0
2  label23       0        0        1       0
3   label3       0        0        0       1
4  label11       0        1        0       0
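For reference, here is a minimal self-contained sketch of that call, rebuilding the sample column from the question and keeping every row. The dtype=int argument is my addition rather than part of the snippet above: newer pandas releases return boolean indicator columns by default, so it is only there to keep the 0/1 output.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"label": ["label1", "label1", "label23", "label3", "label11"]})

# One indicator column per distinct label. An empty prefix and separator
# keep the column names identical to the label values.
# dtype=int is an assumption added here to force 0/1 instead of True/False.
dummies = pd.get_dummies(df["label"], prefix="", prefix_sep="", dtype=int)

print(dummies)
```

Each row ends up with exactly one 1, in the column named after that row's label.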
Edit: with a second column, i.e.:
>>> df
     label second_column
0   label1             a
1   label1             b
2  label23             c
3   label3             d
4  label11             e
Just call pd.get_dummies on only the label column:
df.drop_duplicates('label').join(pd.get_dummies(df['label'].drop_duplicates(),
                                                prefix='', prefix_sep=''))
     label second_column  label1  label11  label23  label3
0   label1             a       1        0        0       0
2  label23             c       0        0        1       0
3   label3             d       0        0        0       1
4  label11             e       0        1        0       0
But then you're dropping the rows whose label is a duplicate (row 1 here), and I don't think that's what you want (unless I'm mistaken). If it isn't, just omit the drop_duplicates calls:
df.join(pd.get_dummies(df['label'], prefix='', prefix_sep=''))
     label second_column  label1  label11  label23  label3
0   label1             a       1        0        0       0
1   label1             b       1        0        0       0
2  label23             c       0        0        1       0
3   label3             d       0        0        0       1
4  label11             e       0        1        0       0
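As for the error in the question: applymap calls the function once per cell, so inside single_label the name df is a single scalar (here an int from the 'Unnamed: 0' column), and indexing it with ['label'] raises the TypeError. The == on the assignment line would also only compare, not assign. If you would rather keep a loop over the unique labels, a sketch along these lines should work, assuming the same sample data as above (lab is just an illustrative loop variable):

```python
import pandas as pd

# Sample data matching the two-column example above.
df = pd.DataFrame({
    "label": ["label1", "label1", "label23", "label3", "label11"],
    "second_column": ["a", "b", "c", "d", "e"],
})

for lab in df["label"].unique():
    # Compare the whole column at once, then convert True/False to 1/0.
    df[lab] = (df["label"] == lab).astype(int)

print(df)
```

This compares whole columns instead of individual cells, so no apply or applymap call is needed. The new columns come out in order of first appearance rather than sorted, which is the main visible difference from the get_dummies result.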
Upvotes: 2