Reputation: 895
I have a number of pandas dataframes that each have a column 'speaker' containing one of two labels. Typically these are 0 and 1, but in some cases they are 1-2, 1-3, or 0-2. I am trying to find a way to iterate through all of my dataframes and standardize them so that they share the same labels (0-1).
The one consistent feature between them is that the first label to appear (i.e. in the first row of the dataframe) should always be mapped to '0', whereas the second should always be mapped to '1'.
Here is an example of one of the dataframes I would need to change - being mindful that others will have different labels:
import pandas as pd
data = [1,2,1,2,1,2,1,2,1,2]
df = pd.DataFrame(data, columns = ['speaker'])
I would like to change it so that it appears as [0,1,0,1,0,1,0,1,0,1].
Thus far, I have tried inserting the following code within a bigger for loop that iterates through each dataframe. However it is not working at all:
for label in data['speaker']:
    if label == data['speaker'][0]:
        label = '0'
    else:
        label = '1'
Hopefully, the above makes clear that I am attempting to create a rule akin to: "find all instances in 'speaker' that match the label in the first index position and change them to '0'. Change all other instances to '1'."
Upvotes: 1
Reputation: 42896
We can use iat + np.where here for conditional creation of your column:
import numpy as np
first_val = df['speaker'].iat[0] # same as df['speaker'].iloc[0]
df['speaker'] = np.where(df['speaker'].eq(first_val), 0, 1)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
We can also make use of booleans, since we can cast them to integers:
first_val = df['speaker'].iat[0]
df['speaker'] = df['speaker'].ne(first_val).astype(int)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
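To run this over several dataframes, as the question asks, the same comparison could be wrapped in a small helper. This is only a minimal sketch: standardize_speaker and the frames list are hypothetical names that do not appear in the original post, and it assumes every dataframe has at least one row.

```python
import pandas as pd

def standardize_speaker(frame, column="speaker"):
    """Return a copy where the first label seen maps to 0 and any other label to 1."""
    out = frame.copy()
    first_val = out[column].iat[0]  # raises IndexError on an empty frame
    out[column] = out[column].ne(first_val).astype(int)
    return out

# Hypothetical inputs using the label pairs mentioned in the question
frames = [
    pd.DataFrame({"speaker": [1, 2, 1, 2]}),
    pd.DataFrame({"speaker": [3, 1, 3, 1]}),
    pd.DataFrame({"speaker": [0, 2, 2, 0]}),
]
standardized = [standardize_speaker(f) for f in frames]
```

Returning a copy leaves the original dataframes untouched; assigning back in place would work just as well if that is preferred.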
Only if your values are actually 1 and 2, we can use floor division:
df['speaker'] = df['speaker'] // 2
# same as: df['speaker'] = df['speaker'].floordiv(2)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
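Floor division depends on the numeric values rather than on which label appears first, so it is worth checking it against a frame that starts with the larger label. A small illustration, with df2 as a made-up example:

```python
import pandas as pd

# Hypothetical frame whose first label is the larger one
df2 = pd.DataFrame({"speaker": [2, 1, 2, 1]})

# Floor division maps by value: 2 -> 1 and 1 -> 0
by_floor = (df2["speaker"] // 2).tolist()

# Comparing against the first row maps by order of appearance
by_first = df2["speaker"].ne(df2["speaker"].iat[0]).astype(int).tolist()

print(by_floor)  # [1, 0, 1, 0]
print(by_first)  # [0, 1, 0, 1]
```

In this case only the comparison against the first row gives the "first label becomes 0" behaviour the question describes.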
Upvotes: 2
Reputation: 1054
You can use iloc to get the value in the first row of the 'speaker' column, and then a mask to set the values:
zero_map = df["speaker"].iloc[0]
mask_zero = df["speaker"] == zero_map
# name the column so that only 'speaker' is overwritten
df.loc[mask_zero, "speaker"] = 0
df.loc[~mask_zero, "speaker"] = 1
print(df)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
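When a dataframe has more columns than 'speaker', a row-only .loc[mask] = value assignment writes to every column in the selected rows. Below is a sketch of the column-scoped form; df3 and its text column are hypothetical additions, not part of the original example:

```python
import pandas as pd

# Hypothetical frame with an extra column besides 'speaker'
df3 = pd.DataFrame({"speaker": [1, 2, 1], "text": ["hi", "hello", "bye"]})

# Build the mask once, before any values are overwritten
mask_zero = df3["speaker"] == df3["speaker"].iloc[0]

# Naming the column restricts the assignment to 'speaker'
df3.loc[mask_zero, "speaker"] = 0
df3.loc[~mask_zero, "speaker"] = 1
```

The text column is left as it was, and only the speaker labels are rewritten.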
Upvotes: 1