Reputation: 1727
I have the following dataframe in pandas (the df below is abbreviated):
Index: 23253 entries, 7.0 to 30559.0
Data columns (total 17 columns):
Epoch 23190 non-null values
follow 23253 non-null values
T_Opp 245 non-null values
T_Dir 171 non-null values
Teacher 0 non-null values
Activity 23253 non-null values
Actor 23253 non-null values
Recipient1 14608 non-null values
dtypes: float64(10), object(7)
Columns like T_Opp and T_Dir have dummy (1/0) data in them. When values in these columns are true, I want to add data from the 'Actor' column to the 'Teacher' column. So far, I have this (where the "mask" gives the condition under which the data are true. checked this bit and it works):
opp_mask = f_acts['Behavior'].str.contains('bp', na=False)
opp_teacher = f_acts[opp_mask]['Recipient1']
If I were doing this based only on one column, I could simply plug these results into the Teacher column in the dataframe with something like this:
df['Teacher'] = df[opp_mask]['Actor']
But I need to fill the Teacher column with with data from 6 other columns, without overwriting the earlier columns. I have an idea of how this might work, similar to this toy example:
list = [1]*len(df.Teacher)
df['Teacher'] = list
But I can't seem to figure out how to transform the output of the "mask" technique above to the correct format for this approach--it has the same index info but is shorter than the dataframe I need to add it to. What am I missing?
UPDATE: Adding the data below to clarify what I'm trying to do.
follow T_Opp T_Dir T_Enh T_SocTol Teacher Actor Recipient1
7 0 1 0 0 NaN 51608 f
8 0 0 0 0 NaN bla NaN
11 0 0 0 0 NaN 51601 NaN
13 1 0 0 1 NaN f 51602
18 0 0 0 0 NaN f NaN
So for data like these, what I'm trying to do is check the T_ columns one at a time. If the value in a T_ column is true, fetch the data from the Actor column (if looking at the T_Opp or T_SocTol columns) or from the Recipient column (if looking at T_Enh or T_Dir columns). I want to copy that data into the currently empty Teacher column.
More than one of the T_ columns can be true at a time, but in these cases it will always be "grabbing" the same data twice. (In other words, I never need data from BOTH the Actor and Recipient columns. Only one or the other, for each row).
I want to copy that data into the currently empty Teacher column.
Upvotes: 3
Views: 4745
Reputation: 49788
Here's an approach to masking and concatenating multiple columns with Series.where()
. If the end result is a column of strings, numeric columns will need to be converted to string first with .astype(str)
.
In [23]: df
Out[23]:
C0 Mask1 Mask2 Val1 Val2
0 R_l0_g0 0 0 v1 v2
1 R_l0_g1 1 0 v1 v2
2 R_l0_g2 0 1 v1 v2
3 R_l0_g3 1 1 v1 v2
In [24]: df['Other'] = (df.Val1.astype(str).where(df.Mask1, '') + ',' +
df.Val2.astype(str).where(df.Mask2, '')).str.strip(',')
In [25]: df
Out[25]:
C0 Mask1 Mask2 Val1 Val2 Other
0 R_l0_g0 0 0 v1 v2
1 R_l0_g1 1 0 v1 v2 v1
2 R_l0_g2 0 1 v1 v2 v2
3 R_l0_g3 1 1 v1 v2 v1,v2
And here's another approach using DataFrame.where()
. .where
, like most pandas operations, performs automatic data alignment. Since the column names of the data frame and frame to mask with differ in this case, alignment can be disabled by masking with a raw, un-labeled numpy.ndarray
(aka. .values
).
In [23]: masked = df[['Val1', 'Val2']].\
where(df[['Mask1', 'Mask2']].values, '') + ','
In [24]: df['Other2'] = masked.sum(axis=1).str.strip(',')
In [25]: df
Out[25]:
C0 Mask1 Mask2 Val1 Val2 Other Other2
0 R_l0_g0 0 0 v1 v2
1 R_l0_g1 1 0 v1 v2 v1 v1
2 R_l0_g2 0 1 v1 v2 v2 v2
3 R_l0_g3 1 1 v1 v2 v1,v2 v1,v2
Upvotes: 1