Reputation: 364
I have a dataframe which looks something like this:
date A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3 D_1 D_2 D_3
xxx 4 5 6 2 3 1 5 7 2 4 3 1
xxx 3 3 2 4 5 2 6 2 3 2 4 2
xxx 5 7 5 1 3 3 4 5 4 8 2 2
xxx 6 1 8 6 1 4 1 2 7 4 3 5
I am trying to compute a value based on a condition for every row which will apply across the column groupings of A, B, C, D, etc. and count how many of those groups passed the condition, for example, some pseudo-code:
count = 0
for each (A, B, C, D) as col:
if col_1 > 3 and (col_2 > col_3):
count++
dataframe['count'] = count
Expected output:
date A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3 D_1 D_2 D_3 count
xxx 4 5 6 2 3 1 5 7 2 4 3 1 2
xxx 3 3 2 4 5 2 6 2 3 2 4 2 1
xxx 5 7 5 1 3 3 4 5 4 8 2 2 2
xxx 6 1 8 6 1 4 1 2 7 4 3 5 0
This would mean the example dataframe would end up with a new column with a value of 2 because in this example column C and D pass the set of conditions.
Later I plan to do something for each row with a count value above some number.
The best way I was able to come up with looked something like this:
for col in cols:
conditions.append(
(
(dataframe[f'{col}_1'] > min_corr_coef) &
(dataframe[f'{col}_2'] < (dataframe[f'{col_3}])
) |
(dataframe[f'{col}_1'] <= min_corr_coef)
)
conditions.append(
(dataframe[dataframe.loc[:,cols] > min_corr_coef].count(axis=1) >= min_corr_pair)
)
if conditions:
dataframe.loc[
reduce(lambda x, y: x & y, conditions),
'pass'] = 1
Not only is this incorrect, but it is also extremely slow. What I like about it, is that it is readable and adding additional conditions if there are more columns/different data is relatively straightforward.
I am not sure if this type of operation is too complex to perform on a dataframe the way I am trying to and perhaps I need to change my data structure or not, but I figured I'd see if there is some way of going about this before I start to re-write things.
Ideally the logic being applied to the column groups (A, B, C, D in this case) can be modular and whatever method I use to do this should be easy to add/remove conditions because I want to be able to add/remove data columns and logic conditions over time.
What I am trying to do is test an arbitrary set of conditions when one of the columns for a group (which is a correlation coefficient) is above a certain value. If enough of the column groups pass the conditions, do something.
Thanks in advance, I am a python and pandas novice and this has been giving me a headache for days.
Upvotes: 1
Views: 70
Reputation: 35626
Let's ignore date
and handle the other columns first.
Remove date, create a MultiIndex using str.split
, and stack
to long form:
new_df = df.drop(columns='date')
new_df.columns = new_df.columns.str.split('_', expand=True)
new_df = new_df.stack(level=0)
1 2 3
0 A 4 5 6
B 2 3 1
C 5 7 2
D 4 3 1
1 A 3 3 2
B 4 5 2
C 6 2 3
D 2 4 2
2 A 5 7 5
B 1 3 3
C 4 5 4
D 8 2 2
3 A 6 1 8
B 6 1 4
C 1 2 7
D 4 3 5
Apply the condition row-wise:
new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])
Then sum
level 0 and assign back to the original df
:
df['count'] = new_df['condition'].sum(level=0)
(Alternatively sum the conditions directly rather than assigning to both new_df
and df
)
df['count'] = (new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])).sum(level=0)
df
:
date A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3 D_1 D_2 D_3 count
0 xxx 4 5 6 2 3 1 5 7 2 4 3 1 2
1 xxx 3 3 2 4 5 2 6 2 3 2 4 2 1
2 xxx 5 7 5 1 3 3 4 5 4 8 2 2 2
3 xxx 6 1 8 6 1 4 1 2 7 4 3 5 0
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'date': ['xxx', 'xxx', 'xxx', 'xxx'], 'A_1': [4, 3, 5, 6],
'A_2': [5, 3, 7, 1], 'A_3': [6, 2, 5, 8],
'B_1': [2, 4, 1, 6], 'B_2': [3, 5, 3, 1],
'B_3': [1, 2, 3, 4], 'C_1': [5, 6, 4, 1],
'C_2': [7, 2, 5, 2], 'C_3': [2, 3, 4, 7],
'D_1': [4, 2, 8, 4], 'D_2': [3, 4, 2, 3],
'D_3': [1, 2, 2, 5]
})
new_df = df.drop(columns='date')
new_df.columns = new_df.columns.str.split('_', expand=True)
new_df = new_df.stack(level=0)
new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])
df['count'] = new_df['condition'].sum(level=0)
print(df)
Upvotes: 1
Reputation: 33940
Solution as one-liner:
>>> sum(df1.iloc[:, range(3*i,3*(i+1))].apply(lambda x: x[0] > 3 and (x[1] > x[2]), axis=1) for i in range(len(df1.columns)//3)).tolist()
[2]
Approach: Use .iloc[]
slicing, if it's guaranteed that the column names occur in regular groups of three named X_1, X_2, X_3
:
for i in range(len(df.columns)//3):
df.iloc[:, range(3*i,3*(i+1))]
A_1 A_2 A_3
date
xxx 4 5 6
B_1 B_2 B_3
date
xxx 2 3 1
C_1 C_2 C_3
date
xxx 5 7 2
D_1 D_2 D_3
date
xxx 4 3 1
Then you can apply(..., axis=1)
whatever lamba function or custom function you want on each slice of three columns:
for i in range(len(df1.columns)//3):
# Your commands go here, print()ing just as an example
print(df1.iloc[:, range(3*i,3*(i+1))])
print(df1.iloc[:, range(3*i,3*(i+1))].apply(lambda x: x[0] > 3 and (x[1] > x[2]), axis=1))
A_1 A_2 A_3
date
xxx 4 5 6
date
xxx False
dtype: bool
B_1 B_2 B_3
date
xxx 2 3 1
date
xxx False
dtype: bool
C_1 C_2 C_3
date
xxx 5 7 2
date
xxx True
dtype: bool
D_1 D_2 D_3
date
xxx 4 3 1
date
xxx True
dtype: bool
Now you can apply()
across each column-slice-of-three-columns, and then sum()
vertically, or whatever you want (instead of for-loop with increment.)
Upvotes: 1