WerkkreW
WerkkreW

Reputation: 364

Dataframe count set of conditions passed by several columns on a per row basis

I have a dataframe which looks something like this:

date   A_1  A_2  A_3  B_1  B_2  B_3  C_1  C_2  C_3  D_1  D_2  D_3
xxx    4    5    6    2    3    1    5    7    2    4    3    1
xxx    3    3    2    4    5    2    6    2    3    2    4    2
xxx    5    7    5    1    3    3    4    5    4    8    2    2
xxx    6    1    8    6    1    4    1    2    7    4    3    5

I am trying to compute a value based on a condition for every row which will apply across the column groupings of A, B, C, D, etc. and count how many of those groups passed the condition, for example, some pseudo-code:

count = 0
for each (A, B, C, D) as col:
    if col_1 > 3 and (col_2 > col_3):
        count++

dataframe['count'] = count   

Expected output:

date   A_1  A_2  A_3  B_1  B_2  B_3  C_1  C_2  C_3  D_1  D_2  D_3  count
xxx    4    5    6    2    3    1    5    7    2    4    3    1    2
xxx    3    3    2    4    5    2    6    2    3    2    4    2    1
xxx    5    7    5    1    3    3    4    5    4    8    2    2    2
xxx    6    1    8    6    1    4    1    2    7    4    3    5    0

This would mean the example dataframe would end up with a new column with a value of 2 because in this example column C and D pass the set of conditions.

Later I plan to do something for each row with a count value above some number.

The best way I was able to come up with looked something like this:

for col in cols:
    conditions.append(
        (
            (dataframe[f'{col}_1'] > min_corr_coef) & 
            (dataframe[f'{col}_2'] < (dataframe[f'{col_3}])
        ) | 
        (dataframe[f'{col}_1'] <= min_corr_coef)
    )

conditions.append(
    (dataframe[dataframe.loc[:,cols] > min_corr_coef].count(axis=1) >= min_corr_pair)
)

if conditions:
    dataframe.loc[
        reduce(lambda x, y: x & y, conditions),
        'pass'] = 1

Not only is this incorrect, but it is also extremely slow. What I like about it, is that it is readable and adding additional conditions if there are more columns/different data is relatively straightforward.

I am not sure if this type of operation is too complex to perform on a dataframe the way I am trying to and perhaps I need to change my data structure or not, but I figured I'd see if there is some way of going about this before I start to re-write things.

Ideally the logic being applied to the column groups (A, B, C, D in this case) can be modular and whatever method I use to do this should be easy to add/remove conditions because I want to be able to add/remove data columns and logic conditions over time.

What I am trying to do is test an arbitrary set of conditions when one of the columns for a group (which is a correlation coefficient) is above a certain value. If enough of the column groups pass the conditions, do something.

Thanks in advance, I am a python and pandas novice and this has been giving me a headache for days.

Upvotes: 1

Views: 70

Answers (2)

Henry Ecker
Henry Ecker

Reputation: 35626

Let's ignore date and handle the other columns first.

Remove date, create a MultiIndex using str.split, and stack to long form:

new_df = df.drop(columns='date')
new_df.columns = new_df.columns.str.split('_', expand=True)
new_df = new_df.stack(level=0)
     1  2  3
0 A  4  5  6
  B  2  3  1
  C  5  7  2
  D  4  3  1
1 A  3  3  2
  B  4  5  2
  C  6  2  3
  D  2  4  2
2 A  5  7  5
  B  1  3  3
  C  4  5  4
  D  8  2  2
3 A  6  1  8
  B  6  1  4
  C  1  2  7
  D  4  3  5

Apply the condition row-wise:

new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])

Then sum level 0 and assign back to the original df:

df['count'] = new_df['condition'].sum(level=0)

(Alternatively sum the conditions directly rather than assigning to both new_df and df)

df['count'] = (new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])).sum(level=0)

df:

  date  A_1  A_2  A_3  B_1  B_2  B_3  C_1  C_2  C_3  D_1  D_2  D_3  count
0  xxx    4    5    6    2    3    1    5    7    2    4    3    1      2
1  xxx    3    3    2    4    5    2    6    2    3    2    4    2      1
2  xxx    5    7    5    1    3    3    4    5    4    8    2    2      2
3  xxx    6    1    8    6    1    4    1    2    7    4    3    5      0

Complete Working Example:

import pandas as pd

df = pd.DataFrame({
    'date': ['xxx', 'xxx', 'xxx', 'xxx'], 'A_1': [4, 3, 5, 6],
    'A_2': [5, 3, 7, 1], 'A_3': [6, 2, 5, 8],
    'B_1': [2, 4, 1, 6], 'B_2': [3, 5, 3, 1],
    'B_3': [1, 2, 3, 4], 'C_1': [5, 6, 4, 1],
    'C_2': [7, 2, 5, 2], 'C_3': [2, 3, 4, 7],
    'D_1': [4, 2, 8, 4], 'D_2': [3, 4, 2, 3],
    'D_3': [1, 2, 2, 5]
})

new_df = df.drop(columns='date')
new_df.columns = new_df.columns.str.split('_', expand=True)
new_df = new_df.stack(level=0)
new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])
df['count'] = new_df['condition'].sum(level=0)
print(df)

Upvotes: 1

smci
smci

Reputation: 33940

Solution as one-liner:

>>> sum(df1.iloc[:, range(3*i,3*(i+1))].apply(lambda x: x[0] > 3 and (x[1] > x[2]), axis=1) for i in range(len(df1.columns)//3)).tolist()

[2]

Approach: Use .iloc[] slicing, if it's guaranteed that the column names occur in regular groups of three named X_1, X_2, X_3:

for i in range(len(df.columns)//3):
    df.iloc[:, range(3*i,3*(i+1))]
  
      A_1  A_2  A_3
date               
xxx     4    5    6
      B_1  B_2  B_3
date               
xxx     2    3    1
      C_1  C_2  C_3
date               
xxx     5    7    2
      D_1  D_2  D_3
date               
xxx     4    3    1

Then you can apply(..., axis=1) whatever lamba function or custom function you want on each slice of three columns:

for i in range(len(df1.columns)//3):
    # Your commands go here, print()ing just as an example
    print(df1.iloc[:, range(3*i,3*(i+1))])
    print(df1.iloc[:, range(3*i,3*(i+1))].apply(lambda x: x[0] > 3 and (x[1] > x[2]), axis=1))

      A_1  A_2  A_3
date               
xxx     4    5    6
date
xxx    False
dtype: bool
      B_1  B_2  B_3
date               
xxx     2    3    1
date
xxx    False
dtype: bool
      C_1  C_2  C_3
date               
xxx     5    7    2
date
xxx    True
dtype: bool
      D_1  D_2  D_3
date               
xxx     4    3    1
date
xxx    True
dtype: bool

Now you can apply() across each column-slice-of-three-columns, and then sum() vertically, or whatever you want (instead of for-loop with increment.)

Upvotes: 1

Related Questions