edesz
edesz

Reputation: 12406

Pandas compare each row to reference row - certain columns only

I have the following Pandas Dataframe**** in Python.

   Temp_Fact Oscillops_read         A         B         C         D         E         F         G         H         I         J
0          A          Today  0.710213  0.222015  0.814710  0.597732  0.634099  0.338913  0.452534  0.698082  0.706486  0.433162
1          B          Today  0.653489  0.452543  0.618755  0.555629  0.490342  0.280299  0.026055  0.138876  0.053148  0.899734
2          A          Aactl  0.129211  0.579690  0.641324  0.615772  0.927384  0.199651  0.652395  0.249467  0.262301  0.049795
3          A            DFE  0.743794  0.355085  0.637794  0.633634  0.810033  0.509244  0.470418  0.972145  0.647222  0.610636
4          C    Real_Mt_Olv  0.724282  0.332965  0.063078  0.004550  0.585398  0.869376  0.232148  0.630162  0.102206  0.232981
5          E         Q_Mont  0.221685  0.224834  0.110734  0.397999  0.814153  0.552924  0.981098  0.536750  0.251941  0.383994
6          D            DFE  0.655386  0.561297  0.305310  0.140998  0.433054  0.118187  0.479206  0.556546  0.556017  0.025070
7          F           Bryo  0.257884  0.228650  0.413149  0.285651  0.814095  0.275627  0.775620  0.392448  0.827725  0.935581
8          C          Aactl  0.017388  0.133848  0.939049  0.159416  0.923788  0.375638  0.331078  0.939089  0.098718  0.785569
9          C          Today  0.197419  0.595253  0.574718  0.373899  0.363200  0.289378  0.698455  0.252657  0.357485  0.020484
10         C           Pars  0.037771  0.683799  0.184114  0.545062  0.857000  0.295918  0.733196  0.613165  0.180642  0.254839
11         B           Pars  0.637346  0.090000  0.848710  0.596883  0.027026  0.792180  0.843743  0.461608  0.552165  0.215250
12         B           Pars  0.768422  0.017828  0.090141  0.108061  0.456734  0.803175  0.454479  0.501713  0.687016  0.625260
13         E       Tomorrow  0.860112  0.532859  0.091641  0.768896  0.635966  0.007211  0.656367  0.053136  0.482367  0.680557
14         D            DFE  0.801734  0.365921  0.243407  0.826373  0.904416  0.062448  0.801726  0.049983  0.433135  0.351150
15         F         Q_Mont  0.360710  0.330745  0.598830  0.582379  0.828019  0.467044  0.287276  0.470980  0.355386  0.404299
16         D      Last_Week  0.867126  0.600093  0.813257  0.005423  0.617543  0.657219  0.635255  0.314910  0.016516  0.689257
17         E      Last_Week  0.551499  0.724981  0.821087  0.175279  0.301397  0.304105  0.379553  0.971244  0.558719  0.154240
18         F           Bryo  0.511370  0.208831  0.260223  0.089106  0.121442  0.120513  0.099722  0.750769  0.860541  0.838855
19         E           Bryo  0.323441  0.663328  0.951847  0.782042  0.909736  0.512978  0.999549  0.225423  0.789240  0.155898
20         C       Tomorrow  0.267086  0.357918  0.562190  0.700404  0.961047  0.513091  0.779268  0.030190  0.460805  0.315814
21         B       Tomorrow  0.951356  0.570077  0.867533  0.365708  0.791373  0.232377  0.478656  0.003857  0.805882  0.989754
22         F          Today  0.963750  0.118826  0.264858  0.571066  0.761669  0.967419  0.565773  0.468971  0.466120  0.174815
23         B      Last_Week  0.291186  0.126748  0.154725  0.527029  0.021485  0.224272  0.259218  0.052286  0.205569  0.617701
24         F          Aactl  0.269308  0.655920  0.595518  0.404817  0.290342  0.447246  0.627082  0.306856  0.868357  0.979879

I also have a Series of values for each column:

df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)

As you can see, this is a Pandas Series and it is the average, for each column, over rows where Oscillops_read == 'Last_Week'. Here is the Series:

[0.56993702256121603, 0.48394061768804786, 0.59635616273775061, 0.23591030688019868, 0.31347492150330231, 0.39519847430740507, 0.42467546792253791, 0.4461465888887961, 0.26026797943899194, 0.48706569569369912]

I also have 2 lists:

1.

range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']

This list gives values that must be added to the dataframe df under certain conditions (described below).

2.

col_1 = list('DFA')
col_2 = list('ACEF')
col_3 = list('CEF')
col_4 = list('ABDF')
col_5 = list('DEF')
col_6 = list('AC')
col_7 = list('ABCDE')

These are lists of column names. These columns from df must be compared to the average Series above. So for example, for the 6th list col_6, columns A and C from each row of the dataframe df must be compared to columns A and C of the Series.

Problem: As I mentioned above, I need to compare specific columns from the dataframe df to the base Series df_base_val. The columns to be compared are listed in col_1, col_2, col_3, ..., col_7. Here is what I need to do:

Example: eg. use col_6 - this is the 6th list and it has column names A and C.

  1. Step 1: For row 1 of df, columns A and C are greater than df_base_val[A] and df_base_val[C] respectively.
  2. Step 2: Thus, for row 1, in a new column Range, enter the 6th element from the list range_name_list - the 6th element is Barometer_Output.

Example Output: After doing this, the 1st row becomes:

0          A          Today  0.710213  0.222015  0.814710  0.597732  0.634099  0.338913  0.452534  0.698082  0.706486  0.433162  'Barometer_Output'

Now, if this row were NOT to be greater than the Series in columns A and C and is not greater than the Series in columns from col_1, col_2, etc. then the Range column must be assigned the value 'Not_in_Range'. In that case, this row would become:

0          A          Today  0.710213  0.222015  0.814710  0.597732  0.634099  0.338913  0.452534  0.698082  0.706486  0.433162  'Not_in_Range'

Simplification and Question: In this example:

  1. I only compared the 1st row to the base series. I need to compare all rows of df individually to the base series and add an appropriate value.

  2. I only used the 6th list of columns - this was col_6. Similarly, I need to go through each list of column names - col_1, col_2, ...., col_7.

  3. If the row being compared is not greater than any of the lists col_1 to col_7, in the specified columns, then the column Range must be assigned the value 'Not_in_Range'.

Is there a way to do this? Maybe using loops?

**** to create the above dataframe, select it from above and copy. Then use the following code:

import pandas as pd
df = pd.read_clipboard()
print df

EDIT: If multiple conditions are met, I would need that they all be listed. i.e. if the row belongs to 'Swg' and 'Curnt', then I would need to list both of these in the Range column, or to create separate Range columns, or just Python lists, for each matching result. Range1 would list 'Swg' and column Range2 would list 'Curnt', etc.

Upvotes: 1

Views: 3581

Answers (2)

emveebeeare
emveebeeare

Reputation: 139

For starters I would create a dictionary with your condition sets where the keys can be used as indices for your range_name_list list:

conditions = {0: list('DFA'),
              1: list('ACEF'),
              2: list('CEF'),
              3: list('ABDF'),
              4: list('DEF'),
              5: list('AC'),
              6: list('ABCDE')}

The following code will then accomplish what I understand to be your task:

# Create your Range column to be filled in later.
df['Range'] = '|'
for index, row in df.iterrows():
  for ix, list in conditions.iteritems():
    # Create a list of the outcomes of checking whether the
    # value for each condition column is greater than the 
    # df_base_val average.
    truths = [row[column] > df_base_val[column] for column in list]
    # See if all checks evaluated to True
    if sum(truths) == len(truths):
      # If so, set the 'Range' column's value for the current row
      # to the appropriate 'range_name'
      df.ix[index, 'Range'] = df.ix[index, 'Range'] + range_name_list[ix] + "|"
# Fill in all rows where no conditions were met with 'Not_in_Range'
df['Range'][df['Range'] == '|'] = 'Not_in_Range'

Upvotes: 2

HYRY
HYRY

Reputation: 97331

try this code:

df = pd.read_csv(BytesIO(txt), delim_whitespace=True)
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)
columns = ['DFA', 'ACEF', 'CEF', 'ABDF', 'DEF', 'AC', 'ABCDE']
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']

ranges = pd.Series(["NOT_IN_RANGE" for _ in range(df.shape[0])], index=df.index)

for name, cols in zip(range_name_list, columns):
    cols = list(cols)
    idx = df.index[(df[cols] > df_base_val[cols]).all(axis=1)]
    ranges[idx] = name

print ranges

But I don't know what you want if there are multiple range match with one row.

Upvotes: 2

Related Questions