Reputation: 12406
I have the following Pandas Dataframe**** in Python.
Temp_Fact Oscillops_read A B C D E F G H I J
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162
1 B Today 0.653489 0.452543 0.618755 0.555629 0.490342 0.280299 0.026055 0.138876 0.053148 0.899734
2 A Aactl 0.129211 0.579690 0.641324 0.615772 0.927384 0.199651 0.652395 0.249467 0.262301 0.049795
3 A DFE 0.743794 0.355085 0.637794 0.633634 0.810033 0.509244 0.470418 0.972145 0.647222 0.610636
4 C Real_Mt_Olv 0.724282 0.332965 0.063078 0.004550 0.585398 0.869376 0.232148 0.630162 0.102206 0.232981
5 E Q_Mont 0.221685 0.224834 0.110734 0.397999 0.814153 0.552924 0.981098 0.536750 0.251941 0.383994
6 D DFE 0.655386 0.561297 0.305310 0.140998 0.433054 0.118187 0.479206 0.556546 0.556017 0.025070
7 F Bryo 0.257884 0.228650 0.413149 0.285651 0.814095 0.275627 0.775620 0.392448 0.827725 0.935581
8 C Aactl 0.017388 0.133848 0.939049 0.159416 0.923788 0.375638 0.331078 0.939089 0.098718 0.785569
9 C Today 0.197419 0.595253 0.574718 0.373899 0.363200 0.289378 0.698455 0.252657 0.357485 0.020484
10 C Pars 0.037771 0.683799 0.184114 0.545062 0.857000 0.295918 0.733196 0.613165 0.180642 0.254839
11 B Pars 0.637346 0.090000 0.848710 0.596883 0.027026 0.792180 0.843743 0.461608 0.552165 0.215250
12 B Pars 0.768422 0.017828 0.090141 0.108061 0.456734 0.803175 0.454479 0.501713 0.687016 0.625260
13 E Tomorrow 0.860112 0.532859 0.091641 0.768896 0.635966 0.007211 0.656367 0.053136 0.482367 0.680557
14 D DFE 0.801734 0.365921 0.243407 0.826373 0.904416 0.062448 0.801726 0.049983 0.433135 0.351150
15 F Q_Mont 0.360710 0.330745 0.598830 0.582379 0.828019 0.467044 0.287276 0.470980 0.355386 0.404299
16 D Last_Week 0.867126 0.600093 0.813257 0.005423 0.617543 0.657219 0.635255 0.314910 0.016516 0.689257
17 E Last_Week 0.551499 0.724981 0.821087 0.175279 0.301397 0.304105 0.379553 0.971244 0.558719 0.154240
18 F Bryo 0.511370 0.208831 0.260223 0.089106 0.121442 0.120513 0.099722 0.750769 0.860541 0.838855
19 E Bryo 0.323441 0.663328 0.951847 0.782042 0.909736 0.512978 0.999549 0.225423 0.789240 0.155898
20 C Tomorrow 0.267086 0.357918 0.562190 0.700404 0.961047 0.513091 0.779268 0.030190 0.460805 0.315814
21 B Tomorrow 0.951356 0.570077 0.867533 0.365708 0.791373 0.232377 0.478656 0.003857 0.805882 0.989754
22 F Today 0.963750 0.118826 0.264858 0.571066 0.761669 0.967419 0.565773 0.468971 0.466120 0.174815
23 B Last_Week 0.291186 0.126748 0.154725 0.527029 0.021485 0.224272 0.259218 0.052286 0.205569 0.617701
24 F Aactl 0.269308 0.655920 0.595518 0.404817 0.290342 0.447246 0.627082 0.306856 0.868357 0.979879
I also have a Series of values for each column:
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)
As you can see, this is a Pandas Series and it is the average, for each column, over rows where Oscillops_read == 'Last_Week'
. Here is the Series:
[0.56993702256121603, 0.48394061768804786, 0.59635616273775061, 0.23591030688019868, 0.31347492150330231, 0.39519847430740507, 0.42467546792253791, 0.4461465888887961, 0.26026797943899194, 0.48706569569369912]
I also have 2 lists:
1.
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']
This list gives values that must be added to the dataframe df
under certain conditions (described below).
2.
col_1 = list('DFA')
col_2 = list('ACEF')
col_3 = list('CEF')
col_4 = list('ABDF')
col_5 = list('DEF')
col_6 = list('AC')
col_7 = list('ABCDE')
These are lists of column names. These columns from df
must be compared to the average Series above. So for example, for the 6th list col_6, columns A
and C
from each row of the dataframe df
must be compared to columns A
and C
of the Series.
Problem:
As I mentioned above, I need to compare specific columns from the dataframe df
to the base Series df_base_val
. The columns to be compared are listed in col_1, col_2, col_3, ..., col_7
. Here is what I need to do:
col_1
(eg. if a row for columns A
and C
) is greater than the base Series df_base_val
in those 2 columns then for that row, in a new column Range
, enter the 6th value from the list range_name_list
.Example:
eg. use col_6
- this is the 6th list and it has column names A
and C
.
df
, columns A
and C
are greater than
df_base_val[A]
and df_base_val[C]
respectively.Range
, enter the 6th element from the list range_name_list
- the 6th element is Barometer_Output
.Example Output: After doing this, the 1st row becomes:
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162 'Barometer_Output'
Now, if this row were NOT to be greater than the Series in columns A
and C
and is not greater than the Series in columns from col_1
, col_2
, etc. then the Range
column must be assigned the value 'Not_in_Range'. In that case, this row would become:
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162 'Not_in_Range'
Simplification and Question: In this example:
I only compared the 1st row to the base series. I need to compare
all rows of df
individually to the base series and add an appropriate value.
I only used the 6th list of columns - this was col_6
. Similarly, I need to go through each list of column names - col_1
, col_2
, ...., col_7
.
If the row being compared is not greater than any of the lists col_1
to col_7
, in the specified columns, then the column Range
must be assigned the value 'Not_in_Range'.
Is there a way to do this? Maybe using loops?
**** to create the above dataframe, select it from above and copy. Then use the following code:
import pandas as pd
df = pd.read_clipboard()
print df
EDIT: If multiple conditions are met, I would need that they all be listed. i.e. if the row belongs to 'Swg' and 'Curnt', then I would need to list both of these in the Range column, or to create separate Range columns, or just Python lists, for each matching result. Range1 would list 'Swg' and column Range2 would list 'Curnt', etc.
Upvotes: 1
Views: 3581
Reputation: 139
For starters I would create a dictionary with your condition sets where the keys can be used as indices for your range_name_list list:
conditions = {0: list('DFA'),
1: list('ACEF'),
2: list('CEF'),
3: list('ABDF'),
4: list('DEF'),
5: list('AC'),
6: list('ABCDE')}
The following code will then accomplish what I understand to be your task:
# Create your Range column to be filled in later.
df['Range'] = '|'
for index, row in df.iterrows():
for ix, list in conditions.iteritems():
# Create a list of the outcomes of checking whether the
# value for each condition column is greater than the
# df_base_val average.
truths = [row[column] > df_base_val[column] for column in list]
# See if all checks evaluated to True
if sum(truths) == len(truths):
# If so, set the 'Range' column's value for the current row
# to the appropriate 'range_name'
df.ix[index, 'Range'] = df.ix[index, 'Range'] + range_name_list[ix] + "|"
# Fill in all rows where no conditions were met with 'Not_in_Range'
df['Range'][df['Range'] == '|'] = 'Not_in_Range'
Upvotes: 2
Reputation: 97331
try this code:
df = pd.read_csv(BytesIO(txt), delim_whitespace=True)
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)
columns = ['DFA', 'ACEF', 'CEF', 'ABDF', 'DEF', 'AC', 'ABCDE']
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']
ranges = pd.Series(["NOT_IN_RANGE" for _ in range(df.shape[0])], index=df.index)
for name, cols in zip(range_name_list, columns):
cols = list(cols)
idx = df.index[(df[cols] > df_base_val[cols]).all(axis=1)]
ranges[idx] = name
print ranges
But I don't know what you want if there are multiple range match with one row.
Upvotes: 2