Reputation: 43
Suppose there is a data frame and statistics as following table:
Time | A | B | C | D | E | Combination A~E |
---|---|---|---|---|---|---|
1 | 10 | 20 | 5 | 8 | -8 | 7.00 |
2 | 20 | -3 | 6 | -5 | 10 | 5.60 |
3 | -5 | 40 | 5 | 4 | 20 | 12.80 |
4 | 6 | -8 | 2 | 18 | 5 | 4.60 |
5 | 8 | -1 | -4 | -5 | 4 | 0.40 |
Average | 7.80 | 9.60 | 2.80 | 4.00 | 6.20 | 6.08 |
Standard Deviation | 8.01 | 17.96 | 3.66 | 8.65 | 9.09 | 4.02 |
Ratio (Average/S.D) | 0.97 | 0.53 | 0.77 | 0.46 | 0.68 | 1.51 |
According to the above table, combination of all A~E give the overall average of 6.08, standard deviation of 4.02 and the ratio of 1.51.
By removing some of the columns A ~ E, maybe I can maximize the average, minimize the standard deviation, or doing the both (increasing the ratio).
(*Meaning that I will remove the entire values in the column, not individual value in each row)
Is there any way I can find out the "best combination" of the columns from A to E which give the:
A) Maximum average
B) Minimum standard deviation
C) Maximum ratio of A/B?
Desired answer might be something like: A, B, C A, B, E A, B, C, D or even A, B, C, D, E
Upvotes: 1
Views: 447
Reputation: 4543
For finding the best matching subset of each row you can create all possible subsets then find the max/min as you wish. For example the following code compute the best cols for the first condition:
import itertools
import numpy as np
def com(row):
s = []
for i in range(2, len(row)+1):
s.extend([x for x in itertools.combinations(row.values, i)])
vals = s[np.argmax([np.mean(x) for x in s])]
return row[row.isin(vals)].index.values
df.apply(com, axis = 1)
for the first two rows the result is as follows:
1 [A, B]
2 [A, E]
dtype: object
Based on the comment this may helps:
string = '''Time A B C D E
Average 7.80 9.60 2.80 4.00 6.20
Standard Deviation 8.01 17.96 3.66 8.65 9.09
Ratio (Average/S.D) 0.97 0.53 0.77 0.46 0.68'''
data = np.array([x.split(' ') for x in string.split('\n')])
df = pd.DataFrame(data[1:,1:], columns = data[0,1:], index = data[1:,0])
df = df.astype(float)
df.apply(com, axis = 1)
Output:
Average [A, B]
Standard Deviation [B, E]
Ratio (Average/S.D) [A, C]
dtype: object
You can choose the best based on your goal.
Upvotes: 2