Reputation: 359
I want to calculate average time for sequences AD-VV-DD
and AD-MM-PP
. This sequence may appear in any columns MD_*
. To calculate average time, the columns TIME_*
should be used:
df =
MD_1 MD_2 MD_3 MD_4 MD_5 TIME_1 TIME_2 TIME_3 TIME_4 TIME_5
NaN AD VV DD NaN NaN 3 2 1 NaN
AD VV DD NaN NaN 1 1 1 NaN NaN
AD MM PP NaN NaN 4 3 3 NaN NaN
TT AD MM NaN NaN 2 4 NaN NaN NaN
The result should be this one:
result =
MD_1_new MD_2_new MD_3_new TIME_1_new TIME_2_new TIME_3_new
AD VV DD 2 1.5 1
AD MM PP 4 3 3
The TIME_*
columns of the first row were calculated as follows: in df
there are two sequences AD-VV-DD
. The columns TIME_X
are selected based on the values of X
in MD_X
.
This is what I tried, but how do I calculate mean values of corresponding TIME_*
?:
def calculate_mean_times(df):
# Create drop rows
drop_rows = []
index = 0
for index, x in df.iterrows():
row = x.dropna()
if ((row[0] == 'AD') & (row[1] == 'VV') & (row[2] == 'DD')):
drop_rows.append(index)
index = index + 1
return drop_rows
# Drop the rows in list
correct_rows = calculate_mean_times(df)
df = df.loc[correct_rows,:]
Upvotes: 1
Views: 60
Reputation: 49862
Here is some code to meet your requirements. The primary organizing principle it to build a tuple
for each key we need to find, and then build a dict with these keys. For each row in the dataframe, check if there is a key present in each of the three possible locations. Check by seeing if the key is present in the dictionary. If present, store the aligned data values for later averaging.
Code:
# build a dict with tuple keys for the results
matches = {
('AD', 'VV', 'DD'): [],
('AD', 'MM', 'PP'): [],
}
# for each row check for key matches
for i, row in df.iterrows():
keys = tuple(row.values[0:5])
for j in range(3):
try:
# check if these three columns match one of our tuple keys
# if it matches append the three columns of data
matches[tuple(keys[j:j+3])].append(
row.values[5+j:8+j].astype(int))
break
except KeyError:
pass
# average the data
avg = {}
for k, v in matches.items():
avg[k] = sum(v) / float(len(v))
print(avg)
Test Data:
data = [x.strip().split() for x in """
MD_1 MD_2 MD_3 MD_4 MD_5 TIME_1 TIME_2 TIME_3 TIME_4 TIME_5
NaN AD VV DD NaN NaN 3 2 1 NaN
AD VV DD NaN NaN 1 1 1 NaN NaN
AD MM PP NaN NaN 4 3 3 NaN NaN
TT AD MM NaN NaN 2 4 NaN NaN NaN
""".split('\n')[1:-1]]
df = pd.DataFrame(data[1:], columns=data[0])
Output:
{('AD', 'VV', 'DD'): array([ 2. , 1.5, 1. ]), ('AD', 'MM', 'PP'): array([ 4., 3., 3.])}
Upvotes: 1