How to calculate mean values over varying columns?

Question

I want to calculate average time for sequences AD-VV-DD and AD-MM-PP. This sequence may appear in any columns MD_*. To calculate average time, the columns TIME_* should be used:

df = 
MD_1   MD_2   MD_3    MD_4   MD_5  TIME_1  TIME_2  TIME_3  TIME_4  TIME_5
NaN    AD     VV      DD     NaN   NaN     3       2       1       NaN
AD     VV     DD      NaN    NaN   1       1       1       NaN     NaN
AD     MM     PP      NaN    NaN   4       3       3       NaN     NaN
TT     AD     MM      NaN    NaN   2       4       NaN     NaN     NaN

The result should be this one:

result = 
MD_1_new   MD_2_new   MD_3_new   TIME_1_new TIME_2_new  TIME_3_new
AD         VV         DD         2          1.5         1
AD         MM         PP         4          3           3

The TIME_* columns of the first row were calculated as follows: in df there are two sequences AD-VV-DD. The columns TIME_X are selected based on the values of X in MD_X.

This is what I tried, but how do I calculate mean values of corresponding TIME_*?:

def calculate_mean_times(df):
    # Create drop rows
    drop_rows = []
    index = 0
    for index, x in df.iterrows():
        row = x.dropna()
        if ((row[0] == 'AD') & (row[1] == 'VV') & (row[2] == 'DD')):
            drop_rows.append(index)
        index = index + 1
    return drop_rows

# Drop the rows in list
correct_rows = calculate_mean_times(df)
df = df.loc[correct_rows,:]

Stephen Rauch · Accepted Answer

Here is some code to meet your requirements. The primary organizing principle it to build a tuple for each key we need to find, and then build a dict with these keys. For each row in the dataframe, check if there is a key present in each of the three possible locations. Check by seeing if the key is present in the dictionary. If present, store the aligned data values for later averaging.

Code:

# build a dict with tuple keys for the results
matches = {
    ('AD', 'VV', 'DD'): [],
    ('AD', 'MM', 'PP'): [],
}

# for each row check for key matches
for i, row in df.iterrows():
    keys = tuple(row.values[0:5])
    for j in range(3):
        try:
            # check if these three columns match one of our tuple keys
            # if it matches append the three columns of data
            matches[tuple(keys[j:j+3])].append(
                row.values[5+j:8+j].astype(int))
            break
        except KeyError:
            pass

# average the data
avg = {}
for k, v in matches.items():
    avg[k] = sum(v) / float(len(v))
print(avg)

Test Data:

data = [x.strip().split() for x in """
    MD_1   MD_2   MD_3    MD_4   MD_5  TIME_1  TIME_2  TIME_3  TIME_4  TIME_5
    NaN    AD     VV      DD     NaN   NaN     3       2       1       NaN
    AD     VV     DD      NaN    NaN   1       1       1       NaN     NaN
    AD     MM     PP      NaN    NaN   4       3       3       NaN     NaN
    TT     AD     MM      NaN    NaN   2       4       NaN     NaN     NaN
""".split('
')[1:-1]]
df = pd.DataFrame(data[1:], columns=data[0])

Output:

{('AD', 'VV', 'DD'): array([ 2. ,  1.5,  1. ]), ('AD', 'MM', 'PP'): array([ 4.,  3.,  3.])}

How to calculate mean values over varying columns?

Answers (1)

Related Questions