Average the values of list of tuples which have same identifiers

Question

I have the following list of tuples. I call the first two elements of the tuples identifiers.

mylist = [('a', 'b', 1, 2),
          ('c', 'd', 5, 6),
          ('e', 'f', 0, 3),
          ('c', 'd', 7, 8),
          ('a', 'b', 9, 4)]

For rows in mylist that have the same identifiers, I would like to take the average of the third and fourth elements across the rows with the same identifiers and create a newlist like the one below.

newlist = [('a', 'b', 5, 3),
           ('c', 'd', 6, 7),
           ('e', 'f', 0, 3)]

Here is my working solution. But, there should be a more efficient solution. Since I implement this in a very large list (1.6m rows), I need something better.

Temp = [(i[0],i[1]) for i in mylist]
Temp2 = {i: [] for i in list(set(Temp))}
Temp3 = {i: [] for i in Temp2.keys()}
for i in Temp2.keys():
    for j in mylist:
        if i == (j[0],j[1]):
            Temp2[i].append(j[2])
            Temp3[i].append(j[3])
newlist = [(i[0],i[1],np.mean(Temp2[i]),np.mean(Temp3[i])) for i in Temp2.keys()]

Tom Harvey · Accepted Answer

The key to your performance issue is to not loop over your list many times. Especially if it’s large.

The below will group the values by identifier and then get the means and reform the desired output


delimiter = "-" # choose something not in your identifier. 

grouped_data = {}
for (id1, id2, val1, val2) in mylist:
    identifier = (delimiter).join([id1, id2])
    try:
        grouped_data[identifier]["col3"].append(val1)
        grouped_data[identifier]["col4"].append(val2)
    except KeyError:
        grouped_data[identifier] = {}
        grouped_data[identifier]["col3"] = [val1]
        grouped_data[identifier]["col4"] = [val2]

output_list = []
for identifier, values in grouped_data.items():
    this_output = identifier.split(delimiter)
    this_output.append(sum(values["col3"]) / len(values["col3"]))
    this_output.append(sum(values["col4"]) / len(values["col4"]))
    output_list.append(this_output)

print(output_list)

Here we only loop across mylist once. We then loop across the grouped_data once; but that’s a much smaller list.

Average the values of list of tuples which have same identifiers

Answers (2)

Related Questions