tcokyasar
tcokyasar

Reputation: 592

Average the values of list of tuples which have same identifiers

I have the following list of tuples. I call the first two elements of the tuples identifiers.

mylist = [('a', 'b', 1, 2),
          ('c', 'd', 5, 6),
          ('e', 'f', 0, 3),
          ('c', 'd', 7, 8),
          ('a', 'b', 9, 4)]

For rows in mylist that have the same identifiers, I would like to take the average of the third and fourth elements across the rows with the same identifiers and create a newlist like the one below.

newlist = [('a', 'b', 5, 3),
           ('c', 'd', 6, 7),
           ('e', 'f', 0, 3)]

Here is my working solution. But, there should be a more efficient solution. Since I implement this in a very large list (1.6m rows), I need something better.

Temp = [(i[0],i[1]) for i in mylist]
Temp2 = {i: [] for i in list(set(Temp))}
Temp3 = {i: [] for i in Temp2.keys()}
for i in Temp2.keys():
    for j in mylist:
        if i == (j[0],j[1]):
            Temp2[i].append(j[2])
            Temp3[i].append(j[3])
newlist = [(i[0],i[1],np.mean(Temp2[i]),np.mean(Temp3[i])) for i in Temp2.keys()]

Upvotes: 0

Views: 50

Answers (2)

Tom Harvey
Tom Harvey

Reputation: 4342

The key to your performance issue is to not loop over your list many times. Especially if it’s large.

The below will group the values by identifier and then get the means and reform the desired output


delimiter = "-" # choose something not in your identifier. 

grouped_data = {}
for (id1, id2, val1, val2) in mylist:
    identifier = (delimiter).join([id1, id2])
    try:
        grouped_data[identifier]["col3"].append(val1)
        grouped_data[identifier]["col4"].append(val2)
    except KeyError:
        grouped_data[identifier] = {}
        grouped_data[identifier]["col3"] = [val1]
        grouped_data[identifier]["col4"] = [val2]

output_list = []
for identifier, values in grouped_data.items():
    this_output = identifier.split(delimiter)
    this_output.append(sum(values["col3"]) / len(values["col3"]))
    this_output.append(sum(values["col4"]) / len(values["col4"]))
    output_list.append(this_output)

print(output_list)

Here we only loop across mylist once. We then loop across the grouped_data once; but that’s a much smaller list.

Upvotes: 1

alec
alec

Reputation: 6112

Can do it in a list comprehension

mylist = [('a', 'b', 1, 2),
          ('c', 'd', 5, 6),
          ('e', 'f', 0, 3),
          ('c', 'd', 7, 8),
          ('a', 'b', 9, 4)]

ids = [x[:2] for x in mylist]
newlist = [(*i, *[sum(x) / ids.count(i) for x in zip(*[x[2:] for x in mylist if x[:2] == i])]) for i in set(ids)]
print(newlist)

Upvotes: 1

Related Questions