Reputation: 592
I have the following list of tuples. I call the first two elements of the tuples identifiers.
mylist = [('a', 'b', 1, 2),
('c', 'd', 5, 6),
('e', 'f', 0, 3),
('c', 'd', 7, 8),
('a', 'b', 9, 4)]
For rows in mylist
that have the same identifiers, I would like to take the average of the third and fourth elements across the rows with the same identifiers and create a newlist
like the one below.
newlist = [('a', 'b', 5, 3),
('c', 'd', 6, 7),
('e', 'f', 0, 3)]
Here is my working solution. But, there should be a more efficient solution. Since I implement this in a very large list (1.6m rows), I need something better.
Temp = [(i[0],i[1]) for i in mylist]
Temp2 = {i: [] for i in list(set(Temp))}
Temp3 = {i: [] for i in Temp2.keys()}
for i in Temp2.keys():
for j in mylist:
if i == (j[0],j[1]):
Temp2[i].append(j[2])
Temp3[i].append(j[3])
newlist = [(i[0],i[1],np.mean(Temp2[i]),np.mean(Temp3[i])) for i in Temp2.keys()]
Upvotes: 0
Views: 50
Reputation: 4342
The key to your performance issue is to not loop over your list many times. Especially if it’s large.
The below will group the values by identifier and then get the means and reform the desired output
delimiter = "-" # choose something not in your identifier.
grouped_data = {}
for (id1, id2, val1, val2) in mylist:
identifier = (delimiter).join([id1, id2])
try:
grouped_data[identifier]["col3"].append(val1)
grouped_data[identifier]["col4"].append(val2)
except KeyError:
grouped_data[identifier] = {}
grouped_data[identifier]["col3"] = [val1]
grouped_data[identifier]["col4"] = [val2]
output_list = []
for identifier, values in grouped_data.items():
this_output = identifier.split(delimiter)
this_output.append(sum(values["col3"]) / len(values["col3"]))
this_output.append(sum(values["col4"]) / len(values["col4"]))
output_list.append(this_output)
print(output_list)
Here we only loop across mylist once. We then loop across the grouped_data
once; but that’s a much smaller list.
Upvotes: 1
Reputation: 6112
Can do it in a list comprehension
mylist = [('a', 'b', 1, 2),
('c', 'd', 5, 6),
('e', 'f', 0, 3),
('c', 'd', 7, 8),
('a', 'b', 9, 4)]
ids = [x[:2] for x in mylist]
newlist = [(*i, *[sum(x) / ids.count(i) for x in zip(*[x[2:] for x in mylist if x[:2] == i])]) for i in set(ids)]
print(newlist)
Upvotes: 1