Reputation: 3382
I have a list of lists - representing a table with 4 columns and many rows (10000+).
Each sub-list contains 4 variables.
Here is a small part of my table:
['1810569', 'a', 5, '1241.52']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
['1810569', 'a', 5, '1993.52']
The first column represents the household ID, and the second represents the member ID within the household.
The fourth column represents weights that I want to sum - separately for each member.
For the example above I want the output to be:
['1810569', 'a', 5, '3235.04']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
In other words - I want to sum the weights in lines 1 and 5, since they are weights of the same member, while all the other members are distinct.
I saw something about groupby in pandas, but I didn't understand exactly how to use it for my problem.
Upvotes: 1
Views: 1208
Reputation: 180540
You could do it with a dict, using the first three elements as keys to group the data by:
d = {}
for k, b, c, w in l:  # l is your list of lists
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]

from pprint import pprint as pp
pp(list(d.values()))
Output:
[['1810569', 'b', 5, 1321.31],
['1437437', 'b', 5, 1232.43],
['1437437', 'a', 5, 1123.9],
['1810569', 'a', 5, 3235.04]]
If you wanted to maintain a first seen order:
from collections import OrderedDict

d = OrderedDict()
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]

from pprint import pprint as pp
pp(list(d.values()))
Output:
[['1810569', 'a', 5, 3235.04],
['1437437', 'a', 5, 1123.9],
['1437437', 'b', 5, 1232.43],
['1810569', 'b', 5, 1321.31]]
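As a side note: in Python 3.7+ plain dicts preserve insertion order, so a regular dict or a collections.defaultdict gives the same first-seen ordering without OrderedDict. A minimal sketch, assuming l is the list of lists from the question:

```python
from collections import defaultdict

l = [['1810569', 'a', 5, '1241.52'],
     ['1437437', 'a', 5, '1123.90'],
     ['1437437', 'b', 5, '1232.43'],
     ['1810569', 'b', 5, '1321.31'],
     ['1810569', 'a', 5, '1993.52']]

# defaultdict(float) starts every new key at 0.0, so we can just add
totals = defaultdict(float)
for household, member, count, weight in l:
    totals[household, member, count] += float(weight)

# rebuild rows in first-seen order (dicts keep insertion order in 3.7+)
result = [[h, m, c, w] for (h, m, c), w in totals.items()]
```

This avoids the if/else branch entirely, at the cost of a second pass to rebuild the rows.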
Upvotes: 0
Reputation: 394459
Assuming the following is your list, then this would work:
In [192]:
l=[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
l
Out[192]:
[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
In [201]:
# construct the df and convert the last column to float
import pandas as pd
df = pd.DataFrame(l, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)
df
Out[201]:
household ID Member ID some col weights
0 1810569 a 5 1241.52
1 1437437 a 5 1123.90
2 1437437 b 5 1232.43
3 1810569 b 5 1321.31
4 1810569 a 5 1993.52
So we can now groupby on the household and member IDs and call sum on the 'weights' column:
In [200]:
df.groupby(['household ID', 'Member ID'])['weights'].sum().reset_index()
Out[200]:
household ID Member ID weights
0 1437437 a 1123.90
1 1437437 b 1232.43
2 1810569 a 3235.04
3 1810569 b 1321.31
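If you also want to keep the third column and get back a list of lists in the question's shape, one option (a sketch, building the same DataFrame as above) is to group on all three key columns with as_index=False:

```python
import pandas as pd

l = [['1810569', 'a', 5, '1241.52'],
     ['1437437', 'a', 5, '1123.90'],
     ['1437437', 'b', 5, '1232.43'],
     ['1810569', 'b', 5, '1321.31'],
     ['1810569', 'a', 5, '1993.52']]

df = pd.DataFrame(l, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)

# grouping on all three key columns keeps 'some col' in the result;
# as_index=False makes the keys ordinary columns, so no reset_index is needed
summed = df.groupby(['household ID', 'Member ID', 'some col'],
                    as_index=False)['weights'].sum()
rows = summed.values.tolist()
```

Note that groupby sorts the group keys by default, so the rows come back in key order rather than first-seen order; pass sort=False to keep the original order.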
Upvotes: 2