Reputation: 463
I have a dataset like below and i need all the different weights for each category in single row and the count
Sample_data
category weights
1 aa 3.2
2 aa 2.2
3 aa 4.2
4 bb 3.5
5 bb 4.5
6 aa 0.5
7 cc 0.6
8 bb 7.5
9 cc 6.6
10 dd 2.2
11 aa 3.3
12 bb 4.4
13 cc 5.5
14 dd 6.6
And what i need is the count of each unique category and the different weights of each category in the same row.
Expected output:
category count weight1 weight2 weight3 weight4 weight5
1 aa 5 3.2 2.2 4.2 0.5 3.3
2 bb 4 3.5 4.5 7.5 4.4
3 cc 3 0.6 6.6 5.5
4 dd 2 2.2 6.6
I thought
sampledata['category'].groupby(level = 0)
will work but it is not. Can some one help me how to do this in python.
Upvotes: 3
Views: 323
Reputation: 2320
You could also use unstack
after resetting the index of each group:
dfw = df.groupby(['category'])['weights'].apply(lambda i: i.reset_index(drop=True)).unstack(level=1)
The size (here the 'count'
) of each group is df.groupby(['category']).size()
.
dfw.rename(columns=lambda x: 'weight'+ str(x+1), inplace=True) # string concatenation to give column labels
dfw.insert(0, 'count', df.groupby(['category']).size()) #insert count at position 0
This yields:
count weight1 weight2 weight3 weight4 weight5
category
aa 5 3.2 2.2 4.2 0.5 3.3
bb 4 3.5 4.5 7.5 4.4 NaN
cc 3 0.6 6.6 5.5 NaN NaN
dd 2 2.2 6.6 NaN NaN NaN
Upvotes: 1
Reputation: 394071
I could probably shorten this but the following works:
In [51]:
cat = df.groupby('category')['weights'].agg({'count':'count', 'weight_cat':lambda x: list(x)}).reset_index()
cat
Out[51]:
category count weight_cat
0 aa 5 [3.2, 2.2, 4.2, 0.5, 3.3]
1 bb 4 [3.5, 4.5, 7.5, 4.4]
2 cc 3 [0.6, 6.6, 5.5]
3 dd 2 [2.2, 6.6]
In [52]:
cat = cat.join(cat['weight_cat'].apply(lambda x: pd.Series(x)))
cat
Out[52]:
category count weight_cat 0 1 2 3 4
0 aa 5 [3.2, 2.2, 4.2, 0.5, 3.3] 3.2 2.2 4.2 0.5 3.3
1 bb 4 [3.5, 4.5, 7.5, 4.4] 3.5 4.5 7.5 4.4 NaN
2 cc 3 [0.6, 6.6, 5.5] 0.6 6.6 5.5 NaN NaN
3 dd 2 [2.2, 6.6] 2.2 6.6 NaN NaN NaN
In [68]:
rename_cols = [col for col in cat if type(col) == int]
rename_weight_cols = ['weight'+str(col + 1) for col in rename_cols]
d = dict(zip(rename_cols, rename_weight_cols))
cat.rename(columns = d,inplace=True)
cat
Out[68]:
category count weight_cat weight1 weight2 weight3 \
0 aa 5 [3.2, 2.2, 4.2, 0.5, 3.3] 3.2 2.2 4.2
1 bb 4 [3.5, 4.5, 7.5, 4.4] 3.5 4.5 7.5
2 cc 3 [0.6, 6.6, 5.5] 0.6 6.6 5.5
3 dd 2 [2.2, 6.6] 2.2 6.6 NaN
weight4 weight5
0 0.5 3.3
1 4.4 NaN
2 NaN NaN
3 NaN NaN
So what the above does is first group on the 'category' column and perform an aggregation on the weight column, we create a count column and then we turn all the values for that group into a list and add this.
I then call apply
on that list to turn it into a Series, this will auto generate the names of the columns 0..4.
I then create a dict to rename the columns to weight1 through to 5 as desired.
Upvotes: 2
Reputation: 1741
Given that your sample data is in the form of a list of dictionaries called data
, where each dictionary has a category
key and a weight
key, then the following code will provide you with what you need:
trans_data = {}
for item in data:
if item['category'] in trans_data:
trans_data[item['category']]['count'] += 1
trans_data[item['category']]['weights'].append(item['weight'])
else:
trans_data[item['category']] = {'count': 1, 'weights': [item['weight'],]}
The data
data structure is assumed to be of the following form:
data = [{'category': 'aa', 'weight': 3.2}, {'category': 'bb', 'weight': 2.2}, {'category': 'aa', 'weight': 1.1}]
Upvotes: 0