Reputation: 187
Does anyone have suggestions why I can't run a sum on the last dataframe?
Also open to suggestions if there's a shorter way to accomplish splitting out the tags and summing the frequencies.
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)pandas_profiling
print("\nCalculate aggregates for tags :\n")
TagsDFGroupBy = df.groupby(['Tags','Lab Location' ]).agg({'ADO ID': ['count']}).rename(columns={'ADO ID':'WorkItemCnt'}).reset_index()
print(TagsDFGroupBy)
Produces the output
Tags | Labs | WorkItemCNT | --------------------------------------------------------------| ----- | --- |
0| A2040|RXY Lab|1 |
1| AWAITING COMMODITY QUAL|RXY Lab|1 |
2| DNR|RXY Lab|18 |
3| DNR; MISSING SKU DOC|RXY Lab|17 |
4| MISSING QUAL INTAKE REQUEST; MISSING SKU DOC; NEED HARDWARE|QXR Lab|1 |
5| MISSING SKU DOC|RXY Lab|2 |
6| MISSING SKU DOC; NEED HARDWARE|RXY Lab|1 |
7| MISSING SKU DOC; NEED RA|RXY Lab|1 |
8| NEED HARDWARE|RXY Lab|7 |
9| NEED HARDWARE|VYZ Lab|4 |
And then I run code to split out the tags and sum the frequencies
print("\nSplit tags by semicolumn delimiter" )
TagsDFGroupBy[['Tag1','Tag2','Tag3']] = TagsDFGroupBy.Tags.str.split(";",expand=True)
print("\nReplace none with blanks")
mask = TagsDFGroupBy.applymap(lambda x: x is None)
cols = TagsDFGroupBy.columns[(mask).any()]
for col in TagsDFGroupBy[cols]:
TagsDFGroupBy.loc[mask[col], col] = ''
print("\n3 different dataframes")
TagsDFGroupBy1 = TagsDFGroupBy[['Lab Location','Tag1','WorkItemCnt']].rename(columns={'Tag1':'TagSplit'})
TagsDFGroupBy2 = TagsDFGroupBy[['Lab Location','Tag2','WorkItemCnt']].rename(columns={'Tag2':'TagSplit'})
TagsDFGroupBy3 = TagsDFGroupBy[['Lab Location','Tag3','WorkItemCnt']].rename(columns={'Tag3':'TagSplit'})
print("\nCombine 3 different dataframes into 1")
TagsConcat = pd.concat([TagsDFGroupBy1, TagsDFGroupBy2, TagsDFGroupBy3], ignore_index=True)
# Get names of indexes for which TagSplit has a blank value
indexNames = TagsConcat[TagsConcat['TagSplit'] == '' ].index
# Delete these row indexes from dataFrame
TagsConcat.drop(indexNames , inplace=True)
TagsConcat.reset_index()
print('TagsConcat')
print(TagsConcat)
Which produces this output
Lab Location TagSplit WorkItemCnt
count
--------------|-------------------------------------- | ----------|
0 RXY LAB| A2040 |1
1 RXY LAB| AWAITING COMMODITY QUAL |1
2 RXY LAB| DNR |18
3 RXY LAB| DNR |17
4 QXR LAB| MISSING QUAL INTAKE REQUEST |1
5 RXY LAB| MISSING SKU DOC |2
6 RXY LAB| MISSING SKU DOC |1
7 RXY LAB| MISSING SKU DOC |1
8 RXY LAB| NEED HARDWARE |7
9 VYZ LAB| NEED HARDWARE |4
13 RXY LAB| MISSING SKU DOC |17
14 QXR LAB| MISSING SKU DOC |1
16 RXY LAB| NEED HARDWARE |1
17 RXY LAB| NEED RA |1
24 QXR LAB| NEED HARDWARE |1
Finally, I try to use either
TagsFinal.groupby(['Lab Location', 'TagSplit'])['WorkItemCnt'].sum()
or
TagsFinal = TagsConcat.groupby(['Lab Location', 'TagSplit']).agg({'WorkItemCnt': ['sum']})
and I get this error:
KeyError: 'WorkItemCnt'
Upvotes: 1
Views: 409
Reputation: 862771
I think your code is possible simplify - first are splitted column Tags
with DataFrame.explode
, and then is aggregate counts by GroupBy.size
:
TagsFinal = (df.assign(TagSplit = df['Tags'].str.split('; '))
.explode('TagSplit')
.groupby(['Labs', 'TagSplit'])
.size()
.reset_index(name='WorkItemCnt'))
print (TagsFinal)
Labs TagSplit WorkItemCnt
0 QXR Lab MISSING QUAL INTAKE REQUEST 1
1 QXR Lab MISSING SKU DOC 1
2 QXR Lab NEED HARDWARE 1
3 RXY Lab A2040 1
4 RXY Lab AWAITING COMMODITY QUAL 1
5 RXY Lab DNR 2
6 RXY Lab MISSING SKU DOC 4
7 RXY Lab NEED HARDWARE 2
8 RXY Lab NEED RA 1
9 VYZ Lab NEED HARDWARE 1
Upvotes: 1