Reputation: 77
Here is my dataset:
Unique_ID No_of_Filings Req_1 Req_2 Req_3 Req_4
RCONF045 3 Blue Red White Violet
RCONF046 3 Blue Red White Brown
RCONF047 3 Blue Red White Brown
RCONF048 3 Black Yellow Green N/A
RCONF051 4 Black Yellow Green N/A
RCONF052 4 Black Brown Green Orange
I've extracted the unique values from the last 4 columns (Req_1 through Req_4) by the following:
pd.unique(df1[["Req_1","Req_2","Req_3","Req_4"]].values.ravel("K"))
Out[20]: array(['Blue', 'Black', 'Red', 'Yellow', 'Brown', 'White', 'Green',
'Violet', nan, 'Orange'], dtype=object)
Here's what I need for the output. Frequency = how many times it shows up in the last four columns (e.g. Yellow only shows up twice) and Number of Filings = sum(No_of_Filings if the requirement is in that row). For example, Blue is in the first three rows, so that's 3 + 3 + 3 = 9 and Brown is in the second, third, and sixth row, so it's 3 + 3 + 4 = 10
Requirements Frequency Number of Filings
Blue 3 9
Black 3 11
Red 3 9
Brown 3 10
White 3 9
Green 3 11
Yellow 2 7
N/A 2 7
Violet 1 3
Orange 1 4
How can I create those two columns in my newly-created dataframe above using pandas?
Thanks
Upvotes: 1
Views: 1533
Reputation: 323366
IIUC
df.drop('Unique_ID',1).melt('No_of_Filings').\
fillna('NaN').groupby('value')['No_of_Filings'].agg(['count','sum'])
Out[56]:
count sum
value
Black 3 11
Blue 3 9
Brown 3 10
Green 3 11
NaN 2 7
Orange 1 4
Red 3 9
Violet 1 3
White 3 9
Yellow 2 7
Upvotes: 2
Reputation: 51425
You could do something along these lines using agg
, but it takes some re-shaping beforehand. Here is a way to get it:
agg_df = (df.fillna('N/A').set_index(['Unique_ID', 'No_of_Filings'])
.stack()
.reset_index('No_of_Filings')
.groupby(0)
.agg(['sum', 'size'])
.reset_index())
agg_df.columns = ['Requirements', 'Number of Filings', 'Frequency']
>>> agg_df
Requirements Number of Filings Frequency
0 Black 11 3
1 Blue 9 3
2 Brown 10 3
3 Green 11 3
4 N/A 7 2
5 Orange 4 1
6 Red 9 3
7 Violet 3 1
8 White 9 3
9 Yellow 7 2
Upvotes: 2
Reputation: 59284
Notice that you can use np.unique
to get the count of the uniques right away (which already completes one of your objectives).
df= df.fillna('NA')
cols = ["Req_1","Req_2","Req_3","Req_4"]
u = pd.unique(df[cols].values.ravel("K"))
s = np.unique(df[cols].values.ravel("K"), return_counts=True)
df2 = pd.DataFrame({'colors': u}).fillna('N/A')
df2['freq'] = df2.colors.map(dict(zip(*s)))
df2['n'] = [df[(df[cols] == v).sum(1) >=1].No_of_Filings.sum() for v in df2.colors]
colors freq n
0 Blue 3 9
1 Black 3 11
2 Red 3 9
3 Yellow 2 7
4 Brown 3 10
5 White 3 9
6 Green 3 11
7 Violet 1 3
8 N/A 2 0
9 Orange 1 4
Upvotes: 2
Reputation: 294506
from collections import defaultdict
d = defaultdict(int)
for i, f, *r in df.values:
for v in r:
d[(v, 'filings')] += f
d[(v, 'frequency')] += 1
pd.Series(d).unstack().rename_axis('reqs').reset_index()
reqs filings frequency
0 NaN 7 2
1 Black 11 3
2 Blue 9 3
3 Brown 10 3
4 Green 11 3
5 Orange 4 1
6 Red 9 3
7 Violet 3 1
8 White 9 3
9 Yellow 7 2
Upvotes: 2