Reputation: 645
Given a dataframe with the 'name' of a particular event, the year it occurred, and its type, which looks like:
index name year type extracolumns
0 'a' 2014 'X' stuff
1 'a' 2014 'X' stuff
2 'a' 2014 'Y' stuff
3 'a' 2014 'Y' stuff
4 'a' 2015 'X' stuff
5 'a' 2015 'X' stuff
6 'a' 2015 'Y' stuff
7 'b' 2014 'X' stuff
8 'b' 2015 'Y' stuff
9 'c' 2014 'Z' stuff
I would like a dataframe indexed by the event's 'name', that has the count of each 'type', by year. Ideally it will look like:
index type_X_2014 type_Y_2014 type_Z_2014 type_X_2015 type_Y_2015 type_Z_2015
'a' 2 2 0 2 1 0
'b' 1 0 0 0 1 0
'c' 0 0 1 0 0 0
This should have nunique(types)*nunique(years) columns. Obviously this will involve groupby and some value_counts. Is there a way to dynamically accomplish this, without hardcoding the column names? Thanks much.
Upvotes: 0
Views: 1288
Reputation: 879451
You could use pd.crosstab
to create a frequency table:
import sys
import pandas as pd
pd.options.display.width = sys.maxsize
df = pd.DataFrame({'extracolumns': ['stuff', 'stuff', 'stuff', 'stuff', 'stuff', 'stuff', 'stuff', 'stuff', 'stuff', 'stuff'], 'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'name': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'c'], 'type': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'X', 'Y', 'Z'], 'year': [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2014, 2015, 2014]})
result = pd.crosstab(df['name'], [df['year'], df['type']], dropna=False)
result.columns = ['type_{}_{}'.format(typ,year) for year,typ in result.columns]
print(result)
yields
type_X_2014 type_Y_2014 type_Z_2014 type_X_2015 type_Y_2015 type_Z_2015
name
a 2 2 0 2 1 0
b 1 0 0 0 1 0
c 0 0 1 0 0 0
If you don't want to hardcode the column names, but you know the position (ordinal index) of the columns then you could use iloc
to reference the columns by position:
result = pd.crosstab(df.iloc[:,1], [df.iloc[:, 2], df.iloc[:, 3]])
The dropna=False
causes crosstab
to keep columns even if all the frequencies are all zero. This ensures that there are nunique(types)*nunique(years)
columns -- including type_Z_2015
.
Upvotes: 4
Reputation: 57033
First, prepare a summary dataframe:
u = df.groupby(['name','year','type']).count()\
.unstack(0).fillna(0).astype(int)
Unfortunately, it does not have the right index. Now, fix the index by concatenating the first- and second-level indexes:
u.index = ['type_{}_{}'.format(*kind[0]) for kind in zip(u.index)]
Unfortunately, it does not have the right columns. Now, fix the columns by removing the first-level indexes:
u.columns = u.columns.get_level_values(1)
u
#name a b c
#type_X_2014 2 1 0
#type_Y_2014 2 0 0
#type_Z_2014 0 0 1
#type_X_2015 2 0 0
#type_Y_2015 1 1 0
And transpose the rows and columns, if you wish:
u = u.T
Upvotes: 0