Reputation: 2939
I have a dataset that is structured as follows:
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data, columns=['participant', 'step_name', 'title', 'colour', 'class'])
which looks like:
+----+---------------+-------------+----------------+----------+---------+
| | participant | step_name | title | colour | class |
|----+---------------+-------------+----------------+----------+---------|
| 0 | 100 | first | acceptable | blue | A |
| 1 | 101 | first | acceptable | blue | B |
| 2 | 102 | second | not acceptable | blue | B |
| 3 | 103 | third | acceptable | green | A |
| 4 | 104 | second | not acceptable | green | B |
| 5 | 105 | first | acceptable | blue | A |
| 6 | 106 | first | not acceptable | green | A |
| 7 | 107 | first | acceptable | blue | A |
| 8 | 108 | second | acceptable | blue | A |
| 9 | 109 | third | acceptable | green | B |
+----+---------------+-------------+----------------+----------+---------+
Now I want to aggregate the dataset so that each row holds the count of each repeated value. So far I've managed to do this along two variables (step_name and title) as follows:
count_df = df[['participant', 'step_name', 'title']].groupby(['step_name', 'title']).count()
count_df = count_df.unstack()
count_df.fillna(0, inplace=True)
count_df.columns = count_df.columns.get_level_values(1)
count_df
+--------+--------------+------------------+
| | acceptable | not acceptable |
|--------+--------------+------------------|
| first | 4 | 1 |
| second | 1 | 2 |
| third | 2 | 0 |
+--------+--------------+------------------+
Now, though, I'd like an extra set of columns for the other variables (colour and class). Basically, I want to group and then unstack on those variables, but I'm not sure how to do it with more than two variables. Ultimately, I'd like my final table to look like this:
+-------+--------+--------+--------------+------------------+
| class | colour | step   |   acceptable |   not acceptable |
|-------+--------+--------+--------------+------------------|
| A     | blue   | first  |            3 |                0 |
| B     | blue   | first  |            1 |                0 |
| A     | green  | first  |            0 |                1 |
| B     | green  | first  |            0 |                0 |
| A     | blue   | second |            1 |                0 |
| B     | blue   | second |            0 |                1 |
| A     | green  | second |            0 |                0 |
| B     | green  | second |            0 |                1 |
| A     | blue   | third  |            0 |                0 |
| B     | blue   | third  |            0 |                0 |
| A     | green  | third  |            1 |                0 |
| B     | green  | third  |            1 |                0 |
+-------+--------+--------+--------------+------------------+
How do I reshape my data so that it looks like my final example? Do I still use the unstack and group functions?
Upvotes: 7
Views: 9042
Reputation: 863256
I think you need pivot_table with aggfunc=len, reset_index and rename_axis (new in pandas 0.18.0):
df = df.pivot_table(index=['class','colour','step_name'],
                    columns='title',
                    aggfunc=len,
                    values='participant',
                    fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)
class colour step_name acceptable not acceptable
0 A blue first 3 0
1 A blue second 1 0
2 A green first 0 1
3 A green third 1 0
4 B blue first 1 0
5 B blue second 0 1
6 B green second 0 1
7 B green third 1 0
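Note that the output above only contains the combinations that actually occur in the data, whereas the table in the question also lists all-zero rows such as B / green / first. If those rows are wanted, one possible approach is to reindex against the full product of the index levels. The following is a minimal sketch, assuming a reasonably recent pandas; it uses aggfunc='count' in place of len, and the names counts, full_index and result are only illustrative:

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second',
                      'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable',
                  'not acceptable', 'acceptable', 'not acceptable',
                  'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green',
                   'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

# Count participants per (class, colour, step_name) row and title column.
counts = df.pivot_table(index=['class', 'colour', 'step_name'],
                        columns='title',
                        values='participant',
                        aggfunc='count',
                        fill_value=0)

# Every class/colour/step_name combination, including unobserved ones.
index_names = list(counts.index.names)
full_index = pd.MultiIndex.from_product(
    [sorted(df[name].unique()) for name in index_names],
    names=index_names)

# Add the missing combinations as rows of zeros, then flatten the index
# and drop the leftover 'title' label on the column axis.
result = (counts.reindex(full_index, fill_value=0)
                .reset_index()
                .rename_axis(None, axis=1))
print(result)
```

With the sample data this should give one row per combination (2 classes x 2 colours x 3 steps), sorted by class, then colour, then step_name rather than in the order shown in the question.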
Upvotes: 7
Reputation: 210922
You can use pivot_table() for this:
In [130]: df['count'] = 1
In [134]: (df.pivot_table(index=['class','colour','step_name'], columns='title',
.....: values='count', aggfunc='sum', fill_value=0)
.....: .reset_index()
.....: )
Out[134]:
title class colour step_name acceptable not acceptable
0 A blue first 3 0
1 A blue second 1 0
2 A green first 0 1
3 A green third 1 0
4 B blue first 1 0
5 B blue second 0 1
6 B green second 0 1
7 B green third 1 0
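As for whether groupby and unstack can still be used: they can, by grouping on all four columns and unstacking only title. This is a sketch of an alternative rather than the method used above, assuming a recent pandas; it avoids the helper count column, and the name result is only illustrative:

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second',
                      'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable',
                  'not acceptable', 'acceptable', 'not acceptable',
                  'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green',
                   'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

result = (df.groupby(['class', 'colour', 'step_name', 'title'])
            .size()                          # rows per group, as a Series
            .unstack('title', fill_value=0)  # title values become columns
            .reset_index()                   # index levels back to columns
            .rename_axis(None, axis=1))      # drop the 'title' axis label
print(result)
```

Like pivot_table, this only returns the combinations of class, colour and step_name that are present in the data.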
Upvotes: 5