Reputation: 12519
I have a dataframe that looks like the below (my actual dataframe is around 200k rows). Each row is a unique user's distance from the origin point and the time since they logged in last.
+------------+------------------+
| time_since | dist_from_origin |
+------------+------------------+
| 1 | 0.12 |
| 0 | 71.15 |
| 4 | 74.48 |
| 2 | 19.69 |
| 3 | 0.12 |
| 0 | 132.92 |
| 1 | 174.45 |
| 4 | 0.12 |
| 3 | 81.07 |
| 2 | 65.43 |
| 1 | 162.58 |
| 0 | 0.12 |
| 3 | 0.12 |
| 4 | 20.68 |
| 3 | 109.93 |
| 2 | 100.87 |
| 0 | 131.50 |
| 0 | 132.74 |
| 4 | 181.35 |
| 3 | 150.55 |
+------------+------------------+
I would like to create a new dataframe which summarizes the data like the example below. Where column 1 is the number of days inactive, column two is the count of users (rows?) in the group and each following column is a count of the distances that fit into that particular criteria.
+---------------+-----------+-----+-------------+--------------+---------------+----------------+----------------+
| days inactive | num users | <=5 | >5 and <=25 | >25 and <=50 | >50 and <=100 | >100 and <=150 | >150 and <=225 |
+---------------+-----------+-----+-------------+--------------+---------------+----------------+----------------+
| 0 | 5 | 1 | 0 | 0 | 1 | 3 | 0 |
| 1 | 3 | 1 | 0 | 0 | 0 | 0 | 2 |
| 2 | 3 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 5 | 2 | 0 | 0 | 1 | 1 | 1 |
| 4 | 4 | 1 | 1 | 0 | 1 | 0 | 1 |
+---------------+-----------+-----+-------------+--------------+---------------+----------------+----------------+
I tried to accomplish the summary dataframe by looping through the groups in a groupby object, where the object was grouped by time since
but I had some difficulty creating the summaries for each group and appending to a new data frame.
For example. I am hitting errors with one of the first steps: counting the rows in each group:
grouped = df.groupby(df['time_since'])
for group in grouped:
print group['time_since'].count()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-2182867bdacc> in <module>()
1 for group in grouped:
----> 2 print group['time_since'].count()
TypeError: tuple indices must be integers, not str
I was thinking another option would be to create boolean masks for each of my cases and then summarize each mask.
What I'm asking is, what is the simplest way to go from my original dataframe to the summary dataframe?
Upvotes: 1
Views: 698
Reputation: 76927
Here's one way to do it.
Create your BINS
In [144]: BINS = [0,5,25,50,100,150,225]
Then, group by on 'time_since'
and buckets pd.cut(df['dist_from_origin'], BINS)]
In [145]: (df.groupby(['time_since', pd.cut(df['dist_from_origin'], BINS)])
.size().unstack('dist_from_origin').fillna(0))
Out[145]:
dist_from_origin (0, 5] (100, 150] (150, 225] (5, 25] (50, 100]
time_since
0 1 3 0 0 1
1 1 0 2 0 0
2 0 1 0 1 1
3 2 1 1 0 1
4 1 0 1 1 1
Upvotes: 2