Rachael
Rachael

Reputation: 11

Performing calculations on subset of data frame subset in Python

user_id   char_id   rating
100          33          3
100          44          2
100          33          1
100          44          4
111          55          5
111          44          4
111          55          5

I have a data frame formatted similarly to this one and am trying to perform calculations on the ratings after they have been grouped by user_id and char_id. It doesn't work but I need to do something like data.groupby('user_id', 'char_id') and then calculate the moving average for each char_id for each user_id. Any help? I have several thousand user_id so I can't go through and select one at a time for the calculations.

I need to somehow iterate over the user_id column and group all the same user_ids together, and save that format so that user_ids are separate. Then I need to do the same thing, iterating over char_id for each user_id subset and saving that format so that I can finally perform calculations on the subsets of subsets of ratings. So far all my attempts have been unsuccessful. The closest I came was:

def divide_by_user(data):
    for user in data['user_id']:
        user_data = data.where(data['user_id'] == user)
        return user_data

Upvotes: 1

Views: 2789

Answers (2)

Germán Alfaro
Germán Alfaro

Reputation: 600

Try this: "df" is the dataFrame

mean=pd.rolling_mean(df.rating, 7)

Upvotes: -1

Marius
Marius

Reputation: 60160

There's no need to do this manually, creating and summarizing subsets like this is exactly what DataFrame.groupby() is for. Create your groupby:

grouped = df.groupby(['user_id', 'char_id'])

Then you can apply a function to each subset. It sounds like you want either rolling_mean or expanding_mean, both of which are already available in pandas:

df['cum_average'] = grouped['rating'].apply(pd.expanding_mean)
# New column now contains the average rating for each subset,
#   including all values that have been seen so far.
df
Out[43]: 
   user_id  char_id  rating  cum_average
0      100       33       3            3
1      100       44       2            2
2      100       33       1            2
3      100       44       4            3
4      111       55       5            5
5      111       44       4            4
6      111       55       5            5

Using a larger randomly-generated dataset to demonstrate rolling_window():

df = pd.DataFrame({
    'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
    'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
    'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].apply(pd.rolling_mean, window=7)
# Output. The rolling average will be NaN until enough values have been
#   observed for that subset, you can change this using the
#   min_periods argument to rolling_window
df.sort(columns=['user_id', 'char_id'])
     char_id  rating  user_id  cum_average
3         33       1      100          NaN
19        33       2      100          NaN
22        33       5      100          NaN
34        33       1      100          NaN
47        33       1      100          NaN
48        33       1      100          NaN
49        33       1      100     1.714286
51        33       4      100     2.142857
55        33       2      100     2.142857
60        33       2      100     1.714286
66        33       2      100     1.857143
...
etc.

Upvotes: 2

Related Questions