Reputation: 11
I am new to Pandas and trying to figure out how to create/use functions, perform simple if-else statements. I wrote the following function average_shares()
, but it doesn't work as I expect (because it should be written for rows, not columns, but I don't get it/ don't know how to fix the following code. In a simple non-pandas world I would write for i in range(len(column)): do the following
. In pandas tutorials they didn't use for loops, they used apply()
; they created functions with a parameter row
(why not dataframe df
as parameter?). I tried to do the same and it didn't work. So overall, I just don't know how to do a simple thing in pandas and I hoped maybe someone could show me.
df = pd.DataFrame({'id': [1, 1, 1], 'week': [1, 1, 2], 'order': [2, 4, 3]})
df['index'] = df.groupby(['id', 'week']).cumcount()+1 #can be many id-s (not only id=1)
df['share'] = df['order']/df.groupby(['id', 'week'])['order'].transform('sum')
df['len'] = (df.groupby(['id', 'week']))['order'].transform('count')
def average_shares(df):
if df['repeats']>2:
df = df[(df['share'])>0.05 & (df['share']<0.95)]
df['share_1'] = df[df['index']==1].groupby('id')['share'].transform('mean')
df['share_2'] = df[df['index']==2].groupby('id')['share'].transform('mean')
elif df['repeats']==2:
df['share_1'] = df[df['index']==1].groupby('id')['share'].transform('mean')
df['share_2'] = df[df['index']==2].groupby('id')['share'].transform('mean')
elif df['repeats']<2:
df['share_1'] = df['share']
df['share_2'] = 0
return df['share_1'], df['share_2']
df['share_1'], df['share_2'] = zip(*df.apply(average_shares, axis=1))
The task sounds as
In the dataframe there are columns: id, week, order, share. Each week can be repeated more than once. For each id, share_1 is an average share of the first occurrences of weeks, and share_2 is an average share of the second occurrences. If for some id a week is repeated more than 2 times, then before averaging, the rows with a share >0.95 and <0.05 are discarded. If some week is not repeated, then the average share of the first occurrence is equal to the share itself, the average share of the second occurrence is 0.
Here is the data:
|input |calculated |I need to calculate |
|-------------|-----------------|---------------- ---------------|
|id|week|order|index|share |len|average share_1 |average share_2|
|--|----|-----|-----|-------|---|----------------|---------------|
|1 | 1 |2 |1 |2/(2+4)|2 |(2/6+1)/2 = 0.66|(4/6+0)/2 =0.33|
|1 | 1 |4 |2 |4/(2+4)|2 |(2/6+1)/2 = 0.66|(4/6+0)/2 =0.33|
|1 | 2 |3 |1 |1 |1 |(2/6+1)/2 = 0.66|(4/6+0)/2 =0.33|
Upvotes: -1
Views: 153
Reputation: 15211
trying to figure out how to create/use functions, perform simple if-else statements
This already needs to be re-thought. Numpy/Pandas don't generally talk in "if/else" - they talk in boolean predicate vectors, essentially arrays of True/False. This is done for performance purposes.
In pandas tutorials they didn't use for loops, they used
apply()
Both of those should be avoided. Instead use direct vectorised methods.
I tried to do the same and it didn't work. So overall, I just don't know how to do a simple thing in pandas
To be fair, what you're doing is not very simple. It's a fairly tricky multi-step operation.
You also need to significantly expand your input dataset to cover more test cases, including longer ID groups that both meet and do not meet the 5% criterion.
For example,
import pandas as pd
# More rows have been added for the 5% test case.
df = pd.DataFrame({
'id': (1, 1, 1, 2, 2, 2, 2),
'week': (1, 1, 2, 2, 2, 2, 1),
'order': (2, 4, 3, 40, 5, 1, 6),
})
# share is order divided by the sum of the week orders
by_id_week = df.groupby(['id', 'week'])
df['share'] = df['order']/by_id_week['order'].transform('sum')
# If for some ID a week is repeated more than two times, then before
# averaging, discard the rows with a share < 0.05 or > 0.95
to_average = df.copy()
to_average['n_weeks'] = by_id_week['order'].transform('count')
to_average.query(
expr='(n_weeks <= 2) | ((share >= 0.05) & (share <= 0.95))',
inplace=True,
)
to_average['week_idx'] = to_average.groupby(['id', 'week']).cumcount()
# share_1 is an average share of the first occurrences of weeks
# share_2 is an average share of the second occurrences
# etc.
week_count = 2
means_by_idx = (
(
to_average.groupby(['id', 'week_idx'])['share']
.sum() / week_count
)
.unstack('week_idx') # One column per week index
.loc[:, :2] # Only care about the first and second week
.rename(columns={0: 'share_1', 1: 'share_2'})
)
# At this point, means_by_idx has one row per ID, and one column per week index.
# The row index is equal to the ID value, so we can merge on it.
df = pd.merge(
left=df, right=means_by_idx,
left_on='id', right_index=True,
)
print(df)
Upvotes: 0