Reputation: 1
I need some help calculating 4 metrics for my table. What is the correct way to repeatedly refer to the same row (the head of the department) within each department?
I have been trying to solve this problem for a very long time, but I cannot come to a logical solution. I tried Partition from SQLAlchemy, but my skills weren't enough, and I tried writing a function with def, but functions are my weak point.
Therefore, I am asking for help with a solution or advice. Thank you!!!
(My real task is not about salaries, so don't worry about that!)
EXAMPLE:
import pandas as pd
df2=pd.DataFrame({
'head_department': [ 'Anna','Anna', 'Anna','Anna','Anna',
'John', 'John','John',
'Denis', 'Denis',
'Sarah',
'Greg', 'Greg','Greg','Greg', 'Greg','Greg'],
'serial_number': [1,2,3,4,5,1,2,3,1,2,1,1,2,3,4,5,6],
'Worker_in_dpt': ['Anna','Anna-s_worker_2','Anna-s_worker_3','Anna-s_worker_4','Anna-s_worker_5',
'John','John-s worker_2','John-s_worker_3',
'Denis','Denis-s_worker_2',
'Sarah',
'Greg','Greg-s_worker_2', 'Greg-s_worker_3', 'Greg-s_worker_4','Greg-s_worker_5', 'Greg-s_worker_6'],
'jan_salary': [1000, 600, 600, 500, 900, None, 600, 500, 1200, 800, 1400, None, 700, 600, 600, 450, 700],
'feb_salary': [1100, 700, 700, 700, 800, None, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 700],
'mar_salary': [1200, 800, 800, 900, 700, 1300, 600, 500, 1800, 600, 1100, 1600, 400, 700, 600, 250, 700],
'apr_salary': [1600, 900, 900, 700, 700, 2300, 500, 400, 1800, 900, 1100, 1900, 200, 900, 500, 150, 700],
'may_salary': [1100, 700, 700, 700, 800, 2300, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 700],
'jun_salary': [1200, 800, 800, 900, 700, 1300, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 800],
'jul_salary': [1000, 600, 600, 500, 900, 1300, 600, 500, 1200, 800, 1400, 1200, 700, 600, 600, 450, 700],
'aug_salary': [1100, 700, 700, 700, 800, 2300, 600, 500, 1800, 600, 1100, 1600, 400, 700, 600, 250, None]
})
df2['serial_number'] = df2['serial_number'].astype('str')  # otherwise it is treated as numeric when I use numeric_only=True
df2['mean_salary_per_period'] = df2.mean(numeric_only=True, axis=1)
df2
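As a quick sanity check on that last line, on a single toy row (made-up numbers, not the real table) `numeric_only=True` drops the string column and `mean()` skips the NaN month rather than counting it as zero:

```python
import pandas as pd

# Toy one-row frame: numeric_only=True excludes the string serial_number
# column, and the NaN month is skipped by mean(), not treated as zero.
row = pd.DataFrame({
    "serial_number": ["1"],
    "jan_salary": [None],
    "feb_salary": [1400.0],
    "mar_salary": [1600.0],
})
print(row.mean(numeric_only=True, axis=1))  # row 0 -> 1500.0
```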
I'm trying to calculate the following columns,
WHERE:
df2['corr with head in dpt'] - the correlation, over the whole period, of each worker's salary with the salary of the head of that department; the head's flag is always serial_number = 1!
df2['mean _when_1_in _NAN'] - each worker's average salary over the months in which the head of the department is NaN. If the head has no NaN months, then the average over the whole period.
df2['mean _when_1_in_NAN'] - each worker's average salary starting from the first month in which the head of the department has a salary. If the head had no NaN months, then the average over the whole period.
df2['sum all dpt in Jan'] - the sum of all the department's January salaries, including the head's, even if the head's is NaN.
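To make the last metric concrete, here is a minimal sketch (toy data, not my real table) of a per-department January sum using groupby.transform, where the head's NaN simply contributes nothing to the sum:

```python
import pandas as pd

# Toy department: the head's January salary is missing (None).
toy = pd.DataFrame({
    "head_department": ["John", "John", "John"],
    "jan_salary": [None, 600, 500],
})
# transform('sum') broadcasts the department total back to every row;
# NaN is skipped, so the total is 600 + 500 = 1100.
toy["sum all dpt in Jan"] = (
    toy.groupby("head_department")["jan_salary"].transform("sum")
)
print(toy["sum all dpt in Jan"].tolist())  # [1100.0, 1100.0, 1100.0]
```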
Thank you!!!
Upvotes: 0
Views: 55
Reputation: 2822
The following will give you the desired output, using groupby.apply() with a function:
df = df2.loc[:, ~df2.columns.isin(
    ["Worker_in_dpt", "mean_salary_per_period"])] \
    .set_index(['head_department', 'serial_number']).stack(dropna=False)

def func(frame):
    # unstack so that serial numbers become the columns (months stay as rows)
    frame = frame.unstack(level=["serial_number"])
    # correlation matrix; its first row is each column's correlation
    # with the head (serial number 1)
    corr = frame.corr().iloc[0]
    # months that are NaN for serial number 1 (the head, first column)
    nan = frame.iloc[:, 0].isna()
    # mean over the months in which the head is not NaN
    mean_not_nan = frame.loc[~nan, :].mean(axis=0)
    # mean over the NaN months if there are any, else same as the non-NaN mean
    if nan.sum() > 0:
        mean_nan = frame.loc[nan, :].mean(axis=0)
    else:
        mean_nan = mean_not_nan.copy()
    # department-wide sum for January, broadcast to every worker
    sum_jan = pd.Series(
        data=frame.loc[
            frame.index.get_level_values(-1).str.contains("jan")
        ].sum(axis=1).iloc[0],
        index=frame.columns)
    # concat the four series and name the resulting columns
    return pd.concat(
        [corr, mean_nan, mean_not_nan, sum_jan], axis=1) \
        .set_axis(
            ['corr with head in dpt',
             'mean _when_1_in _NAN',
             'mean _when_1_in_NAN',
             'sum all dpt in Jan'], axis=1)

# group by head_department and apply func
additional_cols = df.groupby(level="head_department").apply(func)

# merge df2 with the grouped output
out = pd.merge(df2, additional_cols,
               left_on=['head_department', 'serial_number'],
               right_index=True)
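To see why corr().iloc[0] gives the "correlation with the head" values, here is the same step on a toy wide frame (made-up numbers) where column "1" is the head, months are rows, and the other columns are workers:

```python
import pandas as pd

# Months as rows, serial numbers as columns; column "1" is the head.
frame = pd.DataFrame({
    "1": [1000.0, 1100.0, 1200.0],  # head's salary rises linearly
    "2": [600.0, 700.0, 800.0],     # moves in lockstep with the head
    "3": [900.0, 800.0, 700.0],     # moves opposite to the head
})
# The first row of the correlation matrix holds each column's
# correlation with the head's column.
corr = frame.corr().iloc[0]
print(corr.round(2).tolist())  # [1.0, 1.0, -1.0]
```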
Upvotes: 1