Reputation: 601
I'm trying to get some row data as columns with pandas.
My original dataframe is something like the following (with many more columns). Most data repeats for the same employee, but some of it changes, like the salary in this example. Employees have different numbers of entries (here employee 1 has two entries, employee 2 has four, and so on).
employee_id salary other1 other2 other3
1 50000 somedata1 somedata2 somedata3
1 48000 somedata1 somedata2 somedata3
2 80000 somedata20 somedata21 somedata22
2 77000 somedata20 somedata21 somedata22
2 75000 somedata20 somedata21 somedata22
2 74000 somedata20 somedata21 somedata22
3 60000 somedata30 somedata31 somedata32
I'm trying to get something like the following. The salary data should span a few columns, using the last available salary to fill in for employees with fewer entries (hence the repeated salary values in this example).
employee_id salary prevsalary1 prevsalary2 prevsalary3 other1 other2 other3
1 50000 48000 48000 48000 somedata1 somedata2 somedata3
2 80000 77000 75000 74000 somedata20 somedata21 somedata22
3 60000 60000 60000 60000 somedata30 somedata31 somedata32
I tried grouping:
df.groupby(["employee_id"])['salary'].nlargest(3).reset_index()
But I don't get all the columns; I can't find a way to preserve the rest of them. Do I need to merge or concatenate the result with the original dataframe?
Also, I get a column named "level_1". I think I could get rid of it by using reset_index(level=1, drop=True), but I believe that doesn't return a dataframe.
And finally, if I get this grouping right, I guess there's one more step to get the columns... maybe using pivot or unstack?
I'm starting my journey into machine learning and I keep scratching my head with this one, I hope you can help me :)
Creating dataset:
df = pd.DataFrame({'employee_id':[1,1,2,2,2,2,3],'salary':[50000,48000,80000,77000,75000,74000,60000]})
df['other1'] =['somedata1','somedata1','somedata20','somedata20','somedata20','somedata20','somedata30']
df['other2'] = df['other1'].apply(lambda x: x+'1')
df['other3'] = df['other1'].apply(lambda x: x+'2')
df
Out[59]:
employee_id salary other1 other2 other3
0 1 50000 somedata1 somedata11 somedata12
1 1 48000 somedata1 somedata11 somedata12
2 2 80000 somedata20 somedata201 somedata202
3 2 77000 somedata20 somedata201 somedata202
4 2 75000 somedata20 somedata201 somedata202
5 2 74000 somedata20 somedata201 somedata202
6 3 60000 somedata30 somedata301 somedata302
Upvotes: 2
Views: 84
Reputation: 765
def function1(dd: pd.DataFrame):
    # Build one wide row per employee: the group's salary column
    # transposed, labelled with the index entry of the group's first row.
    return pd.DataFrame(data=dd.T.to_numpy(), index=dd.iloc[[0]].index)

(
    df.set_index(['employee_id', 'other1', 'other2', 'other3'])
      .groupby('employee_id', as_index=False, group_keys=False)
      .apply(function1)
      .add_prefix('prevsalary')
      .rename(columns={'prevsalary0': 'salary'})
      .ffill(axis=1)
      .astype(int)
      .reset_index()
)
salary prevsalary1 prevsalary2 prevsalary3 other1 \
employee_id
1 50000.0 48000.0 48000.0 48000.0 somedata1
2 80000.0 77000.0 75000.0 74000.0 somedata20
3 60000.0 60000.0 60000.0 60000.0 somedata30
other2 other3
employee_id
1 somedata2 somedata3
2 somedata21 somedata22
3 somedata31 somedata32
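For readers who want to run this, here is a self-contained version of the same approach on the question's sample data (a sketch, trimmed to a single other column for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [1, 1, 2, 2, 2, 2, 3],
    'salary': [50000, 48000, 80000, 77000, 75000, 74000, 60000],
    'other1': ['somedata1', 'somedata1', 'somedata20', 'somedata20',
               'somedata20', 'somedata20', 'somedata30'],
})

def function1(dd: pd.DataFrame) -> pd.DataFrame:
    # One wide row per employee: the group's salary column transposed,
    # labelled with the index entry of the group's first row.
    return pd.DataFrame(data=dd.T.to_numpy(), index=dd.iloc[[0]].index)

out = (
    df.set_index(['employee_id', 'other1'])      # park non-salary data in the index
      .groupby('employee_id', group_keys=False)
      .apply(function1)                          # columns 0..3, NaN for short histories
      .add_prefix('prevsalary')
      .rename(columns={'prevsalary0': 'salary'})
      .ffill(axis=1)                             # repeat the last salary to the right
      .astype(int)
      .reset_index()
)
```

The ffill along axis=1 is what pads employees with fewer entries (1 and 3) with their last known salary.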
Upvotes: 0
Reputation: 1410
Pivot the table of salaries first, then merge with the non-salary data:
# first create a copy of the dataset without the salary column
dataset_without_salaries = df.drop('salary', axis=1).drop_duplicates()
# pivot only salary column
temp = pd.pivot_table(data=df[['salary']], index=df['employee_id'], aggfunc=list)
# expand the list
temp2 = temp.apply(lambda x: pd.Series(x['salary']), axis=1)
# merge the two together (reset the index so employee_id is a column to merge on)
final = pd.merge(temp2.reset_index(), dataset_without_salaries, on='employee_id')
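Put together as a runnable sketch on the sample data (with one extra step: an ffill across the columns so that shorter histories repeat the last salary, as the expected output requires):

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [1, 1, 2, 2, 2, 2, 3],
    'salary': [50000, 48000, 80000, 77000, 75000, 74000, 60000],
    'other1': ['somedata1', 'somedata1', 'somedata20', 'somedata20',
               'somedata20', 'somedata20', 'somedata30'],
})

# Non-salary data: one row per employee
dataset_without_salaries = df.drop('salary', axis=1).drop_duplicates()

# Collect each employee's salaries into a list, then expand into columns
temp = pd.pivot_table(data=df[['salary']], index=df['employee_id'], aggfunc=list)
temp2 = temp.apply(lambda x: pd.Series(x['salary']), axis=1)

# Repeat the last salary for shorter histories, then merge on employee_id
final = pd.merge(temp2.ffill(axis=1).reset_index(), dataset_without_salaries,
                 on='employee_id')
```

Note the salary columns come out named 0..3 here; they would still need renaming to match the expected output.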
Upvotes: -1
Reputation: 88226
One way is using pd.pivot_table with ffill:
g = df.groupby('employee_id')
cols = g.salary.cumcount()
out = df.pivot_table(index='employee_id', values='salary', columns=cols).ffill(axis=1)
# Create a list of column names matching the expected output
out.columns = ['salary'] + [f'prevsalary{i}' for i in range(1,len(out.columns))]
print(out)
salary prevsalary1 prevsalary2 prevsalary3
employee_id
1 50000.0 48000.0 48000.0 48000.0
2 80000.0 77000.0 75000.0 74000.0
3 60000.0 60000.0 60000.0 60000.0
Now we just need to join with the unique other columns from the original dataframe:
out = out.join(df.filter(like='other').groupby(df.employee_id).first())
print(out)
salary prevsalary1 prevsalary2 prevsalary3 other1 \
employee_id
1 50000.0 48000.0 48000.0 48000.0 somedata1
2 80000.0 77000.0 75000.0 74000.0 somedata20
3 60000.0 60000.0 60000.0 60000.0 somedata30
other2 other3
employee_id
1 somedata2 somedata3
2 somedata21 somedata22
3 somedata31 somedata32
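Since the asker also wondered about unstack: the same reshape can be sketched with cumcount plus unstack instead of pivot_table (equivalent here, because each (employee_id, position) pair holds exactly one salary):

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [1, 1, 2, 2, 2, 2, 3],
    'salary': [50000, 48000, 80000, 77000, 75000, 74000, 60000],
})

# 0 marks the current salary within each employee, 1..n the previous ones
pos = df.groupby('employee_id').cumcount()

# Move (employee_id, position) into the index, then unstack position into
# columns; employees with shorter histories get NaN, filled to the right.
wide = df.set_index(['employee_id', pos])['salary'].unstack().ffill(axis=1)
wide.columns = ['salary'] + [f'prevsalary{i}' for i in range(1, wide.shape[1])]
```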
Upvotes: 3