I want to merge rows with the same date and empIdn from this dataframe:
,empIdn,date,time_0,time_1,time_2,time_3,time_4
0,191206,2020-12-02,07:22:50,12:15:21,12:15:23,12:35:35
1,191206,2020-12-02,17:27:46,17:27:49,,
and this is what I want to achieve:
,empIdn,date,time_0,time_1,time_2,time_3,time_4,time_5,time_6
0,191206,2020-12-02,07:22:50,12:15:21,12:15:23,12:35:35,17:27:46,17:27:49
I would like some help on how to do this. Here is what I have done so far:
import pandas as pd

df1 = pd.read_csv("1.csv", index_col=[0])
df2 = pd.read_csv("2.csv", index_col=[0])
final_df = pd.concat([df1, df2], sort=True)
# collect the times for each (date, empIdn) pair into lists
final_df = final_df.groupby(['date', 'empIdn']).agg(lambda x: x.tolist())
final_df.to_csv("fff.csv")
I'm new to Pandas.
You need to first melt your dataframe and order it by the current index, empIdn and date. Then use groupby.cumcount() to create a new time_ counter based on that ordering. The final step is to create a new index and then unstack.
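For a concrete picture of the first step, here is a minimal sketch (the hard-coded frame is just the question's two rows, and long is a name introduced here for the melted result):

import pandas as pd

# the question's two rows, hard-coded for illustration
df = pd.DataFrame(
    {
        "empIdn": [191206, 191206],
        "date": ["2020-12-02", "2020-12-02"],
        "time_0": ["07:22:50", "17:27:46"],
        "time_1": ["12:15:21", "17:27:49"],
        "time_2": ["12:15:23", None],
        "time_3": ["12:35:35", None],
    }
)

# melt produces one long row per (original row index, time_* column) pair,
# keeping empIdn and date as identifier columns
long = pd.melt(df.reset_index(), id_vars=["empIdn", "date", "index"], var_name="time")
print(long.sort_values(["empIdn", "date", "index"]))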
Let's add a few records to your example to see what happens with non-duplicates.

The main difference between our expected outputs is that I drop time_4 for the first row due to it being NA. If this is not the intended behavior, you can first fillna the first row of a duplicate pair with an arbitrary value, something like df = df.fillna(df[df.duplicated(subset=['empIdn','date'],keep='last')].fillna(-1)) with -1 as a filler value.
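A minimal sketch of that workaround, assuming the five-row df printed below (the -1 sentinel is arbitrary; swap in whatever placeholder suits your data):

# mark every row that has a later duplicate on (empIdn, date);
# these are the "first" rows whose trailing NAs would otherwise be dropped
first_of_dupes = df.duplicated(subset=["empIdn", "date"], keep="last")

# fill only those rows' NAs with a sentinel so the melt/dropna step keeps the slots
df = df.fillna(df[first_of_dupes].fillna(-1))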
print(df)

   empIdn        date    time_0    time_1    time_2    time_3 time_4
0  191206  2020-12-02  07:22:50  12:15:21  12:15:23  12:35:35    NaN
1  191206  2020-12-02  17:27:46  17:27:49       NaN       NaN    NaN
2  191207  2020-12-02  07:22:50  12:15:21  12:15:23  12:35:35    NaN
3  191207  2020-11-02  07:22:50  12:15:21  12:15:23  12:35:35    NaN
4  191207  2020-12-02  17:27:46  17:27:49       NaN       NaN    NaN
Based on the above example, we can expect an output of three rows: empIdn 191206 and 191207 are each duplicated on 2020-12-02 and get merged, while the 191207 row for 2020-11-02 is left alone since it's not a duplicate.
df1 = (
    pd.melt(df.reset_index(), id_vars=["empIdn", "date", "index"], var_name="time")
    .dropna(subset=["value"])  # we don't care about nulls here.
    .sort_values(["empIdn", "date", "index"])
)

# renumber the surviving time_* values within each (empIdn, date) group
df1["time"] = "time_" + df1.groupby(["empIdn", "date"]).cumcount().astype(str)

final = (
    df1.set_index(["empIdn", "date", "time"])
    .drop(columns="index")
    .unstack("time")
    .droplevel(0, axis=1)
    .reset_index()
)

print(final)
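For reference, running this on the five-row sample above should print three rows along these lines (exact spacing may differ):

time  empIdn        date    time_0    time_1    time_2    time_3    time_4    time_5
0     191206  2020-12-02  07:22:50  12:15:21  12:15:23  12:35:35  17:27:46  17:27:49
1     191207  2020-11-02  07:22:50  12:15:21  12:15:23  12:35:35       NaN       NaN
2     191207  2020-12-02  07:22:50  12:15:21  12:15:23  12:35:35  17:27:46  17:27:49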
cumcount

From the output below you can see where the key increments to create our new key column. The key here is to order by your original index and group by empIdn and date:

print(df1.assign(key=df1.groupby(['empIdn','date']).cumcount()))
    empIdn        date  index    time     value  key
0   191206  2020-12-02      0  time_0  07:22:50    0
5   191206  2020-12-02      0  time_1  12:15:21    1
10  191206  2020-12-02      0  time_2  12:15:23    2
15  191206  2020-12-02      0  time_3  12:35:35    3
1   191206  2020-12-02      1  time_0  17:27:46    4  # <-- new row in original dataframe.
6   191206  2020-12-02      1  time_1  17:27:49    5  # <-- we want to increment these to time_4 / time_5
3   191207  2020-11-02      3  time_0  07:22:50    0
8   191207  2020-11-02      3  time_1  12:15:21    1
13  191207  2020-11-02      3  time_2  12:15:23    2
18  191207  2020-11-02      3  time_3  12:35:35    3
2   191207  2020-12-02      2  time_0  07:22:50    0
7   191207  2020-12-02      2  time_1  12:15:21    1
12  191207  2020-12-02      2  time_2  12:15:23    2
17  191207  2020-12-02      2  time_3  12:35:35    3
4   191207  2020-12-02      4  time_0  17:27:46    4
9   191207  2020-12-02      4  time_1  17:27:49    5