Carlo june Caimen

Reputation: 23

Merge rows with the same date and id and append to the right side

I want to merge rows with the same date and empIdn from this dataframe:

,empIdn,date,time_0,time_1,time_2,time_3,time_4
0,191206,2020-12-02,07:22:50,12:15:21,12:15:23,12:35:35
1,191206,2020-12-02,17:27:46,17:27:49,,

and I want to achieve this:

,empIdn,date,time_0,time_1,time_2,time_3,time_4,time_5,time_6
0,191206,2020-12-02,07:22:50,12:15:21,12:15:23,12:35:35,17:27:46,17:27:49

I would like some help on how to do this. Here is what I have done so far:

df1 = pd.read_csv("1.csv",index_col=[0])
df2 = pd.read_csv("2.csv",index_col=[0])

final_df = pd.concat([df1, df2], sort = True)
final_df.groupby(['date', 'empIdn']).agg(lambda x: x.tolist())

final_df.to_csv("fff.csv")

I'm new to Pandas.

Upvotes: 0

Views: 104

Answers (1)

Umar.H

Reputation: 23099

You first need to melt your dataframe, then order it by the current index, empIdn and date.

Then use the groupby.cumcount() method to create a new time_ counter based on the above ordering.

The final step is to create a new index and then unstack.

Let's add a few records to your example to see what happens with non-duplicates.

The main difference between our expected outputs is that I drop time_4 for the first row due to it being NA. If this is not the intended behavior, you can first fillna the NAs of the first of a pair of duplicate rows with an arbitrary value.

Something like df = df.fillna(df[df.duplicated(subset=['empIdn','date'],keep='last')].fillna(-1))
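For illustration, here is a minimal, runnable sketch of that fillna trick on a cut-down frame (only two of the time columns, to keep it short):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "empIdn": [191206, 191206],
    "date": ["2020-12-02", "2020-12-02"],
    "time_3": ["12:35:35", np.nan],
    "time_4": [np.nan, np.nan],
})

# rows that have a later duplicate on (empIdn, date): here, only row 0
mask = df.duplicated(subset=["empIdn", "date"], keep="last")

# fill the NAs of those rows with -1; other rows are left untouched,
# so row 0's NA in time_4 becomes -1 while row 1 keeps its NAs
df = df.fillna(df[mask].fillna(-1))
print(df)
```

With the NAs of the first duplicate row replaced by -1, they survive the dropna in the melt step below and the time_ numbering stays aligned.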

print(df)

   empIdn       date    time_0    time_1    time_2    time_3  time_4
0  191206 2020-12-02  07:22:50  12:15:21  12:15:23  12:35:35     NaN
1  191206 2020-12-02  17:27:46  17:27:49       NaN       NaN     NaN
2  191207 2020-12-02  07:22:50  12:15:21  12:15:23  12:35:35     NaN
3  191207 2020-11-02  07:22:50  12:15:21  12:15:23  12:35:35     NaN
4  191207 2020-12-02  17:27:46  17:27:49       NaN       NaN     NaN

Based on the above example, we can expect an output of three rows: (191206, 2020-12-02) and (191207, 2020-12-02) each get their duplicates merged, while (191207, 2020-11-02) is left alone since it's not a duplicate.


df1 = (
    pd.melt(df.reset_index(), id_vars=["empIdn", "date", "index"], var_name="time")
    .dropna(subset=["value"])  # we don't care about nulls here
    .sort_values(["empIdn", "date", "index"])
)

# renumber the time_ columns per (empIdn, date) group
df1["time"] = "time_" + df1.groupby(["empIdn", "date"]).cumcount().astype(str)

final = (
    df1.set_index(["empIdn", "date", "time"])
    .drop(columns="index")
    .unstack("time")
    .droplevel(0, axis=1)
    .reset_index()
)
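Putting it all together, here is a self-contained version of the above (the sample frame is rebuilt from the printed table, so the whole thing runs as one script):

```python
import pandas as pd
import numpy as np

# sample data from the printed df above
df = pd.DataFrame({
    "empIdn": [191206, 191206, 191207, 191207, 191207],
    "date": ["2020-12-02", "2020-12-02", "2020-12-02", "2020-11-02", "2020-12-02"],
    "time_0": ["07:22:50", "17:27:46", "07:22:50", "07:22:50", "17:27:46"],
    "time_1": ["12:15:21", "17:27:49", "12:15:21", "12:15:21", "17:27:49"],
    "time_2": ["12:15:23", np.nan, "12:15:23", "12:15:23", np.nan],
    "time_3": ["12:35:35", np.nan, "12:35:35", "12:35:35", np.nan],
    "time_4": [np.nan] * 5,
})

# melt to long form, keeping the original row order via the index
df1 = (
    pd.melt(df.reset_index(), id_vars=["empIdn", "date", "index"], var_name="time")
    .dropna(subset=["value"])
    .sort_values(["empIdn", "date", "index"])
)

# renumber the time_ columns per (empIdn, date) group
df1["time"] = "time_" + df1.groupby(["empIdn", "date"]).cumcount().astype(str)

# pivot back to wide form, one row per (empIdn, date)
final = (
    df1.set_index(["empIdn", "date", "time"])
    .drop(columns="index")
    .unstack("time")
    .droplevel(0, axis=1)
    .reset_index()
)
print(final)
```

This yields three rows with columns time_0 through time_5, where the non-duplicated (191207, 2020-11-02) row has NaN in time_4 and time_5.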

print(final)

[screenshot of the final output: three rows, with columns time_0 through time_5]


And with -1 as a filler value:

[screenshot of the output where time_4 of the first row is kept as -1]

Unpacking cumcount

From the output below you can see clearly where the key increments to create our new key column.

The key here is to order by your original index and group by empIdn and date:

print(df1.assign(key=df1.groupby(['empIdn','date']).cumcount()))
    empIdn        date index    time     value key
0   191206  2020-12-02     0  time_0  07:22:50   0
5   191206  2020-12-02     0  time_1  12:15:21   1
10  191206  2020-12-02     0  time_2  12:15:23   2
15  191206  2020-12-02     0  time_3  12:35:35   3
1   191206  2020-12-02     1  time_0  17:27:46   4 # < -- new row in original dataframe.
6   191206  2020-12-02     1  time_1  17:27:49   5 # < -- we want to increment these to time_4 / time_5
3   191207  2020-11-02     3  time_0  07:22:50   0
8   191207  2020-11-02     3  time_1  12:15:21   1
13  191207  2020-11-02     3  time_2  12:15:23   2
18  191207  2020-11-02     3  time_3  12:35:35   3
2   191207  2020-12-02     2  time_0  07:22:50   0
7   191207  2020-12-02     2  time_1  12:15:21   1
12  191207  2020-12-02     2  time_2  12:15:23   2
17  191207  2020-12-02     2  time_3  12:35:35   3
4   191207  2020-12-02     4  time_0  17:27:46   4
9   191207  2020-12-02     4  time_1  17:27:49   5

Upvotes: 1
