Pandas grouping logic

Question

In A we have column named #type that contain objects like item1, item2 etc. each needs to become new column.

item1 have id, you can use it to find item2 with the same id, same goes for item3 and item4 (some may have missing data and have no entries for them but thats is rare, 0 can be put there)

So group items by ID, the date will be the same for all the entries with same ID. As for the value for the itemX use the result from A.

I have csv file A:

#Result;ID;Date;Events;type
12;1240422;10/01/2017 10:10;1;Item1
2;1241468;12/01/2017 09:15;0;Item1
7.8;1241469;12/01/2017 09:40;0;Item1
151;1241470;12/01/2017 10:00;1;Item1
0;1241471;12/01/2017 10:20;0;Item1
3;1241472;12/01/2017 10:40;0;Item1
0;1241473;12/01/2017 11:10;0;Item1
164;1240422;10/01/2017 10:10;1;Item2
162;1241468;12/01/2017 09:15;0;Item2
161;1241469;12/01/2017 09:40;0;Item2
161;1241470;12/01/2017 10:00;1;Item2
162;1241471;12/01/2017 10:20;0;Item2
349;1241472;12/01/2017 10:40;0;Item2
162;1241473;12/01/2017 11:10;0;Item2
7.8;1240422;10/01/2017 10:10;1;Item3
7.8;1241468;12/01/2017 09:15;0;Item3
7.8;1241469;12/01/2017 09:40;0;Item3
7.8;1241470;12/01/2017 10:00;1;Item3
7.9;1241471;12/01/2017 10:20;0;Item3
7.7;1241472;12/01/2017 10:40;0;Item3
7.8;1241473;12/01/2017 11:10;0;Item3
0;1240422;10/01/2017 10:10;1;Item4
0;1241468;12/01/2017 09:15;0;Item4
0;1241469;12/01/2017 09:40;0;Item4
0;1241470;12/01/2017 10:00;1;Item4
0;1241471;12/01/2017 10:20;0;Item4
0.17;1241472;12/01/2017 10:40;0;Item4
0;1241473;12/01/2017 11:10;0;Item4

Then I need output like this B:

Item1;Item2;Item3;Item4;Events;Date;ID
12;164;7.8;0;1;10/01/2017 10:10;1240422
2;162;7.8;0;0;12/01/2017 09:15;1241468
7.8;161;7.8;0;0;12/01/2017 09:40;1241469
151;161;7.8;0;1;12/01/2017 10:00;1241470
0;162;7.9;0;0;12/01/2017 10:20;1241471
3;349;7.7;0.17;0;12/01/2017 10:40;1241472
0;162;7.8;0;0;12/01/2017 11:10;1241473

Code below does the job but there is small issue: (if the event value differs in one of the entries then it just creates duplicate like that:

Item1;Item2;Item3;Item4;Events;Date;ID
3;349;;0.17;0;12/01/2017 10:40;1241472
;;7.7;;1;12/01/2017 10:40;1241472

The event value should be 1 if even one of them is 1 and zero if all are 0

import pandas as pd
from io import StringIO

csvtxt = StringIO("""Result;ID;Date;Events;type
12;1240422;10/01/2017 10:10;1;Item1
2;1241468;12/01/2017 09:15;0;Item1
7.8;1241469;12/01/2017 09:40;0;Item1
151;1241470;12/01/2017 10:00;1;Item1
0;1241471;12/01/2017 10:20;0;Item1
3;1241472;12/01/2017 10:40;0;Item1
0;1241473;12/01/2017 11:10;0;Item1
164;1240422;10/01/2017 10:10;1;Item2
162;1241468;12/01/2017 09:15;0;Item2
161;1241469;12/01/2017 09:40;0;Item2
161;1241470;12/01/2017 10:00;1;Item2
162;1241471;12/01/2017 10:20;0;Item2
349;1241472;12/01/2017 10:40;0;Item2
162;1241473;12/01/2017 11:10;0;Item2
7.8;1240422;10/01/2017 10:10;1;Item3
7.8;1241468;12/01/2017 09:15;0;Item3
7.8;1241469;12/01/2017 09:40;0;Item3
7.8;1241470;12/01/2017 10:00;1;Item3
7.9;1241471;12/01/2017 10:20;0;Item3
7.7;1241472;12/01/2017 10:40;0;Item3
7.8;1241473;12/01/2017 11:10;0;Item3
0;1240422;10/01/2017 10:10;1;Item4
0;1241468;12/01/2017 09:15;0;Item4
0;1241469;12/01/2017 09:40;0;Item4
0;1241470;12/01/2017 10:00;1;Item4
0;1241471;12/01/2017 10:20;0;Item4
0.17;1241472;12/01/2017 10:40;0;Item4
0;1241473;12/01/2017 11:10;0;Item4""")

df = pd.read_csv(csvtxt, sep=';')
df_out = df.set_index(['ID','Date','Events','type'])['Result'].unstack().reset_index()
df_out.to_csv('out.csv', sep=';', index=False)
!type out.csv

Any Suggestions? Thanks

RubenB · Accepted Answer

I believe this should do the trick:

import pandas as pd
from functools import reduce

df.set_index('ID', inplace=True)

# create a df based on all items 1, and keep date; note, this will make the date of item1 leading
df_base = df[df.type=='Item1'].loc[:, 'Date'].to_frame()

# pivot the df to get a column per type and Result as value
df_items = df.loc[:,['Result', 'type']].pivot(columns='type', values='Result')

# pivot the df to get a column per type and Events as value; check if any item had an event
df_events = df.loc[:,['Events', 'type']].pivot(columns='type', values='Events')
df_events = df_events.any(axis=1).astype(int).to_frame().rename(columns={0:'Events'})

# merge all dfs on ID
df_out = reduce(lambda left,right: pd.merge(left, right, left_index=True, right_index=True), [df_base, df_items, df_events])

# fill missing values with 0
df_out.fillna(value=0, inplace=True)

I've tested by changing the event of item 3 with ID 1241472 to 1 with no duplicate rows as a result.

Pandas grouping logic

Answers (1)

Related Questions