Reputation: 1712
Assume I have two data frames, t1h
and t2h
. I want to merge them in such a way that wherever rows match on a specific list of columns, the contents of the remaining columns are added together.
t1h
timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202
test b_count b_sum test_count test_sum data1 \
0 False 46 24742949931480 46 9.250 0
1 True 48 28151237474796 48 9.040 0
2 False 36 21702308613722 36 7.896 0
3 True 24 13112423049120 24 5.602 0
4 False 62 29948023487954 62 12.648 0
data2
0 0
1 0
2 0
3 0
4 0
t2h
timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202
test b_count b_sum test_count test_sum data1 \
0 False 44 22349502626302 44 9.410 0
1 True 32 16859760597754 32 5.988 0
2 False 46 23478212117794 46 8.972 0
3 True 36 20956236750016 36 7.124 0
4 False 54 35255787384306 54 9.898 0
data2
0 0
1 0
2 0
3 0
4 0
Based on the column list below, I currently get this output:
groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']
pd.merge(t1h, t2h, on=groupby_fields)
timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202
test b_count_x b_sum_x test_count_x test_sum_x \
0 False 46 24742949931480 46 9.250
1 True 48 28151237474796 48 9.040
2 False 36 21702308613722 36 7.896
3 True 24 13112423049120 24 5.602
4 False 62 29948023487954 62 12.648
data1_x data2_x b_count_y b_sum_y \
0 0 0 44 22349502626302
1 0 0 32 16859760597754
2 0 0 46 23478212117794
3 0 0 36 20956236750016
4 0 0 54 35255787384306
test_count_y test_sum_y data1_y data2_y
0 44 9.410 0 0
1 32 5.988 0 0
2 46 8.972 0 0
3 36 7.124 0 0
4 54 9.898 0 0
Instead, I want the output to look like the following.
Note: apart from the columns in groupby_fields
, every other column is of type int
or float
timestamp ip domain http_status \
0 1475740500.0 192.168.1.1 example.com 200
1 1475740500.0 192.168.1.1 example.com 200
2 1475740500.0 192.168.1.1 example.com 201
3 1475740500.0 192.168.1.1 example.com 201
4 1475740500.0 192.168.1.1 example.com 202
test b_count b_sum test_count test_sum \
0 False 90 47092452557782 90 18.660
1 True 80 45010998072550 80 15.028
2 False 82 45180520731516 82 16.868
3 True 60 34068659799136 60 12.726
4 False 116 65203810872260 116 22.546
data1 data2 \
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
Please let me know how I can achieve this in an optimized way.
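For reference, one way to express this with pd.merge itself is to merge on the key columns and then add each suffixed _x/_y pair back together. The miniature t1h and t2h below are hypothetical two-row slices of the frames above (only b_count and b_sum kept, for brevity):

```python
import pandas as pd

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

# Hypothetical two-row slices of t1h and t2h (subset of columns, for brevity)
t1h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [46, 48],
    'b_sum': [24742949931480, 28151237474796],
})
t2h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [44, 32],
    'b_sum': [22349502626302, 16859760597754],
})

# Merge on the key columns, then collapse each _x/_y pair by addition
merged = pd.merge(t1h, t2h, on=groupby_fields, suffixes=('_x', '_y'))
value_cols = [c for c in t1h.columns if c not in groupby_fields]
for col in value_cols:
    merged[col] = merged[col + '_x'] + merged[col + '_y']
result = merged[groupby_fields + value_cols]
print(result)
```

This only works cleanly when every key combination appears exactly once in each frame; otherwise the merge multiplies rows.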
Upvotes: 2
Views: 152
Reputation: 1062
You can use the groupby.agg()
function. Assuming that t1h
and t2h
already exist and have the same column names:
groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']
df = t1h.append(t2h, ignore_index=True)
agg_dict = {'b_count': 'sum',
            'b_sum': 'sum',
            'test_count': 'sum',
            'test_sum': 'sum',
            'data1': 'sum',
            'data2': 'sum'}
df.groupby(groupby_fields).agg(agg_dict).reset_index()
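A runnable sketch of the same approach, using pd.concat in place of append (which is deprecated in newer pandas) and hypothetical two-row slices of the frames in place of the real data:

```python
import pandas as pd

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

# Hypothetical two-row slices of t1h and t2h (subset of columns, for brevity)
t1h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [46, 48],
    'b_sum': [24742949931480, 28151237474796],
})
t2h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [44, 32],
    'b_sum': [22349502626302, 16859760597754],
})

# Stack the two frames, then sum every non-key column within each group
df = pd.concat([t1h, t2h], ignore_index=True)
result = df.groupby(groupby_fields, as_index=False).sum()
print(result)
```

Unlike the merge route, this also works when a key combination appears more than once in either frame.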
Upvotes: 1