Jithin

Reputation: 1712

Merge DataFrames based on specific columns in pandas

Assume I have two DataFrames, t1h and t2h. I want to merge them in such a way that, for a specific list of columns, whenever two rows match on those columns, the contents of the remaining columns are added together.

t1h

      timestamp           ip       domain http_status  \
0  1475740500.0  192.168.1.1  example.com         200
1  1475740500.0  192.168.1.1  example.com         200
2  1475740500.0  192.168.1.1  example.com         201
3  1475740500.0  192.168.1.1  example.com         201
4  1475740500.0  192.168.1.1  example.com         202

      test  b_count           b_sum  test_count  test_sum  data1  \
0    False       46  24742949931480          46     9.250      0
1     True       48  28151237474796          48     9.040      0
2    False       36  21702308613722          36     7.896      0
3     True       24  13112423049120          24     5.602      0
4    False       62  29948023487954          62    12.648      0

   data2
0      0
1      0
2      0
3      0
4      0

t2h

      timestamp           ip       domain http_status  \
0  1475740500.0  192.168.1.1  example.com         200
1  1475740500.0  192.168.1.1  example.com         200
2  1475740500.0  192.168.1.1  example.com         201
3  1475740500.0  192.168.1.1  example.com         201
4  1475740500.0  192.168.1.1  example.com         202

      test  b_count           b_sum  test_count  test_sum  data1  \
0    False       44  22349502626302          44     9.410      0
1     True       32  16859760597754          32     5.988      0
2    False       46  23478212117794          46     8.972      0
3     True       36  20956236750016          36     7.124      0
4    False       54  35255787384306          54     9.898      0

   data2
0      0
1      0
2      0
3      0
4      0

Merging on the column list below gives me the following output:

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

pd.merge(t1h, t2h, on=groupby_fields)

      timestamp          ip        domain http_status  \
0  1475740500.0  192.168.1.1  example.com         200
1  1475740500.0  192.168.1.1  example.com         200
2  1475740500.0  192.168.1.1  example.com         201
3  1475740500.0  192.168.1.1  example.com         201
4  1475740500.0  192.168.1.1  example.com         202

      test  b_count_x         b_sum_x  test_count_x  test_sum_x  \
0    False         46  24742949931480            46       9.250
1     True         48  28151237474796            48       9.040
2    False         36  21702308613722            36       7.896
3     True         24  13112423049120            24       5.602
4    False         62  29948023487954            62      12.648

   data1_x  data2_x  b_count_y         b_sum_y  \
0        0        0         44  22349502626302
1        0        0         32  16859760597754
2        0        0         46  23478212117794
3        0        0         36  20956236750016
4        0        0         54  35255787384306

   test_count_y  test_sum_y  data1_y  data2_y
0            44       9.410        0        0
1            32       5.988        0        0
2            46       8.972        0        0
3            36       7.124        0        0
4            54       9.898        0        0
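For reference, I can get the sums out of this merged frame by adding each suffixed pair and keeping only the key columns plus the combined values (a sketch that assumes each key combination appears exactly once per frame, so the merge stays one-to-one), but it feels clunky:

# Columns to be summed: everything outside the merge keys
value_fields = [c for c in t1h.columns if c not in groupby_fields]

merged = pd.merge(t1h, t2h, on=groupby_fields, suffixes=('_x', '_y'))
# Collapse each _x/_y pair back into a single column
for col in value_fields:
    merged[col] = merged[col + '_x'] + merged[col + '_y']
merged = merged[groupby_fields + value_fields]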

The output I actually want looks like this:

Note: except for the columns in groupby_fields, every other column is of type int or float.

      timestamp          ip        domain http_status  \
0  1475740500.0  192.168.1.1  example.com         200
1  1475740500.0  192.168.1.1  example.com         200
2  1475740500.0  192.168.1.1  example.com         201
3  1475740500.0  192.168.1.1  example.com         201
4  1475740500.0  192.168.1.1  example.com         202

      test  b_count           b_sum  test_count  test_sum  \
0    False       90  47092452557782          90    18.660
1     True       80  45010998072550          80    15.028
2    False       82  45180520731516          82    16.868
3     True       60  34068659799136          60    12.726
4    False      116  65203810872260         116    22.546

   data1  data2  \
0      0      0  
1      0      0  
2      0      0  
3      0      0  
4      0      0  

Please let me know how I can achieve this in an optimized way.

Upvotes: 2

Views: 152

Answers (1)

Shivam Gaur

Reputation: 1062

This is a great use case for the groupby.agg() function.

Assuming that t1h and t2h already exist and have the same column names:

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

# Stack both frames on top of each other
df = pd.concat([t1h, t2h], ignore_index=True)

# 'sum' (not 'count') for the count columns: the desired output adds the
# per-frame counts together rather than counting rows per group
agg_dict = {'b_count': 'sum',
        'b_sum': 'sum',
        'test_count': 'sum',
        'test_sum': 'sum',
        'data1': 'sum',
        'data2': 'sum'}

df.groupby(groupby_fields).agg(agg_dict).reset_index()
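
If the set of value columns can change, agg_dict can also be built programmatically rather than by hand. A minimal sketch, assuming every non-key column should simply be summed:

# Sum every column that is not one of the grouping keys
agg_dict = {col: 'sum' for col in df.columns if col not in groupby_fields}

result = df.groupby(groupby_fields).agg(agg_dict).reset_index()

Since every aggregation here is a plain sum, df.groupby(groupby_fields, as_index=False).sum() gives the same result.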

Upvotes: 1
