sato
sato

Reputation: 862

Merging certain rows in pandas dataframe

I have this dataframe, consisting in 73 rows:

Date    Col1    Col2   Col3
1975   float   float  float
1976   float   float  float
1976   float   float  float
1977   float   float  float
1978   float   float  float
....
....

There are certain years appearing twice because the values were taken twice that year. What I want to do is to merge those rows where the year is the same, taking the mean value of each column for those specific two rows. The fact is that I am still familiarizing with pandas and I don't really understand the usage of the loc and iloc selectors. This is what I have tried, but I am sure this is completely wrong and non-pythonic:

for i in range(72):
    if df.Date[i]==df.Date[i+1]:
        df.Very_satisfied[i]= (df.Very_satisfied[i]+df.Very_satisfied[i+1])/2
        df.Fairly_satisfied[i]= (df.Fairly_satisfied[i]+df.Fairly_satisfied[i+1])/2
        df.NV_satisfied[i]= (df.NV_satisfied[i]+ df.NV_satisfied[i+1])/2
        df.Not_satisfied[i]= (df.Not_satisfied[i]+ df.Not_satisfied[i+1])/2
        df.DK[i]= (df.DK[i]+ df.DK[i+1])/2
        a=i+1
        str(a)
        df.drop(a)

where "very satisfied", "fairly satisfied" ecc. are the columns. The point in my code is: if two years are the same calculate the mean of each value, substitute it in the first row and delete the second row. I really need something smarter and more elegant.

Upvotes: 0

Views: 1225

Answers (1)

Unatiel
Unatiel

Reputation: 1080

You can use groupby() and then mean() for this. Here is an example :

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': list(range(25)) * 2, 'col1': np.random.random(50) * 100, 'col2': np.random.random(50)})
df.groupby('date').mean()

This will take all the rows which the same date, calculate the mean value of all the rows in the group for each column.

This outputs on my sample :

df.groupby('date').mean().head()
           col1      col2
date
0     42.881950  0.436073
1     32.114299  0.309742
2     96.819446  0.809071
3     30.606661  0.284257
4     40.690211  0.624972

For this input :

df[df['date'] < 5]

    date       col1      col2
0      0  67.268605  0.393560
1      1  55.864578  0.508636
2      2  97.735942  0.861162
3      3  58.014599  0.117055
4      4   7.429489  0.637101
25     0  18.495296  0.478585
26     1   8.364020  0.110848
27     2  95.902950  0.756980
28     3   3.198724  0.451460
29     4  73.950932  0.612843

Upvotes: 1

Related Questions