Wai Tong

Reputation: 339

Pandas: Discrepancy after group by

I ran a very simple aggregation by quarter in pandas and, out of curiosity, compared the totals before and after grouping.

    # pd.Grouper replaces pd.TimeGrouper, which has been removed from pandas
    dfQtr = df.groupby([pd.Grouper(key='Date', freq='Q'), 'JourneyType', 'OriginCode', 'DestinationCode']).agg(np.sum).reset_index()

    print(sum(dfQtr.TotalFlights), sum(df.TotalFlights))
                       941899              967205

@IanS My apologies, here is a subset of the fairly big data set:

    Date        JourneyType              OriginCode  DestinationCode  Total_Flights
    01/08/2015  T_A-M-R-A-S_M_R_M_S D_P  FLL         SDQ              1
    01/08/2015  T_A-M-R-A-S_M_R_M_S D_P  PAP         FLL              1
    01/08/2015  T_A-M-R-A-S_M_R_M_S D_P  TPA         BDL              1
    01/08/2015  T_A-M-R-A-S_M_R_M_S D_P  HPN         MCO              1
    01/08/2015  T_A-L-O-C-G_L_P_D_S D_P  FLL         PAP              1
    01/08/2015  T_A-L-O-C-G_L_P_D_S D_P  FLL         PAP              1
    01/08/2015  T_A-L-O-C-G_L_P_D_S D_P  FLL         PIT              1

The totals differ before and after aggregation (967205 in the original frame versus 941899 after grouping), and I wonder why that might be.

Many thanks! Will

Upvotes: 0

Views: 59

Answers (1)

elelias

Reputation: 4779

"NA groups in GroupBy are automatically excluded"

http://pandas.pydata.org/pandas-docs/stable/missing_data.html#na-values-in-groupby

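I'm guessing you have missing values in one of the grouping columns (Date, JourneyType, OriginCode or DestinationCode). Rows with a missing key are dropped from the groups, so their flights never make it into the aggregated sum.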
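
To check whether missing keys explain the gap, something like the sketch below may help. It uses a made-up three-row frame (not the asker's data) and groups only on the two airport-code columns to keep it short; the `dropna=False` option assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has a missing OriginCode.
df = pd.DataFrame({
    "OriginCode": ["FLL", np.nan, "TPA"],
    "DestinationCode": ["SDQ", "FLL", "BDL"],
    "TotalFlights": [1, 1, 1],
})
keys = ["OriginCode", "DestinationCode"]

# How many missing values does each grouping column contain?
print(df[keys].isna().sum())

# Default groupby silently drops rows whose key contains NaN.
grouped = df.groupby(keys)["TotalFlights"].sum().reset_index()
print(grouped["TotalFlights"].sum(), df["TotalFlights"].sum())  # 2 3

# Flights sitting on rows with at least one missing key:
# this should equal the difference between the two totals.
lost = df.loc[df[keys].isna().any(axis=1), "TotalFlights"].sum()

# With dropna=False (pandas >= 1.1) NaN keys form their own group.
kept = df.groupby(keys, dropna=False)["TotalFlights"].sum()
```

If `lost` matches the difference between your two sums (967205 - 941899 = 25306), missing keys are very likely the cause. You can then either fill them before grouping (e.g. with `fillna`) or keep them as their own group with `dropna=False`.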

Upvotes: 1
