user5057431

Pandas dataframe: calculate sum of values for each hour?

I am a beginner in Python and mostly used to MATLAB, so I'm having a bit of trouble with the syntax.

I am working with a pandas DataFrame, df, that has several columns. One column, df['Hour'], holds the time as strings, and another, df['values'], holds counts. I want to calculate the sum of 'values' for each distinct Hour.

Here is my approach. Can someone give me advice on how to translate this basic idea to something clean? There must be a simpler way to do this!

Thanks a lot for the help!

# first sort my dataframe by ascending hours
df = df.sort(['Hour'],ascending=[1])

# initialize a new empty column in df, called "change"
df['change'] = np.zeros(len(df['Hour']))

# loop over df, finding all the indices where the hour changes (where "i" is not equal to "i-1"). I call this array of changes "A"
i = 0
for i in range(len(df)):
    A = numpy.where(df['Hour'][i] != df['Hour'][i-1]) 
    #if the index is the same any value of A, that means the Hour has changed
    if i == A:
    #assign to df['change'] the sum of df['values'] for the current hour (sum of values for hour = 0, then hour = 1, then hour = 2, etc).
        df['change'] = "df['values'].sum() where df['Hour'] is the same" #how can I write this?
i = i+1

Upvotes: 1

Views: 3805

Answers (2)

TheBlackCat

Reputation: 10298

There are several ways to do this. Some approaches are faster or easier than others.

Approach 1: groupby given a column name. This is the fastest if the thing you want to group by is a column rather than the index.

>>> %timeit df.groupby('Hour')['values'].sum()
1000 loops, best of 3: 1.35 ms per loop
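
As a minimal sketch of Approach 1, here it is on a small made-up frame. The sample data is hypothetical and only mirrors the column names from the question.

```python
import pandas as pd

# Hypothetical sample data: 'Hour' holds strings, 'values' holds counts.
df = pd.DataFrame({
    "Hour": ["00", "01", "00", "02", "01", "00"],
    "values": [1, 2, 3, 4, 5, 6],
})

# Group rows by the 'Hour' column, then sum 'values' within each group.
# Select the column with df["values"]: the attribute df.values is the
# DataFrame's underlying NumPy array, not the column named 'values'.
hourly = df.groupby("Hour")["values"].sum()
print(hourly)
```

The result is a Series indexed by the distinct Hour strings, sorted by key by default, so no explicit sort is needed beforehand.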

Approach 2: groupby on the Hour values. This is Ami's approach. It is good when the thing you want to group by is not in the DataFrame you want to group, but will be slower in your case.

>>> %timeit df['values'].groupby(df.Hour).sum()
100 loops, best of 3: 6.95 ms per loop
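
To illustrate when Approach 2 is useful, the sketch below groups one Series by a second Series that lives outside it. Both Series and their contents are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data: the counts and the hour labels are held separately.
values = pd.Series([1, 2, 3, 4, 5, 6], name="values")
hours = pd.Series(["00", "01", "00", "02", "01", "00"], name="Hour")

# Passing a Series as the grouping key aligns it with 'values' on the
# index, so the key does not have to be a column of the grouped object.
hourly = values.groupby(hours).sum()
print(hourly)
```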

Approach 3: Convert the Hour column to an index, then sum on the index level. This is the fastest approach by a considerable margin so long as the thing you want to group by is already an index. In your case I think having Hour be an index will simplify a lot of things, so this is the approach I would use. Note that the level argument of sum was deprecated in pandas 1.3 and removed in 2.0; on newer versions, use the groupby(level=0) form from Approach 4 instead.

>>> df.set_index('Hour', inplace=True)
>>> %timeit df.sum(level=0)
1000 loops, best of 3: 744 µs per loop

Approach 4: Convert the Hour column to an index, groupby on the index, then sum on the groupby. This approach will work better if you have a MultiIndex.

>>> df.set_index('Hour', inplace=True)
>>> %timeit df.groupby(level=0).sum()
100 loops, best of 3: 3.19 ms per loop
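
A rough sketch of why grouping on index levels helps with a MultiIndex: with two levels you can sum per hour or per (day, hour) pair without resetting the index. The Day column and all of the data are hypothetical additions for illustration, and set_index is used without inplace so the original frame stays unchanged.

```python
import pandas as pd

# Hypothetical data with an extra 'Day' column to build a MultiIndex.
df = pd.DataFrame({
    "Day": ["Mon", "Mon", "Mon", "Tue", "Tue", "Tue"],
    "Hour": ["00", "01", "00", "00", "01", "01"],
    "values": [1, 2, 3, 4, 5, 6],
})

# set_index returns a new DataFrame here; df itself is not modified.
indexed = df.set_index(["Day", "Hour"])

# Sum over a single index level, collapsing the days together.
per_hour = indexed.groupby(level="Hour")["values"].sum()

# Sum over both levels, keeping each (day, hour) pair separate.
per_day_hour = indexed.groupby(level=["Day", "Hour"])["values"].sum()

print(per_hour)
print(per_day_hour)
```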

Upvotes: 3

Ami Tavory

Reputation: 76297

You can use a groupby+sum combination:

df['values'].groupby(df.Hour).sum()

In general, you should avoid explicit loops in pandas code: they give up the speed of the vectorized operations.
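
If the goal is what the loop in the question was reaching for, a 'change' column holding each row's hourly total, one loop-free way is groupby with transform. This is only a sketch on made-up data, not necessarily what either answer had in mind.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "Hour": ["00", "01", "00", "02", "01", "00"],
    "values": [1, 2, 3, 4, 5, 6],
})

# transform("sum") computes the sum per Hour group and broadcasts it back
# to every row of that group, so the result has the same length as df.
df["change"] = df.groupby("Hour")["values"].transform("sum")
print(df)
```

Unlike a plain sum, which returns one row per hour, transform keeps the original row order and shape, so no sorting or index bookkeeping is required.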

Upvotes: 2
