Reputation: 71
I'm starting to believe that pandas DataFrames are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm just trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example since the original is too complicated to copy in here, but the principle is the same.
counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False,False,False,False,False,False,False,False,False,False,False,False,True]
I'm trying to create sums of the column Counter, but only under the following conditions:
- the sum should only be written at rows where Event is True;
- it should add up the Counter values of up to 10 previous rows;
- it should stop early if it reaches another row where Event is True.
To clarify my goal, this is the result I had in mind:
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
My attempt so far looks like this:
for index, row in df1.iterrows():
    if row["Event"] == True:
        counter = 1
        summ = 0
        while counter < 10 and row["Event"].shift(counter) == False:
            summ += row["Counter"].shift(counter)
            counter += 1
        else:
            df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there iterate backwards with a counter, summing up the Counter values as I go. However, it seems there is a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my beliefs and show me that Excel isn't actually superior.
Upvotes: 1
Views: 80
Reputation: 15432
You definitely can approach a task in pandas the way you would in Excel. Your approach needs to be tweaked a bit, though, because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
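For example, here's a quick sketch (using the df1 from the question) of what shift actually returns - a whole new Series with everything moved down a row, which is why calling it on the single bool from row["Event"] raises the AttributeError:
# shift(1) returns a new Series; row 0 becomes NaN and
# every other value slides down one position
df1["Event"].shift(1).head()
# 0      NaN
# 1     True
# 2    False
# 3    False
# 4    False
# Name: Event, dtype: object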
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    for index in range(df.shape[0]):
        summ = 0
        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break
                # increment by the counter
                summ += df.loc[index - backup, "Counter"]
            df.loc[index, "Sum"] = summ
    return df
This does the job:
In [15]: df1_sum1 = create_sum_column_loop(df1.copy()) # copy to preserve original
In [16]: df1_sum1
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing: df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
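For example, a one-line demo (not part of the solution) of that running total on the example frame:
df1.Counter.cumsum().tail(3)
# 17    153
# 18    171
# 19    190
# Name: Counter, dtype: int64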
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
Borrowing the clever solution from @YOBEN_S:
def create_sum_column_vectorized(df):
    n = 10
    s = (
        df.Counter
        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())
        # compute the rolling sum within each group
        .rolling(n + 1, min_periods=1).sum()
        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0, drop=True)
        # drop all non-event observations
        .where(df.Event)
    )
    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)
    return df
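If the grouping key feels opaque, it can help to print it on its own (a quick sanity check, not part of the function). Reading bottom-up, each Event starts a new group, so every block of rows that ends at an Event shares one label; sorted back into the original row order:
df1.Event.iloc[::-1].cumsum().sort_index().values
# array([3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])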
We can see that the result is the same as the one above (though the values are now floats, since the NaNs introduced by .where force the column to a float dtype):
In [23]: df1_sum2 = create_sum_column_vectorized(df1.copy()) # copy to preserve original
In [24]: df1_sum2
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0
The difference comes in the performance. In IPython or Jupyter we can use the %timeit magic to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Note that for specific operations, it is possible to speed things up further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0
        if event[index]:
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if event[index - backup]:
                    break
                # increment by the counter
                summ = summ + counter[index - backup]
            sum_col[index] = summ
    return sum_col

def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!
Upvotes: 1
Reputation: 323226
We need to create a subgroup key with cumsum, then do a rolling sum:
n = 10
s = df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
    rolling(n+1, min_periods=1).sum().\
    reset_index(level=0, drop=True).where(df1.Event)
df1['sum'] = (s - df1.Counter).fillna(0)
df1
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0
Upvotes: 4