Reputation: 61
I have a dataframe (called df) that looks like this:
I'm trying to take all weekend 'Volume' values (the ones where column 'WEEKDAY'=5 (saturday) or 6(sunday)) and sum them to the subsequent monday(WEEKDAY=0).
I tried a few things but nothing really worked, taking an example from the last three rows:
What I'm expecting is this:
To reproduce the problem:
!wget https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv
df = pd.read_csv('volume_por_dia.csv').sort_values('Datas',ascending=True)
df['Datas'] = pd.to_datetime(df['Datas'])
df = df_volume_noticias.set_index('Datas')
df['WEEKDAY'] = df.index.dayofweek
df
Upvotes: 6
Views: 662
Reputation: 96
Adding 2 solutions here:
Using pd.shift
(pointed out earlier by Lukas Hestermeyer; I've added a simplified version)
Using rolling window (this is literally a one-liner)
Both solutions assume that;
Dates
are sorted in ascending order (if not, sorting should be done before proceeding)Part 1 | Data Prep:
import pandas as pd
import numpy as np
# STEP 1: Create DF
Datas = [
'2019-07-02',
'2019-07-03',
'2019-07-04',
'2019-07-05',
'2019-07-06',
'2019-07-07',
'2019-07-08',
'2022-03-10',
'2022-03-11',
'2022-03-12',
'2022-03-13',
'2022-03-14'
]
Volume = [17, 30, 20, 21, 5, 10, 12, 24, 18, 4, 1, 5]
WEEKDAY = [1, 2, 3, 4, 5, 6, 0, 3, 4, 5, 6, 0]
dic = {'Datas': Datas, 'Volume': Volume, 'WEEKDAY': WEEKDAY}
df['Datas'] = pd.to_datetime(df['Datas'])
df = df.set_index('Datas')
df = pd.DataFrame(dic)
Part 2 | Solutions:
Solution 1 [pd.shift] :
# STEP 1: add shifts
df['shift_1'] = df['Volume'].shift(1)
df['shift_2'] = df['shift_1'].shift(1)
# STEP 2: sum Volume with shifts where weekday==0
cols_to_sum = ['Volume', 'shift_1', 'shift_2']
df['Volume'] = df[['WEEKDAY'] + cols_to_sum].apply(lambda x: int(x[1]) if x[0] else int(x[1] + x[2] + x[3]), axis=1)
df = df.drop(['shift_1', 'shift_2'], axis=1)
df
Solution 2 [rolling window] :
# use rolling window of size 3 to sum where weekday == 0
df['Volume'] = np.where(
df['WEEKDAY'] == 0,
df['Volume'].rolling(window=3, center=False).sum(),
df['Volume']
)
df
Part 3 | Removing weekend records :
df = df.loc[~df['WEEKDAY'].isin([5, 6])]
df
Upvotes: 1
Reputation: 2724
Based on the code you provided, you have loaded a CSV file into a DataFrame df, converted the 'Datas' column to a datetime, sorted the DataFrame by date in ascending order, and set the 'Datas' column as the index.
You have also created a new column 'WEEKDAY' that contains the day of the week (0-6, where 0 is Monday and 6 is Sunday) for each date in the index.
import pandas as pd
# Load the data and convert the 'Datas' column to a datetime
df = pd.read_csv('https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv')
df['Datas'] = pd.to_datetime(df['Datas'])
# Set the 'Datas' column as the index
df.set_index('Datas', inplace=True)
# Compute the sum of weekend days (Saturday and Sunday) and assign it to the next following Monday
weekend_sum = df.loc[df.index.weekday.isin([5,6])]['Volume'].resample('W-MON').sum()
df.loc[weekend_sum.index, 'Volume'] += weekend_sum.values
Explanation:
df.loc[df.index.weekday.isin([5,6])]
selects rows where the index (i.e., the dates) falls on a Saturday or Sunday (weekday 5 or 6, respectively).
['Volume'].resample('W-MON').sum()
computes the sum of the 'Volume' column for each week starting on Monday that contains at least one weekend day. The result is a Series where the index contains the start date of each week and the values contain the corresponding sums.
df.loc[weekend_sum.index, 'Volume'] += weekend_sum.values
assigns the computed sums to the next following Monday. It selects the rows corresponding to the start dates of the weeks with weekend sums (weekend_sum.index)
and adds the corresponding sums (weekend_sum.values)
to the 'Volume' column. Note that the +=
operator is used to modify the original DataFrame df.
Upvotes: 0
Reputation: 356
If you consider that weeks start from, for example, Tuesday, the problem becomes simpler. You just need to get the values of the weekend and sum it to the Monday of that week (which will be the Monday after the weekend). This will automatically handle cases in which you data might start/end on a weekend or not.
import numpy as np
import pandas as pd
np.random.seed(1)
# Sample data
dates = pd.date_range('2018-02-05', '2018-07-22', freq='D')
volume = np.random.randint(1, 50, len(dates))
df = pd.DataFrame(dict(Datas=dates, Volume=volume))
df = df.set_index('Datas')
# Week starting from Tuesday
week = ((df.index - pd.DateOffset(days=1)).isocalendar().week).values
def add_weekend_to_monday(week):
monday = week.index.weekday == 0
weekend = week.index.weekday >= 5
week[monday] += week[weekend].sum()
return week
df['Volume'] = df.groupby(week)['Volume'].apply(add_weekend_to_monday)
Upvotes: 0
Reputation: 2701
Input:
!wget https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv
import pandas as pd
import numpy as np
df = pd.read_csv('volume_por_dia.csv').sort_values('Datas',ascending=True)
df['Datas'] = pd.to_datetime(df['Datas'])
df.set_index('Datas', inplace=True)
df['WEEKDAY'] = df.index.dayofweek
I assume that the index dates are sorted, the Datas
index is unique and that there are no missing days. Some assumptions I can't make are instead:
For these reasons, before computing weekend volumes, I first extract the dates of the first Saturday and last Monday:
first_saturday = df.index[df.WEEKDAY==5][0]
last_monday = df.index[df.WEEKDAY==0][-1]
Now I can extract weekend volumes being sure that I always have the saturday-sunday couple and that for each of these couple, a following monday exists in the dataframe:
df_weekend = df.loc[
(df.WEEKDAY.isin([5,6]))&
(df.index<=last_monday)&
(df.index>=first_saturday)
]
df_weekend
Now, since I have couples of saturday-sunday volumes, I can compute the sums in the following way:
weekend_volumes = pd.Series(
df_weekend.Volume.values.reshape(-1,2).sum(axis=1), #sum of volume couples
index = df_weekend.index[1::2]+pd.Timedelta("1d"), #date of the following monday
name="weekend_volume"
).reindex(df.index).fillna(0) #zero weekend-volume for days that are not mondays
weekend_volumes
Finally add the weekend-volumes to the starting volumes:
df["Volume"] = df.Volume+weekend_volumes
I am attaching the last 25 lines of the df below:
# 2022-02-18 16.0 4
# 2022-02-19 2.0 5
# 2022-02-20 1.0 6
# 2022-02-21 10.0 0
# 2022-02-22 43.0 1
# 2022-02-23 36.0 2
# 2022-02-24 38.0 3
# 2022-02-25 28.0 4
# 2022-02-26 5.0 5
# 2022-02-27 3.0 6
# 2022-02-28 14.0 0
# 2022-03-01 10.0 1
# 2022-03-02 16.0 2
# 2022-03-03 18.0 3
# 2022-03-04 11.0 4
# 2022-03-05 8.0 5
# 2022-03-06 2.0 6
# 2022-03-07 32.0 0
# 2022-03-08 18.0 1
# 2022-03-09 32.0 2
# 2022-03-10 24.0 3
# 2022-03-11 18.0 4
# 2022-03-12 4.0 5
# 2022-03-13 1.0 6
# 2022-03-14 10.0 0
Upvotes: 0
Reputation: 1456
You can simply loop over the rows and keep accumulating volumes from Friday, and update the value in the volume of Sunday. Then, just drop the rows of Friday and Saturday.
values = df.values
volume_accumulated = 0
for idx, row in enumerate(values):
if row[1] in (5, 6):
volume_accumulated += row[0]
elif row[1] == 0:
volume_accumulated += row[0]
df["Volume"][idx] = volume_accumulated
else:
volume_accumulated = 0
df = df[~df["WEEKDAY"].isin([5, 6])]
Upvotes: 0
Reputation: 179
I used .groupby to solve the problem.
import pandas as pd
df = pd.read_csv('volume_por_dia.csv')
df['Datas'] = pd.to_datetime(df['Datas'])
df['WEEKDAY'] = df['Datas'].dt.dayofweek
df['index'] = df['Datas']
# Group df by date, setting frequency as week
#(beginning Tue - so that Sat and Sun will be added to the next Mon)
df_group = df.groupby([pd.Grouper(key = 'Datas', freq='W-MON'), \
'WEEKDAY', 'index']).agg({'Volume': 'sum'})
# In each group, add days 5, 6 (Sat and Sun) to day 0 (Mon)
df_group.loc[(slice(None), 0), 'Volume'] += \
df_group.loc[(slice(None), [5, 6]), 'Volume'].groupby(level=0).sum()
# In the grouped data, remove Sat and Sun
df_group = df_group.reset_index()
df_group = df_group[df_group['WEEKDAY'] != 5]
df_group = df_group[df_group['WEEKDAY'] != 6]
# Remove volume data from original df, and merge with volume from df_group
df = df.drop(['Volume'], axis=1)
df = pd.merge(df,df_group[['index','Volume']],on='index', how='left')
df = df.dropna(subset=['Volume'])
df = df.drop(['index'], axis=1)
# Optional: sort dates in ascending order
df = df.sort_values(by=['Datas'])
print (df)
Upvotes: 0
Reputation: 1037
This solves your question using pd.shift.
import pandas as pd
df['prior_volume'] = df.Volume.shift(1)
df['prior_volume2'] = df.Volume.shift(2)
df.loc[df['WEEKDAY'] == 0, 'Volume'] = df.loc[df['WEEKDAY'] == 0, 'prior_volume'] + \
df.loc[df['WEEKDAY'] == 0, 'prior_volume2'] + \
df.loc[df['WEEKDAY'] == 0, 'Volume']
df = df[df['WEEKDAY'].isin(range(5))]
df = df[['Volume', 'WEEKDAY']]
df.head(10)
Upvotes: 0