Reputation: 301
Hi all, the image shows the data frame I am working with to learn Python.
From this data frame, I am trying to find the rows that fall on the last day of December for each year. My objective is to keep the records highlighted in yellow and remove the white rows.
For example, for the year 2010, I want to keep only the 3rd record and remove rows 1 to 2. For the year 2011, I want to remove rows 4 to 7 and keep row 8.
Below is the code I have written. I intend to use a loop to find the records I want to keep and remove the rest.
Filtering by month works: I keep the December records and remove those from January to November.
However, for days (the last 3 lines of code), I realized that the last recorded day in the data frame is not always the 31st, so I cannot use the same logic to remove the other rows.
Is there a better way to find the last day of the month in the data frame and remove the rest?
Thanks
import pandas as pd

amalgamate = pd.read_excel("amalgamate.xlsx")
# Create three extra columns to separate Year, Month and Day.
amalgamate["Date"] = pd.to_datetime(amalgamate["Date"], errors="raise", format="%Y-%m-%d")
amalgamate["Year"] = amalgamate["Date"].dt.year
amalgamate["Month"] = amalgamate["Date"].dt.month
amalgamate["Day"] = amalgamate["Date"].dt.day
listofMonth = amalgamate.Month.unique()
listofDay = amalgamate.Day.unique()
# Loop through the records and remove records that are not Dec for each year
for eachmonth in listofMonth:
    if eachmonth != 12:
        amalgamate = amalgamate[amalgamate.Month != eachmonth]
# Loop through the records and remove records that are not 31 for each month
for eachday in listofDay:
    if eachday != 31:
        amalgamate = amalgamate[amalgamate.Day != eachday]
Upvotes: 2
Views: 5808
Reputation: 23217
As you mentioned at the beginning of the question that you want to find the last day of Dec for each year, you can group the dates by year and get the last entry within each group with GroupBy.last(), as follows:
df.groupby(df['Date'].dt.year, as_index=False).last()
If you also want to find the last day of each month (as you mentioned at the end of the question), you can group the dates by year and month and get the last entry within each group with GroupBy.last(), as follows:
df.groupby([df['Date'].dt.year, df['Date'].dt.month], as_index=False).last()
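One thing to keep in mind with the two snippets above: GroupBy.last() depends on the row order and returns the last non-null value of each column separately, so it only gives the latest date if the frame is already sorted by Date. Below is a minimal sketch of a variant that sorts first and keeps whole rows with tail(1). The sample data and the 'Value' column are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data; the 'Value' column is an assumption for illustration.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-30", "2010-11-30", "2010-12-29",
                            "2011-12-28", "2011-06-30", "2011-12-30"]),
    "Value": [3, 1, 2, 5, 4, 6],
})

# Sort so that the latest date of each group is its last row.
df_sorted = df.sort_values("Date")

# Group keys, renamed so they do not clash with the 'Date' column.
year = df_sorted["Date"].dt.year.rename("year")
month = df_sorted["Date"].dt.month.rename("month")

# tail(1) keeps the complete last row of each group, with its original index.
last_per_year = df_sorted.groupby(year).tail(1)
last_per_month = df_sorted.groupby([year, month]).tail(1)

print(last_per_year)
print(last_per_month)
```

If a year has no December rows at all, last_per_year still returns the latest row of that year, so add a month filter if only December is wanted.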
Upvotes: 2
Reputation: 9619
Here is a one-liner that keeps the last recorded day of each month by grouping on Date with pd.Grouper set to a one-month frequency, then selecting the row with the latest date in each group:
df.loc[df.groupby(pd.Grouper(key='Date', freq='1M')).Date.idxmax()]
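Two caveats on the one-liner, as far as I know: recent pandas releases (2.2 and later) deprecate the 'M' offset alias in favour of 'ME', and a frequency-based Grouper can also produce bins for months that have no rows at all. A sketch that sidesteps both by grouping on a monthly period key instead is shown here. The sample data and the 'Value' column are assumptions, not taken from the question.

```python
import pandas as pd

# Hypothetical sample data; the 'Value' column is an assumption for illustration.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-29", "2010-12-30", "2011-02-25",
                            "2011-02-28", "2011-12-30"]),
    "Value": [1, 2, 3, 4, 5],
})

# One key per calendar month that actually occurs in the data.
month_key = df["Date"].dt.to_period("M")

# Index label of the latest date within each month, then select those rows.
last_idx = df.groupby(month_key)["Date"].idxmax()
last_of_month = df.loc[last_idx]

# Keep only December if that is all you need.
last_of_dec = last_of_month[last_of_month["Date"].dt.month == 12]

print(last_of_month)
print(last_of_dec)
```

Because the key is built from the dates that are present, no empty months appear in the grouping.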
Upvotes: 4
Reputation: 1179
You can use pandas groupby to find the last (i.e., max) month and last day per year, then merge dataframes to filter only the rows that have the last month and day. Just as you don't need to assume that the last day of Dec in your data is 31, you don't have to assume that the last month in the year in your data is Dec. There are multiple ways to do it, and you could do the steps below in a different order. Here's one that I think may be easiest to follow:
import pandas as pd

row1list = [2010, 12, 28]
row2list = [2010, 12, 20]
row3list = [2011, 11, 20]
row4list = [2011, 11, 15]
row5list = [2011, 10, 30]
df = pd.DataFrame([row1list, row2list, row3list, row4list, row5list], columns=['year', 'month', 'day'])
# find last day for every combo of year, month
df_last_day_per_year_month = df.groupby(['year', 'month'], as_index=False).agg({'day': 'max'})
# find last month for every year, using only the rows with max day per year, month
df_last_month_per_year = df_last_day_per_year_month.groupby('year', as_index=False).agg({'month': 'max'})
# keep only the last month by comparing month values to last month per year
df_last_month_per_year = df_last_month_per_year.rename(columns={'month':'last_month'})
df_last_day_per_year_month = df_last_day_per_year_month.merge(df_last_month_per_year, on='year', how='left')
df_last_day_per_year_month = df_last_day_per_year_month[df_last_day_per_year_month['month'] == df_last_day_per_year_month['last_month']]
# don't need 'last_month' column anymore so delete it
del df_last_day_per_year_month['last_month']
# inner merge to filter original df to keep only the dates that are max month, day per year
df = df.merge(df_last_day_per_year_month, on=['year', 'month', 'day'], how='inner')
print(df)
#    year  month  day
# 0  2010     12   28
# 1  2011     11   20
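For comparison, here is a shorter sketch that should give the same result on this sample data by sorting and then keeping the last row per year. It is an alternative to the merge steps above, not a restatement of them: if the latest date appears more than once, drop_duplicates keeps only one of those rows, whereas the inner merge keeps all of them.

```python
import pandas as pd

# Same sample data as in the merge-based example.
df = pd.DataFrame(
    [[2010, 12, 28], [2010, 12, 20], [2011, 11, 20], [2011, 11, 15], [2011, 10, 30]],
    columns=["year", "month", "day"],
)

# Sort chronologically, then keep the last (latest) row for each year.
latest_per_year = (
    df.sort_values(["year", "month", "day"])
      .drop_duplicates("year", keep="last")
      .reset_index(drop=True)
)

print(latest_per_year)
```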
Upvotes: 1