Pandas - How to remove duplicates based on another series?

Question

I have a dataframe that contains three series called Date, Element, and Data_Value--their types are string, string, and numpy.int64 respectively. Date has dates in the form of yyyy-mm-dd; Element has strings that say either TMIN or TMAX, and it denotes whether the Data_Value is the minimum or maximum temperature of a particular date; lastly, the Data_Value series just represents the actual temperature.

The Date series has multiple duplicates of the same date. E.g. for the date 2005-01-01, there are 19 entries for the temperature column, with values that start at 28 and go all the way up to 156. I want to create a new dataframe with the date and the maximum temperature only--I'll eventually want one for TMIN values too, but I figure that if I can do one I can figure out the other. I'll post some pseudocode with explanation below to show what I've tried so far.

So far I have pulled in the csv and assigned it to a variable, df. Then I sorted the values by Date, Element and Temperature (Data_Value). After that, I created a variable called tmax that grabs the necessary dates (I only need the data from 2005-2014) that have 'TMAX' as its Element value. I cast tmax into a new DataFrame, reset its index to get rid of the useless index data from the first dataframe, and dropped the 'Element' column since it was redundant at this point. Now I'm (ultimately) trying to create a list of all the Temperatures for TMAX so that I can plot it with pyplot. But I can't figure out for the life of me how to reduce the dataframe to just the single date and max value for that date. If I could just get that then I could easily convert the series to a list and plot it.


    import pandas as pd

    def record_high_and_low_temperatures():
        #read in csv
        df = pd.read_csv('somedata.csv')

        #sort values so they're in a nice order
        df.sort_values(by=['Date', 'Element', 'Data_Value'], inplace=True) 

        # grab all entries for TMAX in correct date range
        tmax = df[(df['Element'] == 'TMAX') & (df['Date'].between("2005-01-01", "2014-12-31"))]

        # cast to dataframe
        tmax = pd.DataFrame(tmax, columns=['Date', 'Data_Value'])

        # Remove index column from previous dataframe
        tmax.reset_index(drop=True, inplace=True)

        # this is where I'm stuck, how do I get the max value per unique date? 
        max_temp_by_date = tmax.loc[tmax['Data_Value'].idxmax()]

Any and all help is appreciated, let me know if I need to clarify anything.

TL;DR: Ok... input dataframe looks like

date     | data_value
2005-01-01    28
2005-01-01    33
2005-01-01    33
2005-01-01    44
2005-01-01    56
2005-01-02    0
2005-01-02    12
2005-01-02    30
2005-01-02    28
2005-01-02    22

Expected df should look like:

date     | data_value
2005-01-01    79
2005-01-02    90
2005-01-03    88
2005-01-04    44
2005-01-05    63

I just want a dataframe that has each unique date coupled with the highest temperature on that day.

BStadlbauer · Accepted Answer

If I understand you correctly, what you want to do, as Grzegorz already suggested in the comments, is to group by date (collect all rows that share one date) and then take the maximum within each group:

    df.groupby('date').max()

This will take all your groups and reduce them to only one row, taking the maximum element of every group. In this case, max() is called the aggregation function of the group. As you mentioned that you will also need the minimum at some point, a nice way to do this (instead of two groupbys) is to do the following:

    df.groupby('date').agg(['max', 'min'])

which will pass over all groups once and apply both aggregation functions, max and min, returning two columns for each input column. More on aggregation can be found in the pandas groupby documentation.
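To tie this back to the columns in the question, here is a minimal sketch of the same idea using Date, Element and Data_Value. The sample rows are made up for illustration (they are not from the asker's CSV), and names such as in_range, extremes and temps are just placeholders.

```python
import pandas as pd

# Hypothetical sample rows that reuse the question's column names.
df = pd.DataFrame({
    "Date": ["2005-01-01", "2005-01-01", "2005-01-01",
             "2005-01-02", "2005-01-02", "2005-01-02"],
    "Element": ["TMAX", "TMAX", "TMIN", "TMAX", "TMIN", "TMAX"],
    "Data_Value": [28, 56, -10, 30, 0, 12],
})

# Keep only TMAX rows inside the 2005-2014 window.
# Comparing yyyy-mm-dd strings works because they sort chronologically.
in_range = df["Date"].between("2005-01-01", "2014-12-31")
tmax = df[(df["Element"] == "TMAX") & in_range]

# One row per date, holding the highest Data_Value seen on that date.
# as_index=False keeps Date as a regular column instead of the index.
max_temp_by_date = tmax.groupby("Date", as_index=False)["Data_Value"].max()

# Both aggregations in a single groupby, as in the agg() call above.
extremes = tmax.groupby("Date")["Data_Value"].agg(["max", "min"])

# Plain Python list of the daily maxima, ready to hand to pyplot.
temps = max_temp_by_date["Data_Value"].tolist()
```

Selecting the Data_Value column before aggregating avoids taking the max of the Element strings as well. Since groupby sorts the group keys by default, the explicit sort_values step from the question should not be needed just to get the dates in order.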
