Tim Knutson
Tim Knutson

Reputation: 13

How to select multiple values from different columns of a pandas dataframe

I'm trying to select values that have broken the record high or low values. I'm comparing to a DataFrame that has the high and low values for each day as two separate columns. The end goal is to graph a scatterplot of the (date, value) that are the new record values against a line graph of the old record values (using matplotlib.)

Here's an example dataset.

new_data = {'Date': ['1/1/2015', '1/2/2015', '1/3/2015', '1/4/2015', '1/5/2015'],
        'new_low': [10, 25, 24, 21, 15],
        'new_high': [35, 37, 38, 55, 47]}


record_data = {'Day': ['1/1', '1/2', '1/3', '1/4', '1/5'],
           'record_low': [12, 28, 21, 25, 15],
           'record_high': [30, 40, 36, 57, 46]}

df_new = pd.DataFrame(new_data)
df_new.set_index('Date', inplace=True)

df_record = pd.DataFrame(record_data)
df_record.set_index('Day', inplace=True)

So it would look like this

           new_low   new_high (new_data)
Date            
1/1/2015     10         35
1/2/2015     25         37
1/3/2015     24         38
1/4/2015     21         55
1/5/2015     15         47


       record_low   record_high (record_data)
Date            
1/1       12           30
1/2       28           40
1/3       21           36
1/4       25           57
1/5       15           46

I want the result to look along this line.

       Date  Record Value
0  1/1/2015            10
1  1/2/2015            25
2  1/4/2015            21
3  1/1/2015            35
4  1/3/2015            38
5  1/5/2015            47

Since I need to use the result with matplotlib to make a scatterplot, I will need a list of x-values and y-values to enter. My example result was a dataframe that I made, but it doesn't need to be. I could use two separate arrays or even a list of tuples that I could unzip into lists of x and y.

I feel like there should be some simple/elegant way to do this with mapping, but I'm not experienced enough to find it and I haven't been able to find an answer elsewhere.

I'm also having some issues with how to enter the record data with just a month and day as a datestamp, so I've just set them all to the same year. It works for my visualization, but I would rather not do that to the data.

Upvotes: 1

Views: 1964

Answers (2)

djakubosky
djakubosky

Reputation: 1007

Edited to address comments

This is a solution assuming data is read in from a file and avoids merging the two dfs to compare them (note the reindex step).

# # skip the header and ensure the same naming of the columns
# # df_record has Date in format mon/day
df_record = pd.read_csv('record_data.tsv', sep='\t', 
                    skiprows=1, names=['Date','X', 'Y'], index_col = 'Date')

# #df_new has Date in format 'month/day/year'
df_new = pd.read_csv('new_data.tsv', sep='\t', skiprows=1, names=['Date','X', 'Y'])
df_new = df_new.set_index(df_new['Date'].apply(lambda x: "/".join(x.split('/')[:-1]))).drop('Date', axis = 1)

df_new = df_new.reindex(df_record.index)

# compare the columns
tdfX = (df_new['X'] < df_record['X'])
tdfY = (df_new['Y'] > df_record['Y'])

# get the data that is a new record
df_plot = pd.concat([df_new.loc[tdfY[tdfY].index, 'Y'], df_new.loc[tdfX[tdfX].index, 'X']]).to_frame('Record').reset_index()

Upvotes: 1

ALollz
ALollz

Reputation: 59579

There's probably a better answer out there, but you could merge the two DataFrames together, and then determine if the df_new value is a record by comparing the columns.

I wouldn't set the dates as an index, just keep them as a column. It makes it a little bit nicer. If they are your indices, then do this first:

import pandas as pd
df_new['Date'] = df_new.index
df_record['Day'] = df_record.index

Then:

df_new['day'] = pd.to_datetime(df_new.Date).dt.day
df_new['month'] = pd.to_datetime(df_new.Date).dt.month

df_record['day'] = pd.to_datetime(df_record.Day, format='%M/%d').dt.day
df_record['month'] = pd.to_datetime(df_record.Day, format='%M/%d').dt.month

Merge the DataFrames and drop the columns we no longer need:

df = df_new.merge(df_record, on=['month', 'day']).drop(columns=['month', 'day', 'Day'])

Then check if a value is a record. If so, create a new DataFrame with the record values:

record_low = df.X_x < df.X_y
record_high = df.Y_x > df.Y_y
pd.DataFrame({'Date': df[record_low]['Date'].tolist() + df[record_high]['Date'].tolist(), 
 'Record Value': df[record_low]['X_x'].tolist() + df[record_high]['Y_x'].tolist()})

    Date    Record Value
0   1/1/2015    10
1   1/2/2015    25
2   1/4/2015    21
3   1/1/2015    35
4   1/3/2015    38
5   1/5/2015    47

Upvotes: 0

Related Questions