teach me the pandas dataframe way to do this?

Question

I'm struggling to convert from a spreadsheet to Python using Pandas Data Frames.

I have some raw data:

Date        Temperature
12/4/2003   100
12/5/2003   101
12/8/2003   100
12/9/2003   102
12/10/2003  101
12/11/2003  100
12/12/2003  99
12/15/2003  98
12/16/2003  97
12/17/2003  96
12/18/2003  95
12/19/2003  96
12/22/2003  97
12/23/2003  98
12/24/2003  99
12/26/2003  100
12/29/2003  101

In a spreadsheet, I am following a trend based on a %monitor. Think of it as a rolling average but based on %.

Output from spreadsheet:

date         temp   monitor   trend        change_in_trend
12/4/2003    100    97.00      warming      false
12/5/2003    101    97.97      warming      false
12/8/2003    100    97.97      warming      false
12/9/2003    102    98.94      warming      false
12/10/2003   101    98.94      warming      false
12/11/2003   100    98.94      warming      false
12/12/2003    99    98.94      warming      false
12/15/2003    98    98.94      cooling      true
12/16/2003    97    98.94      cooling      false
12/17/2003    96    98.88      cooling      false
12/18/2003    95    97.85      cooling      false
12/19/2003    96    97.85      cooling      false
12/22/2003    97    97.85      cooling      false
12/23/2003    98    97.85      warming      true
12/24/2003    99    97.85      warming      false
12/26/2003   100    97.85      warming      false
12/29/2003   101    97.97      warming      false

Assumptions:

percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor

In my spreadsheet, I setup the columns in the first row as:

monitor = temp * warming_factor
trending = warming
change_in_trend = false

All remaining rows are derived based on current and previous row's column values.

monitor column logic:

if temp > prev_monitor:
    if temp > prev_temp:
        if temp * warming_factor > prev_monitor:
            monitor = temp*warming_factor
        else:
            monitor = prev_monitor
    else:
        monitor = prev_monitor
else:
    if temp < prev_monitor:
        if temp * cooling_factor < prev_monitor:
            monitor = temp * cooling_factor
        else:
            monitor = prev_monitor
    else:
        monitor = prev_monitor

trending column logic:

if temp > prev_monitor:
    trending = warming
else:
    trending = cooling

change in trend column logic:

if current_trend - previous_trend:
    change_in_trend = false
else:
    change in trend = true

I was able to iterate through a dataframe and apply the logic with no problem. However, performance is horrible over thousands of rows.

I've been trying to do this in a more 'pandas' like way but have failed in every attempt.

Without embarrassing myself by pasting my code attempts, is there anyone that could offer me some help?

Thanks in advance!

Aur · Accepted Answer

Because you are looking to just move this over to Python and are not set up Pandas in particular, I went for a non-pandas approach. I used your example rows and did 47124 rows in 0.182 seconds.

Pandas is really good and intuitive for some use cases, but can get extremely slow for iteration. This page explains some of the slower uses of Pandas one of them mainly being index iteration. A pandas-ey way to do this would to be to take advantage of 5. Vectorization with NumPy arrays, but your use case seems straightforward enough that this might be overkill and might not be worth it (given that your name is PythonNoob).

For the sake of both clarity and speed, a simple use of the more basic python functions can get you the speed you want.

First I set up the constants

percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor

Then (for ease of use, there are cleaner ways to do this but this is very clear) I set the names of the columns corresponding to the column values:

DATE = 0
TEMP = 1
MONITOR = 2
TRENDING = 3
CHANGE_IN_TREND = 4

Then I pulled out your monitor code in its own function (and cleaned up the if-statements a bit:

def calculate_monitor(prev_monitor, current_temp, prev_temp):
     if (current_temp > prev_monitor) and (current_temp > prev_temp) and (current_temp * warming_factor) > prev_monitor:
            return current_temp * warming_factor
        elif (current_temp < prev_monitor) and ((current_temp * cooling_factor) < prev_monitor):
            return current_temp * cooling_factor
        else:
            return prev_monitor

Finally, I read in the code and processed it:

data = [] # I am going to append everything to this
with open('weather_data.csv') as csv_file:
    previous_row = None
    csv_reader = csv.reader(csv_file, delimiter=' ')
    line_count = 0
    for row in csv_reader:
        cleaned_row = list(filter(None, row))
        if line_count == 0:
            # first row is column -- I am leaving it blank you can do whatever you want with it
            line_count += 1
        elif line_count == 1: # this is the first line
            previous_row = cleaned_row + [float(cleaned_row[TEMP]) * warming_factor, "warming", False]
            data.append(previous_row)
            line_count += 1
        else:
            monitor = calculate_monitor(float(previous_row[MONITOR]), float(cleaned_row[TEMP]), float(previous_row[TEMP]))
            current_trend = 'warming' if float(cleaned_row[TEMP]) > float(previous_row[MONITOR]) else 'cooling'
            change_in_trend = False if current_trend != previous_row[CHANGE_IN_TREND] else True
            previous_row = cleaned_row + [monitor, current_trend, change_in_trend]
            data.append(previous_row)
            line_count += 1

That will get you the speed you need. If you want to convert this to a pandas dataframe at the end you can do:

df = pd.DataFrame(data, columns=['date', 'temp', 'monitor', 'current_trend', 'change_in_trend'])

teach me the pandas dataframe way to do this?

Answers (1)

Related Questions