Reputation: 25
I'm struggling to convert from a spreadsheet to Python using Pandas Data Frames.
I have some raw data:
Date Temperature
12/4/2003 100
12/5/2003 101
12/8/2003 100
12/9/2003 102
12/10/2003 101
12/11/2003 100
12/12/2003 99
12/15/2003 98
12/16/2003 97
12/17/2003 96
12/18/2003 95
12/19/2003 96
12/22/2003 97
12/23/2003 98
12/24/2003 99
12/26/2003 100
12/29/2003 101
In a spreadsheet, I am following a trend based on a %monitor. Think of it as a rolling average but based on %.
Output from spreadsheet:
date temp monitor trend change_in_trend
12/4/2003 100 97.00 warming false
12/5/2003 101 97.97 warming false
12/8/2003 100 97.97 warming false
12/9/2003 102 98.94 warming false
12/10/2003 101 98.94 warming false
12/11/2003 100 98.94 warming false
12/12/2003 99 98.94 warming false
12/15/2003 98 98.94 cooling true
12/16/2003 97 98.94 cooling false
12/17/2003 96 98.88 cooling false
12/18/2003 95 97.85 cooling false
12/19/2003 96 97.85 cooling false
12/22/2003 97 97.85 cooling false
12/23/2003 98 97.85 warming true
12/24/2003 99 97.85 warming false
12/26/2003 100 97.85 warming false
12/29/2003 101 97.97 warming false
Assumptions:
percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor
In my spreadsheet, I setup the columns in the first row as:
monitor = temp * warming_factor
trending = warming
change_in_trend = false
All remaining rows are derived based on current and previous row's column values.
monitor column logic:
if temp > prev_monitor:
if temp > prev_temp:
if temp * warming_factor > prev_monitor:
monitor = temp*warming_factor
else:
monitor = prev_monitor
else:
monitor = prev_monitor
else:
if temp < prev_monitor:
if temp * cooling_factor < prev_monitor:
monitor = temp * cooling_factor
else:
monitor = prev_monitor
else:
monitor = prev_monitor
trending column logic:
if temp > prev_monitor:
trending = warming
else:
trending = cooling
change in trend column logic:
if current_trend - previous_trend:
change_in_trend = false
else:
change in trend = true
I was able to iterate through a dataframe and apply the logic with no problem. However, performance is horrible over thousands of rows.
I've been trying to do this in a more 'pandas' like way but have failed in every attempt.
Without embarrassing myself by pasting my code attempts, is there anyone that could offer me some help?
Thanks in advance!
Upvotes: 2
Views: 139
Reputation: 215
Because you are looking to just move this over to Python and are not set up Pandas in particular, I went for a non-pandas approach. I used your example rows and did 47124
rows in 0.182
seconds.
Pandas is really good and intuitive for some use cases, but can get extremely slow for iteration. This page explains some of the slower uses of Pandas one of them mainly being index iteration. A pandas-ey way to do this would to be to take advantage of 5. Vectorization with NumPy arrays
, but your use case seems straightforward enough that this might be overkill and might not be worth it (given that your name is PythonNoob).
For the sake of both clarity and speed, a simple use of the more basic python functions can get you the speed you want.
First I set up the constants
percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor
Then (for ease of use, there are cleaner ways to do this but this is very clear) I set the names of the columns corresponding to the column values:
DATE = 0
TEMP = 1
MONITOR = 2
TRENDING = 3
CHANGE_IN_TREND = 4
Then I pulled out your monitor code in its own function (and cleaned up the if
-statements a bit:
def calculate_monitor(prev_monitor, current_temp, prev_temp):
if (current_temp > prev_monitor) and (current_temp > prev_temp) and (current_temp * warming_factor) > prev_monitor:
return current_temp * warming_factor
elif (current_temp < prev_monitor) and ((current_temp * cooling_factor) < prev_monitor):
return current_temp * cooling_factor
else:
return prev_monitor
Finally, I read in the code and processed it:
data = [] # I am going to append everything to this
with open('weather_data.csv') as csv_file:
previous_row = None
csv_reader = csv.reader(csv_file, delimiter=' ')
line_count = 0
for row in csv_reader:
cleaned_row = list(filter(None, row))
if line_count == 0:
# first row is column -- I am leaving it blank you can do whatever you want with it
line_count += 1
elif line_count == 1: # this is the first line
previous_row = cleaned_row + [float(cleaned_row[TEMP]) * warming_factor, "warming", False]
data.append(previous_row)
line_count += 1
else:
monitor = calculate_monitor(float(previous_row[MONITOR]), float(cleaned_row[TEMP]), float(previous_row[TEMP]))
current_trend = 'warming' if float(cleaned_row[TEMP]) > float(previous_row[MONITOR]) else 'cooling'
change_in_trend = False if current_trend != previous_row[CHANGE_IN_TREND] else True
previous_row = cleaned_row + [monitor, current_trend, change_in_trend]
data.append(previous_row)
line_count += 1
That will get you the speed you need. If you want to convert this to a pandas dataframe at the end you can do:
df = pd.DataFrame(data, columns=['date', 'temp', 'monitor', 'current_trend', 'change_in_trend'])
Upvotes: 1