Reputation: 6291
I have hourly CSV data, sorted day by day like this, for hundreds of days:
2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
I want to count, for each hour, how many times the daily maximum value occurred at that hour: if the maximum of the day 2011.05.16 fell on 00:00, I add 1 to 00:00, and so on. To do this I used a loop that treats hours as indexes, like this:
def graph():
    Date, Time, High = np.genfromtxt(myPath, delimiter=",",
                                     unpack=True, converters={0: date_converter})
    numList = [""] * 24
    index = 0
    hour = 0
    count = [0] * 24
    for eachHour in Time:
        numList[hour] += str(High[index])
        index += 1
        hour += 1
        if hour == 24:
            higher = numList.index(max(numList))
            count[higher] += 1
            hour = 0
            numList = [""] * 24
The problem is that my data often has gaps with some hours missing, but the loop can't recognize them and keeps putting values into the next hour index. I've searched everywhere, but I'm new to programming and this is my first "complex" task, so I need answers specific to my case to understand how it works. How do you make an hourly frequency count like the one explained? The final result should be like:
00:00 n time max of the day
01:00 n time max of the day
02:00 n time max of the day
etc
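The intended counting described above can be sketched without a running index by keying each value on the hour parsed from the row itself, so missing hours simply stay absent instead of shifting later readings (`hourly_max_counts` is an illustrative name, not from the question):

```python
from collections import defaultdict
import csv

def hourly_max_counts(lines):
    """Count, for each hour 0-23, how often the daily maximum fell on that hour."""
    daily = defaultdict(dict)                    # date -> {hour: value}
    for date, time, value in csv.reader(lines):
        daily[date][int(time.split(':')[0])] = float(value)
    counts = [0] * 24
    for hours in daily.values():
        counts[max(hours, key=hours.get)] += 1   # hour holding that day's max
    return counts
```

It accepts any iterable of CSV lines, e.g. `hourly_max_counts(open(myPath))`, and days with gaps contribute exactly one count, at the hour that actually held the maximum.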
Upvotes: 5
Views: 7009
Reputation: 375415
First read in the csv:
In [11]: df = pd.read_csv('foo.csv', sep=',', header=None, parse_dates=[[0, 1]])
In [12]: df.columns = ['date', 'val']
In [13]: df.set_index('date', inplace=True)
In [14]: df
Out[14]:
val
date
2011-05-16 00:00:00 1.40893
2011-05-16 01:00:00 1.40760
2011-05-16 02:00:00 1.40750
2011-05-16 03:00:00 1.40649
Use resample to get each day's maximum:
In [15]: day_max = df.resample('D', how='max')
Check whether each value is the day max:
In [16]: df['is_day_max'] = day_max.lookup(df.index.normalize(), len(df) * ['val']) == df.val
In [17]: df
Out[17]:
val is_day_max
date
2011-05-16 00:00:00 1.40893 True
2011-05-16 01:00:00 1.40760 False
2011-05-16 02:00:00 1.40750 False
2011-05-16 03:00:00 1.40649 False
And then sum these over each hour:
In [18]: df.groupby(df.index.time)['is_day_max'].sum()
Out[18]:
00:00:00 1
01:00:00 0
02:00:00 0
03:00:00 0
Name: is_day_max, dtype: float64
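In recent pandas versions both `resample(..., how='max')` and `DataFrame.lookup` have been removed, so here is a sketch of the same steps on modern pandas (the column names `d`, `t`, `val` are illustrative; the sample data is the question's):

```python
import io
import pandas as pd

s = """2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649"""

# Build a DatetimeIndex by combining the date and time columns explicitly.
df = pd.read_csv(io.StringIO(s), header=None, names=['d', 't', 'val'])
df.index = pd.to_datetime(df.pop('d') + ' ' + df.pop('t'), format='%Y.%m.%d %H:%M')

# Daily maximum, broadcast back to every row of that day via the normalized index.
day_max = df['val'].resample('D').max()
df['is_day_max'] = df['val'].to_numpy() == day_max.reindex(df.index.normalize()).to_numpy()

# Count, per hour of day, how often the daily max fell there.
counts = df.groupby(df.index.time)['is_day_max'].sum()
```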
Upvotes: 5
Reputation: 139152
A solution with pandas: supposing you have a dataframe with the date as index, you can first add a column indicating whether each value is the maximum of its day, and then group by hour and sum the occurrences:
In [32]: df['daily_max'] = df.groupby(df.index.date).transform(lambda x: x==x.max())
In [33]: df
Out[33]:
value daily_max
date_time
2011-05-16 00:00:00 1.40893 True
2011-05-16 01:00:00 1.40760 False
2011-05-16 02:00:00 1.40750 False
2011-05-16 03:00:00 1.40649 False
2011-05-17 02:00:00 1.40893 True
2011-05-17 03:00:00 1.40760 False
2011-05-17 04:00:00 1.40750 False
2011-05-17 05:00:00 1.40649 False
2011-05-18 02:00:00 1.40893 True
2011-05-18 03:00:00 1.40760 False
2011-05-18 04:00:00 1.40750 False
2011-05-18 05:00:00 1.40649 False
In [34]: df.groupby(df.index.time)['daily_max'].sum()
Out[34]:
00:00:00 1
01:00:00 0
02:00:00 2
03:00:00 0
04:00:00 0
05:00:00 0
Name: daily_max, dtype: float64
With an older pandas version, this gives the same result as above (assuming your df has a DatetimeIndex):
df['date'] = [t.date() for t in df.index.to_pydatetime()]
df['time'] = [t.time() for t in df.index.to_pydatetime()]
df['daily_max'] = df.groupby('date')['value'].transform(lambda x: x==x.max())
df.groupby('time')['daily_max'].sum()
The dataframe I used in this example:
from io import StringIO  # on Python 2: from StringIO import StringIO
s="""2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649
2011.05.18,02:00,1.40893
2011.05.18,03:00,1.40760
2011.05.18,04:00,1.40750
2011.05.18,05:00,1.40649"""
df = pd.read_csv(StringIO(s), header=None, names=['date', 'time', 'value'], parse_dates=[['date', 'time']])
df = df.set_index('date_time')
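On recent pandas the nested-list `parse_dates` form and the bare `.transform` assignment used above are deprecated or removed, so here is a consolidated, runnable version of the same steps (same sample data; the date parsing is done explicitly instead):

```python
import io
import pandas as pd

s = """2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649
2011.05.18,02:00,1.40893
2011.05.18,03:00,1.40760
2011.05.18,04:00,1.40750
2011.05.18,05:00,1.40649"""

# Read, then build the combined timestamp index by hand.
df = pd.read_csv(io.StringIO(s), header=None, names=['date', 'time', 'value'])
df.index = pd.to_datetime(df.pop('date') + ' ' + df.pop('time'),
                          format='%Y.%m.%d %H:%M')
df.index.name = 'date_time'

# Flag each day's maximum, then count occurrences per hour of day.
df['daily_max'] = df.groupby(df.index.date)['value'].transform(lambda x: x == x.max())
result = df.groupby(df.index.time)['daily_max'].sum()
```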
Upvotes: 4
Reputation: 125
I'm not sure exactly what you are trying to calculate, but this is how I would approach doing calculations on the values in your data set:
from time import strptime, strftime

time_format = "%H:%M"
date_format = "%Y.%m.%d"

def date_values(flo):
    for line in flo:
        try:
            date_str, time_str, value = line.split(',')
            date = strptime(date_str, "%Y.%m.%d")
            time = strptime(time_str, "%H:%M")
            value = float(value)
            yield (date, time, value)
        except ValueError:
            pass

def day_values(flo):
    days = {}
    for date, time, value in date_values(flo):
        try:
            days[date].append(value)
        except KeyError:
            days[date] = [value]
    return days

if __name__ == '__main__':
    from sys import stdin
    for day, values in day_values(stdin).items():
        print("{0}: {1} (max of {2})".format(
            strftime(date_format, day),
            values,
            max(values)))
The date_values function returns a generator that iterates over the valid lines of your input. The day_values function uses that generator to build a dictionary whose keys are the dates and whose values are lists of the readings for that date. From your description I'm not sure how the time comes into play; if you can clarify, or provide a larger data set with an example of the result you would like to see, I can expand on this.
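For example, feeding a couple of sample lines through a condensed, self-contained variant of the two functions (using io.StringIO as a stand-in for stdin; `setdefault` replaces the try/except, behavior is the same):

```python
import io
from time import strptime

def date_values(flo):
    # Yield (date, time, value) for each well-formed CSV line; skip bad lines.
    for line in flo:
        try:
            d, t, v = line.split(',')
            yield strptime(d, "%Y.%m.%d"), strptime(t, "%H:%M"), float(v)
        except ValueError:
            pass

def day_values(flo):
    days = {}
    for date, time, value in date_values(flo):
        days.setdefault(date, []).append(value)
    return days

sample = io.StringIO("2011.05.16,00:00,1.40893\n2011.05.16,01:00,1.40760\n")
days = day_values(sample)
```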
If I name that file freq_count.py and assume your data set is in a file named data, I get:
$ python freq_count.py < data
2011.05.16: [1.40893, 1.4076, 1.4075, 1.40649] (max of 1.40893)
To count the frequency of the max value:
def count_freq(values):
    return len([v for v in values if v == max(values)])
That uses a list comprehension to build a list of all values equal to the maximum of the values input, then takes the length of the resulting list.
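A quick check of count_freq on the sample day's values (repeated here so the snippet is self-contained):

```python
def count_freq(values):
    # Count how many entries equal the maximum of the list.
    return len([v for v in values if v == max(values)])

print(count_freq([1.40893, 1.4076, 1.4075, 1.40649]))
print(count_freq([2.0, 2.0, 1.5]))
```

Note that `max(values)` is re-evaluated for every element inside the comprehension; hoisting it into a local variable before the loop avoids the quadratic cost on long lists.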
Upvotes: 0