pietrovismara

Reputation: 6291

Hourly frequency count with Python

I have hourly CSV data like this, ordered day by day and covering hundreds of days:

2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649

I want to count how many times each hour has set the daily maximum: if the maximum value for 2011.05.16 occurred at 00:00, I add 1 to 00:00, and so on. To do this I used a loop that treats the hours as indexes, like this:

def graph():
    Date, Time, High = np.genfromtxt(myPath, delimiter=",",
                                     unpack=True, converters={0: date_converter})
    numList = [""] * 24
    index = 0
    hour = 0
    count = [0] * 24

    for eachHour in Time:
        numList[hour] += str(High[index])
        index += 1
        hour += 1

        if hour == 24:
            higher = numList.index(max(numList))
            count[higher] += 1
            hour = 0
            numList = [""] * 24

The problem is that my data often has gaps where some hours are missing, but the loop can't detect them and keeps putting values into the next hour index. I've searched everywhere, but I'm new to programming and this is my first "complex" project, so I need answers specific to my case to understand how it works. How do you make an hourly frequency count like the one described? The final result should look like:

00:00 n time max of the day   
01:00 n time max of the day   
02:00 n time max of the day  
etc

Upvotes: 5

Views: 7009

Answers (3)

Andy Hayden

Reputation: 375415

First read in the csv:

In [11]: df = pd.read_csv('foo.csv', sep=',', header=None, parse_dates=[[0, 1]])

In [12]: df.columns = ['date', 'val']

In [13]: df.set_index('date', inplace=True)

In [14]: df
Out[14]: 
                         val
date                        
2011-05-16 00:00:00  1.40893
2011-05-16 01:00:00  1.40760
2011-05-16 02:00:00  1.40750
2011-05-16 03:00:00  1.40649

Use resample to get each day's maximum:

In [15]: day_max = df.resample('D', how='max')

Check whether each value is the day max:

In [16]: df['is_day_max'] = day_max.lookup(df.index.normalize(), len(df) * ['val']) == df.val

In [17]: df
Out[17]: 
                         val is_day_max
date                                   
2011-05-16 00:00:00  1.40893       True
2011-05-16 01:00:00  1.40760      False
2011-05-16 02:00:00  1.40750      False
2011-05-16 03:00:00  1.40649      False

And then sum these over each hour:

In [18]: df.groupby(df.index.time)['is_day_max'].sum()
Out[18]: 
00:00:00    1
01:00:00    0
02:00:00    0
03:00:00    0
Name: is_day_max, dtype: float64
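As far as I know, `resample(..., how=...)` and `DataFrame.lookup` are no longer available in recent pandas releases, so the session above may not run unchanged there. Here is a minimal sketch of the same three steps (daily maximum, per-row comparison, sum per hour), assuming a recent pandas. The sample rows, the `csv_text` name and the `reindex`-based alignment are my own choices for illustration, not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Sample rows in the question's format; the second day starts at 02:00 (a gap).
csv_text = """2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
"""

raw = pd.read_csv(StringIO(csv_text), header=None,
                  names=["date", "time", "val"],
                  dtype={"date": str, "time": str})

# Build a DatetimeIndex from the separate date and time columns.
stamps = pd.to_datetime(raw["date"] + " " + raw["time"], format="%Y.%m.%d %H:%M")
df = pd.DataFrame({"val": raw["val"].to_numpy()}, index=pd.DatetimeIndex(stamps))

# Step 1: maximum per calendar day.
day_max = df["val"].resample("D").max()

# Step 2: align each row with its own day's maximum and compare.
row_day_max = day_max.reindex(df.index.normalize()).to_numpy()
df["is_day_max"] = df["val"].to_numpy() == row_day_max

# Step 3: sum the flags per time of day.
hourly = df.groupby(df.index.time)["is_day_max"].sum()
hourly_counts = {t.strftime("%H:%M"): int(n) for t, n in hourly.items()}
print(hourly_counts)
```

Because every row carries its own timestamp, a missing hour simply does not appear; nothing shifts into the wrong slot.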

Upvotes: 5

joris

Reputation: 139152

A solution with pandas: supposing you have a dataframe with the dates as its index, you can first add a column that flags the maximum value of each day, then group by hour and sum the occurrences:

In [32]: df['daily_max'] = df.groupby(df.index.date).transform(lambda x: x==x.max())
In [33]: df
Out[33]: 
                       value daily_max
date_time                             
2011-05-16 00:00:00  1.40893      True
2011-05-16 01:00:00  1.40760     False
2011-05-16 02:00:00  1.40750     False
2011-05-16 03:00:00  1.40649     False
2011-05-17 02:00:00  1.40893      True
2011-05-17 03:00:00  1.40760     False
2011-05-17 04:00:00  1.40750     False
2011-05-17 05:00:00  1.40649     False
2011-05-18 02:00:00  1.40893      True
2011-05-18 03:00:00  1.40760     False
2011-05-18 04:00:00  1.40750     False
2011-05-18 05:00:00  1.40649     False

In [34]: df.groupby(df.index.time)['daily_max'].sum()
Out[34]: 
00:00:00    1
01:00:00    0
02:00:00    2
03:00:00    0
04:00:00    0
05:00:00    0
Name: daily_max, dtype: float64

With older pandas versions, this gives the same result as above (assuming your df has a DatetimeIndex):

df['date'] = [t.date() for t in df.index.to_pydatetime()]
df['time'] = [t.time() for t in df.index.to_pydatetime()]
df['daily_max'] = df.groupby('date')['value'].transform(lambda x: x==x.max())
df.groupby('time')['daily_max'].sum()

The dataframe I used in this example:

from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd

s="""2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649
2011.05.18,02:00,1.40893
2011.05.18,03:00,1.40760
2011.05.18,04:00,1.40750
2011.05.18,05:00,1.40649"""

df = pd.read_csv(StringIO(s), header=None, names=['date', 'time', 'value'], parse_dates=[['date', 'time']])
df = df.set_index('date_time')
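One thing to keep in mind: `x == x.max()` flags every row that ties for the day's maximum, so a day with two equal peaks is counted twice. If only the first peak per day should count, a possible variant is `idxmax`, sketched below under the assumption of a recent pandas. The shortened sample frame and the names `peak_stamps`, `all_hours` and `counts` are made up for this illustration.

```python
import pandas as pd

# Shortened version of the example frame: days two and three start at 02:00.
stamps = pd.to_datetime([
    "2011-05-16 00:00", "2011-05-16 01:00", "2011-05-16 02:00",
    "2011-05-17 02:00", "2011-05-17 03:00",
    "2011-05-18 02:00", "2011-05-18 03:00",
])
df = pd.DataFrame(
    {"value": [1.40893, 1.40760, 1.40750, 1.40893, 1.40760, 1.40893, 1.40760]},
    index=stamps,
)

# Timestamp of the (first) maximum of each calendar day.
peak_stamps = df.groupby(df.index.normalize())["value"].idxmax()

# Count how often each time of day holds the peak, listing all 24 hours.
peak_hours = pd.DatetimeIndex(peak_stamps).strftime("%H:%M")
all_hours = [f"{h:02d}:00" for h in range(24)]
counts = pd.Series(peak_hours).value_counts().reindex(all_hours, fill_value=0)
print(counts)
```

Reindexing onto `all_hours` also lists the hours that never set a daily maximum, which matches the 00:00, 01:00, 02:00 layout asked for in the question.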

Upvotes: 4

HazyBlueDot

Reputation: 125

I'm not sure exactly what you are trying to calculate, but this is how I would approach calculations on the values in your data set:

from time import strptime,strftime

time_format="%H:%M"
date_format="%Y.%m.%d"

def date_values(flo):
    for line in flo:
        try:
            date_str, time_str, value = line.split(',')
            date = strptime(date_str, date_format)
            time = strptime(time_str, time_format)
            value = float(value)
            yield (date, time, value)
        except ValueError:
            pass

def day_values(flo):
    days = {}
    for date,time,value in date_values(flo):
        try:
            days[date].append(value)
        except KeyError:
            days[date] = [ value ]

    return days

if __name__ == '__main__':
    from sys import stdin

    for day,values in day_values(stdin).items():
        print("{0}: {1} (max of {2})".format(
              strftime(date_format, day),
              values, 
              max(values)))

The date_values function is a generator that yields one tuple per valid line of your input. The day_values function uses it to build a dictionary whose keys are the dates and whose values are lists of that date's values. From your description I'm not sure how the time comes into play; if you can clarify, or provide a larger data set with an example of the result you would like to see, I can expand on this.

If I name that file freq_count.py, and assuming your data set is in a file named data, I get:

$ python freq_count.py < data
2011.05.16: [1.40893, 1.4076, 1.4075, 1.40649] (max of 1.40893)

To count the frequency of the max value:

def count_freq(values):
    highest = max(values)  # compute the maximum once, not once per element
    return len([v for v in values if v == highest])

This uses a list comprehension to build a list of all the values equal to the maximum of the input, then takes the length of that list.
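To bring the time of day into this, the same per-date grouping can remember at which time each day's maximum occurred and then tally those times. The following is only a sketch under my reading of the question, using the standard library; `hourly_max_counts` and the sample lines are hypothetical names and data, and on a tie the earliest row of the day wins.

```python
from collections import Counter


def hourly_max_counts(lines):
    """Count, per time of day, how often that time held its day's maximum."""
    best = {}  # date string -> (highest value so far, time string of that value)
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        date_str, time_str, raw_value = parts
        try:
            value = float(raw_value)
        except ValueError:
            continue  # skip rows whose value is not a number
        # Strict '>' keeps the earliest time when two values tie.
        if date_str not in best or value > best[date_str][0]:
            best[date_str] = (value, time_str)
    return Counter(time_str for _, time_str in best.values())


sample = [
    "2011.05.16,00:00,1.40893",
    "2011.05.16,01:00,1.40760",
    "2011.05.17,02:00,1.40893",
    "2011.05.17,03:00,1.40760",
    "2011.05.18,02:00,1.40893",
    "2011.05.18,03:00,1.40760",
]
counts = hourly_max_counts(sample)
for hour in sorted(counts):
    print(hour, counts[hour])
```

Since each row is keyed by its own date and time strings rather than by its position in the file, missing hours cannot shift later values into the wrong slot.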

Upvotes: 0
