Reputation: 43
I have a datafile with several months of minute data with lines like "2016-02-02 13:21(\t)value(\n)".
I need to plot the data (no problem with that) and calculate + plot an average for each month.
Is it possible in gnuplot?
I am able to get an overall average using
fit a "datafile" using 1:3 via a
I am also able to specify some time range for the fit using
fit [now_secs-3600*24*31:now_secs] b "datafile" using 1:3 via b
... and then plot them with
plot a t "Total average",b t "Last 31 days"
But I have no idea how to calculate and plot an average for each month (= one stepped line showing each month's average).
Upvotes: 2
Views: 1413
Reputation: 25724
You can also compute averages with the option smooth unique, which has been available at least since gnuplot 4.0 (2004).
The following is an example which runs with gnuplot>=5.0.0 (Jan. 2015).
Some explanations (intentionally kept outside the script to keep it readable):
Some random test data is created.
The option smooth unique returns the average of the y-values for identical x-values.
So, if the format fmtYm = "%Y-%m" only contains the year and month, all x-values in, e.g., 2024-03 will return 2024-03-01 00:00, and hence all y-values for this month will be averaged. Check help smooth, option unique.
I want this average value plotted with boxes. However, gnuplot would center the box at, e.g., t0=2024-03-01 00:00 and not at the middle of the month.
So, you have to calculate the middle of the month by calculating the start of the next month, e.g. t1=2024-04-01 00:00, using the functions tm_year() and tm_mon() (check help tm_year, help tm_mon, help sprintf, help strptime), and then taking (t0+t1)/2.
Note that the months returned by tm_mon() are zero based, so you have to add +2 (+1 to convert to a one-based month number, +1 to advance to the following month) before building the date string via sprintf() and parsing it with strptime(). Fortunately, gnuplot interprets, e.g., 2024-13 as 2025-01.
Script: (works for gnuplot>=5.0.0, Jan. 2015)
### plot time data and monthly average
reset session
fmt = "%Y-%m-%d %H:%M"
fmtYm = "%Y-%m" # format only year and month
# create some random test data
set print $Data
t0 = time(0) # now
y0 = 100
do for [i=0:1000] {
print sprintf("%s %g", strftime(fmt,t0+i*3600*8), y0=y0+rand(0)*2-1)
}
set print
set format x "%Y\n%m-%d" timedate
set style fill transparent solid 0.4
set key noautotitle invert
tmc(col) = (t0=timecolumn(col,fmtYm), t1=strptime(fmtYm,sprintf("%4d-%02d",tm_year(t0),tm_mon(t0)+2)), (t0+t1)/2.)
plot $Data u (tmc(1)):3 smooth unique w boxes lc rgb 0xccccff ti "monthly average", \
'' u (timecolumn(1,fmt)):3 w l lc "red" ti "data"
### end of script
Upvotes: 0
Reputation: 7590
Here is a way to do it purely in gnuplot. It computes the ordinary average (the arithmetic mean) for each month, treating each data point as one value for the month, and it works whether or not the data starts in January. As written, it assumes all data falls within a single calendar year; adapting it to files that cross a year boundary or span more than one year takes considerable effort. With somewhat significant modification, it can be used to compute weighted averages as well.
This makes significant use of the stats command to compute values. It is a little long, partly because I commented it heavily. It uses 5.0 features (NaN for undefined values and in-memory datablocks instead of temporary files), but comments note how to change these for earlier versions.
Note: This script must be run before setting time mode. The stats command will not work in time mode. Time conversions are handled by the script functions.
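datafile = "datafile" # name of the data file (the script expects this string variable; the name is the one used in the question)
data_time_format = "%Y-%m-%d %H:%M" #date format in file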
date_cols = 2 # Number of columns consumed by date format
# get numeric month value of time - 1=January, 12=December
get_month(x) = 0+strftime("%m",strptime(data_time_format,x))
# get numeric year value of time
get_year(x) = 0+strftime("%Y",strptime(data_time_format,x))
# get internal time representation of day 1 of month x in year y
get_month_first(x,y) = strptime("%Y-%m-%d",sprintf("%d-%d-01",y,x))
# get internal time representation of date
get_date(x) = strptime(data_time_format,x)
# get date string in file format corresponding to day y in month x of year z
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-%02d",z,x,y)))
# determine if date represented by z is in month x of year y
check_valid(x,y,z) = (get_date(z)>=get_month_first(x,y))&(get_date(z)<get_month_first(x+1,y))
# Determine year and month range represented by file
year = 0
stats datafile u (year=get_year(strcol(1)),get_month(strcol(1))) nooutput
month_min = STATS_min
month_max = STATS_max
# list of average values for each month
aves = ""
# fill missing months at beginning of year with 0
do for[i=1:(month_min-1)] {
aves = sprintf("%s %d",aves,0)
}
# compute average of each month and store it at the end of aves
do for[i=month_min:month_max] {
# In versions prior to 5.0, replace NaN with 1/0
stats datafile u (check_valid(i,year,strcol(1))?column(date_cols+1):NaN) nooutput
aves = sprintf("%s %f",aves,STATS_mean)
}
# day on which to plot average
baseday = 15
# In version prior to 5.0, replace $k with a temporary file name
set print $k
# Change this to start at 1 if we want to fill in prior months
do for [i=month_min:month_max] {
print sprintf("%s %s",get_date_string(i,baseday,year),word(aves,i))
}
set print
This script will create either an in-memory datablock or, for earlier versions (with the noted changes), a temporary file. Either one has the same layout as the original file but contains one entry per month, whose value is the monthly average.
At the beginning we need to define our date format and the number of columns that the date format consumes. From then on it is assumed that the data file is structured as datetime value. Several functions are defined which make extensive use of the strptime function (to convert a date string to the internal time representation) and the strftime function (to convert the internal representation to a string). Some of these functions convert both ways in order to extract the necessary values. Note the addition of 0 in the get_month and get_year functions to convert a string value to an integer.
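We do several steps with the data in order to build our resulting datablock/file. First, one stats pass determines the year and the first and last month present in the file. Next, the months before the first one are filled with 0 in the string aves. Then one stats pass per month computes that month's mean, which is appended to aves. Finally, one line per month is printed to the datablock, dated on baseday and carrying the corresponding average.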
Now to demonstrate this, suppose that we have the following data
2016-02-03 15:22 95
2016-02-20 18:03 23
2016-03-10 16:03 200
2016-03-15 03:02 100
2016-03-18 02:02 200
We wish to plot this data along with the average value for each month. We can run the above script, and we will get a datablock $k (make the commented change near the bottom to use a temporary file instead) containing the following
2016-02-15 00:00 59.000000
2016-03-15 00:00 166.666667
These are exactly the average values for each month. Now we can plot with
set xdata time
set timefmt data_time_format
set key outside top right
plot $k u 1:3 w points pt 7 t "Monthly Average",\
datafile u 1:3 with lines t "Original Data"
Here, just for illustration, I used points with the averages. Feel free to use any style that you want. If you choose to use steps, you will very likely want to adjust the day that is assigned† in the datablock/temporary file (probably the first or last day in the month depending on how you want to do it).
It is usually easier with a task like this to do some outside preprocessing, but this demonstrates that it is possible in pure gnuplot.
† For example, to use the last day of the month, the function can be defined as
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-01",z,x+1))-24*60*60)
This version actually computes the first day of the next month, and then subtracts one whole day from that. The second argument is ignored in this version, but preserved to allow it to be used without having to make any additional changes to the script.
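As a sketch of the outside preprocessing mentioned above, the monthly means could also be computed before gnuplot ever sees the data. The following Python example is only an illustration, not part of the original answer: the function names monthly_means and format_for_gnuplot are hypothetical, and it assumes the input lines are shaped like "YYYY-MM-DD HH:MM value" with whitespace (space or tab) between the fields. Because it groups by (year, month), it is not limited to a single calendar year.

```python
from collections import defaultdict
from datetime import datetime


def monthly_means(lines, time_format="%Y-%m-%d %H:%M"):
    """Return {(year, month): mean} for lines like 'YYYY-MM-DD HH:MM value'."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        stamp = datetime.strptime(parts[0] + " " + parts[1], time_format)
        key = (stamp.year, stamp.month)
        sums[key] += float(parts[2])
        counts[key] += 1
    # Build the result in chronological order
    return {key: sums[key] / counts[key] for key in sorted(sums)}


def format_for_gnuplot(means, day=15):
    """One line per month, dated on the given day, in the input file's layout."""
    return ["%04d-%02d-%02d 00:00 %f" % (year, month, day, mean)
            for (year, month), mean in means.items()]


# Sample data from the answer above
sample = [
    "2016-02-03 15:22 95",
    "2016-02-20 18:03 23",
    "2016-03-10 16:03 200",
    "2016-03-15 03:02 100",
    "2016-03-18 02:02 200",
]

for row in format_for_gnuplot(monthly_means(sample)):
    print(row)
```

For the sample data this prints the same two lines as the datablock $k above. Written to a file, the output can be plotted with the same plot command, using that file's name in place of $k.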
Upvotes: 2
Reputation: 3765
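With a recent version of gnuplot, you have the stats command and you can do something like this:
fmt = "%Y-%m-%d %H:%M"
month_sec = 3600*24*30.5
# stats cannot be used in time mode, so column 1 is converted to seconds with timecolumn() (gnuplot>=5.0)
stats "datafile" using (timecolumn(1,fmt)):3 name "m0"
do for [month=1:12] {
stats [now_secs-month*month_sec:now_secs-(month-1)*month_sec] "datafile" using (timecolumn(1,fmt)):3 name sprintf("m%d",month)
}
You get the variable m0_mean_y for the total mean (with two columns, stats appends _x and _y to the variable names), and m1_mean_y, m2_mean_y, etc. for the previous months. Note that each "month" here is a fixed 30.5-day window counted back from now_secs, not a calendar month.
Finally, to plot them you can do something like:
plot "datafile" using 1:3, for [month=0:12] value(sprintf("m%d_mean_y",month))
See help stats, help for, help value, and help sprintf for more information on the above commands.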
Upvotes: 0