user2650277

Reputation: 6739

Draw a curve of best fit with gnuplot

Suppose I have a set of x,y points to plot for an image with gnuplot. It works as expected and I get a nice curve. I want to repeat the experiment for a large dataset of images (say 1000). At that point I would get 1000 curves on one plot, one curve per image. How do I tell gnuplot to draw a single best-fit curve through all of them?

I would like gnuplot to give me the x,y points of the best-fit curve in a CSV, as I plan to combine the best fits into a single plot later.

The data can be found here

Upvotes: 0

Views: 4362

Answers (2)

user8153

Reputation: 4095

If I understand you correctly, you want to draw an average line through the data rather than fit the data to a function. You can do this using the smooth option of the plot command.

Depending on your needs you could draw an interpolation function through your data. For example:

plot \
"libjpeg-2000-bench.png.csv" u 3:5 w p, \
"libjpeg-2000-mural.png.csv" u 3:5 w p, \
"libjpeg-2000-red-room.png.csv" u 3:5 w p, \
"libjpeg-bench.png.csv" u 3:5 w p, \
"libjpeg-mural.png.csv" u 3:5 w p, \
"libjpeg-red-room.png.csv" u 3:5 w p, \
 "< tail -q -n +4  libjpeg*csv" u 3:5 smooth acsplines   w l lw 2

gives

(plot: the six data sets as points, with a single smooth acsplines curve through the combined data)

You might want to experiment with the various smoothing functions, see help smooth. Some of those functions also take additional parameters. For example, you can specify a weight for the acsplines interpolation:

plot \
"libjpeg-2000-bench.png.csv" u 3:5 w p, \
"libjpeg-2000-mural.png.csv" u 3:5 w p, \
"libjpeg-2000-red-room.png.csv" u 3:5 w p, \
"libjpeg-bench.png.csv" u 3:5 w p, \
"libjpeg-mural.png.csv" u 3:5 w p, \
"libjpeg-red-room.png.csv" u 3:5 w p, \
"< tail -q -n +4  libjpeg*csv" u 3:5:(100) smooth acsplines title "acsplines, weight = 100" w l lw 2,  \
"< tail -q -n +4  libjpeg*csv" u 3:5:(0.1) smooth acsplines title "acsplines, weight = 0.1" w l lw 2

(plot: the same data with two acsplines curves, weight 100 and weight 0.1)

The choice of the weight involves a trade-off: if the weight is large then the curve will follow the data points more closely, but will likely exhibit oscillations.
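This trade-off can be reproduced outside gnuplot as well. As a rough analogy (this is an assumption, not part of the answer above), scipy's UnivariateSpline exposes a smoothing factor s that plays approximately the inverse role of the acsplines weight: a smaller s forces the spline to follow the points more closely. A minimal sketch with made-up data:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# synthetic noisy data, standing in for the csv columns
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# small s: the spline hugs the noisy points (like a large acsplines weight);
# large s: a smoother curve that averages the noise away
tight = UnivariateSpline(x, y, s=0.1)
loose = UnivariateSpline(x, y, s=10.0)

# the residual sum of squares grows as the smoothing factor grows
print(np.sum((tight(x) - y) ** 2) <= np.sum((loose(x) - y) ** 2))
```

The names tight/loose are just illustrative; the point is only that one knob trades fidelity to the points against smoothness, exactly as the acsplines weight does.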

Alternatively you can bin the data points in the x direction, and average those data points that fall within the same bin. Luckily you can do all this from within gnuplot:

round(x) = floor(x+0.5)
bin(x,binwidth) = binwidth*round(x/binwidth)
binwidth = 1.
plot \
"libjpeg-2000-bench.png.csv" u 3:5 w p, \
"libjpeg-2000-mural.png.csv" u 3:5 w p, \
"libjpeg-2000-red-room.png.csv" u 3:5 w p, \
"libjpeg-bench.png.csv" u 3:5 w p, \
"libjpeg-mural.png.csv" u 3:5 w p, \
"libjpeg-red-room.png.csv" u 3:5 w p, \
 "< tail -q -n +4  libjpeg*csv"  u (bin($3,binwidth)):5 smooth uniq  w l lw 2

gives

(plot: the data with the binned-average curve)

Here you can adjust the bin size binwidth to your needs.
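For comparison, the gnuplot binning above (round each x to a bin center, then let smooth uniq average the y-values sharing a center) can be sketched in Python; the data values here are made up:

```python
from collections import defaultdict

def bin_average(points, binwidth):
    """Round x to the nearest bin center and average y per bin,
    mirroring gnuplot's bin() function followed by smooth uniq.
    (Note: Python's round() uses banker's rounding at .5, while the
    gnuplot round(x) above uses floor(x+0.5).)"""
    bins = defaultdict(list)
    for x, y in points:
        center = binwidth * round(x / binwidth)
        bins[center].append(y)
    return sorted((c, sum(ys) / len(ys)) for c, ys in bins.items())

points = [(0.9, 1.0), (1.1, 3.0), (2.2, 5.0)]
print(bin_average(points, 1.0))  # [(1.0, 2.0), (2.0, 5.0)]
```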

Upvotes: 2

ewcz

Reputation: 13087

I have to admit that it's not completely clear to me what exactly you want to achieve. Nevertheless, I have the feeling that, as mentioned by @KevinBoone in the comments, you are trying to compute some kind of binned statistic on the data. If so, gnuplot is unfortunately not the proper tool for the task; in my opinion it would be much more practical to delegate this processing step to something more appropriate.

As an example, let's say that the strategy would indeed be:

  1. load all the csv files in the current directory
  2. divide the x-range into M bins and calculate the average of the y-values that fall into each of the bins
  3. plot this "averaged" data

To this end, one might prepare a short Python script (implementing the steps outlined above) based on the binned_statistic function provided by scipy. The required number of bins is passed as the first argument, while the remaining arguments are interpreted as CSV files to process:

#!/usr/bin/env python
import sys

import numpy as np
from scipy.stats import binned_statistic

num_of_bins = int(sys.argv[1])

data = []
for fname in sys.argv[2:]:
    with open(fname, 'r') as F:
        for line_id, line in enumerate(F):
            # skip the three header lines of each csv file
            if line_id < 3:
                continue

            cols = line.strip().split(',')
            # columns 3 and 4 (zero-based indices 2 and 3) hold x and y
            x, y = map(float, [cols[i] for i in [2, 3]])
            data.append((x, y))

data = np.array(data)
# average the y-values falling into each of the num_of_bins bins along x
stat, bin_edges, _ = binned_statistic(data[:, 0], data[:, 1], 'mean', bins=num_of_bins)

# print each bin center together with its average as csv
for val, (lb, ub) in zip(stat, zip(bin_edges, bin_edges[1:])):
    print('%E,%E' % ((lb + ub) / 2, val))

Now, in gnuplot, we can invoke this script (let's say it is stored in the current working directory as stat.py) externally and plot its output together with the individual files:

set terminal pngcairo enhanced
set output 'fig.png'

#get all csv files in current directory as a space-delimited string
files = system("ls *.csv | xargs")

#construct a "pretty" label from the file name
getLabel(fname)=system(sprintf('echo "%s" | gawk -F"-" "BEGIN{OFS=\"-\"} {NF=NF-2;print}"', fname))

set datafile separator ","
set key spacing 1.5

LINE_WIDTH = 1.25
plot \
    for [filename in files] filename u 3:4 w l lw LINE_WIDTH t getLabel(filename), \
    sprintf('<python ./stat.py 20 %s', files) w l lw 3*LINE_WIDTH lc rgb 'red' t 'average'

With some of the sample data you provided in the comments, this produces: (plot: the individual curves together with the binned average in red)

However, as pointed out by @KevinBoone, whether this "average" has a justifiable mathematical meaning in your specific setting is another question on its own...

Upvotes: 1
