kirbo

Reputation: 1727

gnuplot computing stats over multiple columns

I have a simple 9-column file. I want to compute certain statistics for each column and then plot them (using gnuplot).

1) This is how I compute statistics for every column excluding the first one.

stats 'data' every ::2 name "stats"

2) In the output screen I can see that the operation is successful. Note that the number of columns/records is 8

* FILE: 
  Records:      8
  Out of range: 0
  Invalid:      0
  Blank:        0
  Data Blocks:  1

* COLUMNS:
  Mean:          6.5000       491742.6625
  Std Dev:       2.2913          703.4865
  Sum:          52.0000       3.93394e+06
  Sum Sq.:     380.0000       1.93449e+12

  Minimum:       3.0000 [0]   490312.0000 [2]
  Maximum:      10.0000 [7]   492643.5000 [7]
  Quartile:      4.5000       491329.5000
  Median:        6.5000       491911.1500
  Quartile:      8.5000       492252.2500

  Linear Model: y = 121.8 x + 4.91e+05
  Correlation:  r = 0.3966
  Sum xy:       2.558e+07

3) Now I can access statistics on the first 2 columns by appending _x and _y like this

print stats_median_x
print stats_median_y

My questions are:

  • How can I access statistics of the other column?
  • How can I plot for example all medians?

I know that I can simply add a python script to pre-compute all this, but I would prefer to avoid it if there is an easy way to do it using gnuplot itself.

Thanks!

Upvotes: 2

Views: 5464

Answers (1)

Hastur

Reputation: 2818

Short answer(s)

  • "How can I access statistics of the other column?"
    With stats 'data' using n you get the statistics of the nth column.
  • "How can I plot for example all medians?"
    For example, set print combined with a do for loop can write a data file that you can then plot.

A working solution

    set print "StatDat.dat"            # redirect print output to a file
    do for [i=2:9] {                   # i is the column number
      stats 'data.dat' using i nooutput
      print i, STATS_median, STATS_mean, STATS_stddev   # ...
    }
    set print                          # close the file, send print back to stderr
    plot "StatDat.dat" using 1:2       # column 2 holds the medians; use whichever column you want
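
Since each line of StatDat.dat holds the column number, median, mean and standard deviation, one possible variation is to plot the means with the standard deviation as error bars and overlay the medians. This is only a sketch: the axis label, the x range and the titles are illustrative choices, not part of the solution above.

```gnuplot
# Assumes StatDat.dat was written by the loop above, one line per data column:
#   <column number> <median> <mean> <stddev>
set xlabel "column of data.dat"
set xrange [1.5:9.5]          # a little room around columns 2..9
plot "StatDat.dat" using 1:3:4 with yerrorbars title "mean +/- std dev", \
     "StatDat.dat" using 1:2 with points title "median"
```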

A few more words about it
Asking gnuplot itself, with help stats, turns up a lot of interesting things :-).

Syntax:
stats 'filename' [using N[:M]] [name 'prefix'] [[no]output]
This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See plot for details on the index, every, and using directives.

  • The first highlighted phrase ("one or two columns of a file") tells us that stats handles one column, or at most two, per call (a pity; maybe that will change in the future).
  • The second highlighted sentence ("The using specifier is interpreted in the same way as for plot commands") tells us that it follows the syntax of the plot command:
    so stats 'data' using 3 gives you the statistics of the 3rd column (in variables without a suffix, such as STATS_median),
    and stats 'data' using 4:5 those of the 4th and 5th columns, as x and y.
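
As a rough illustration of the two forms, the sketch below assumes a file named data with at least five numeric columns; the prefix col3 is an arbitrary name chosen for the example.

```gnuplot
# One column: results land in variables without a suffix.
stats 'data' using 3 nooutput
print "column 3 median: ", STATS_median

# Two columns: the same quantities get _x and _y suffixes.
stats 'data' using 4:5 nooutput
print "column 4 mean: ", STATS_mean_x
print "column 5 mean: ", STATS_mean_y

# With name, the given prefix replaces STATS ("col3" is an arbitrary choice).
stats 'data' using 3 name "col3" nooutput
print "column 3 std dev: ", col3_stddev
```

Using a separate prefix per column keeps earlier results around, since each new stats call otherwise overwrites the STATS_ variables.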

Notes about your interpretations

  1. You said

    This is how I compute statistics for every column excluding the first one.
    stats 'data' every ::2 name "stats"

    Not really: this gives the statistics for the first two columns, excluding the first two lines. The point counter starts from 0, not from 1, so every ::2 starts at the third line.

  2. As a consequence of the above, when we read

    Records: 8

    it means that 8 lines were used: your file had 10 (usable) lines, you specified every ::2 and so skipped the first two, which leaves 8 records for the statistics.
    This also makes it clearer what help stats means by

    STATS_records           # total number of in-range data records
    

    implying "used to compute this statistic".
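
One way to check this interpretation is to compare STATS_records with and without the every clause. The sketch below assumes the file data from the question; n_all and n_skip2 are just illustrative variable names.

```gnuplot
# every takes point_incr:block_incr:start_point:..., and points count from 0,
# so "every ::2" starts at the third line of the block.
stats 'data' using 1 nooutput
n_all = STATS_records            # all usable lines

stats 'data' every ::2 using 1 nooutput
n_skip2 = STATS_records          # first two lines dropped

print "all lines: ", n_all, "   with every ::2: ", n_skip2
print "skipped: ", n_all - n_skip2
```

For a file with 10 usable lines, as described above, the difference should come out as 2.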

Tested on gnuplot 4.6 patchlevel 4
Working on gnuplot Version 5.0 patchlevel 1

Upvotes: 7
