Markus Weninger

Reputation: 12668

Gnuplot: Grouping data by certain column for plot

Imagine the following file format:

Type Method Result Min  Max
-------------------------------
POGC Fast   10.4   9.4  15.6
POGC Slow   20.3   14.2 25.5
G1   Fast   5.0    4.4  5.2
G1   Slow   11.1   6.8  13.0

or, in CSV

Type;Method;Result;Min;Max
POGC;Fast;10.4;9.4;15.6
POGC;Slow;20.3;14.2;25.5
G1;Fast;5.0;4.4;5.2
G1;Slow;11.1;6.8;13.0

which represents the results of some benchmark runs. What I would like to do is split this data into groups based on the Type column, drawing one box per Method in each group, given the Result (y) and the deviation (yMin and yMax). The result should look something like the following:

[Image: example chart]

Is something like this possible in gnuplot?

In my real data source, there would be 2 groups ("types") and 7 bars ("methods") per group.

I looked into set style histogram, but I was not able to figure out whether it could be used for my plot. If I understood the documentation correctly, histogram starts a new group for each line and draws one box per group for each column given in the plot command (for example, plot 'file.dat' using 2, '' using 4, '' using 6 would result in 3 bars per group, and one group per row).

Upvotes: 2

Views: 3303

Answers (1)

Matthew

Reputation: 7590

This is probably easier if the data is reformatted into a different layout. Using a layout like

Type Fast_Result Fast_Min Fast_Max Slow_Result Slow_Min Slow_Max

would make this trivial. An external program can be used to reformat the data. However, it is possible without doing any reformatting.
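As a rough sketch of such an external reformatting step, the Python snippet below pivots the semicolon-separated sample from the question into the one-row-per-type layout shown above. The function name to_wide and the inline sample string are illustrative assumptions, not part of the original answer, and the sketch assumes every type has a row for every method.

```python
import csv
import io

# Sample input copied from the question (semicolon-separated, with header).
LONG_CSV = """Type;Method;Result;Min;Max
POGC;Fast;10.4;9.4;15.6
POGC;Slow;20.3;14.2;25.5
G1;Fast;5.0;4.4;5.2
G1;Slow;11.1;6.8;13.0
"""


def to_wide(text, sep=";"):
    """Pivot long rows (Type, Method, Result, Min, Max) into one row per Type.

    Hypothetical helper: assumes each (Type, Method) pair occurs exactly once.
    """
    types, methods, cells = [], [], {}
    for row in csv.DictReader(io.StringIO(text), delimiter=sep):
        # Remember types and methods in order of first appearance.
        if row["Type"] not in types:
            types.append(row["Type"])
        if row["Method"] not in methods:
            methods.append(row["Method"])
        cells[(row["Type"], row["Method"])] = (row["Result"], row["Min"], row["Max"])

    header = ["Type"] + [f"{m}_{f}" for m in methods for f in ("Result", "Min", "Max")]
    lines = [" ".join(header)]
    for t in types:
        values = [v for m in methods for v in cells[(t, m)]]
        lines.append(" ".join([t] + values))
    return "\n".join(lines)


print(to_wide(LONG_CSV))
```

For real use, the inline string would be replaced by the contents of the data file, and the printed text redirected to a new file for gnuplot to read.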

We need to assume that the types and methods have no spaces in the name. This allows us to use gnuplot string variables and the word/words functions to simulate arrays with them. If this assumption isn't met, this is significantly more difficult to accomplish.

For most of this, I am going to assume that the data looks like

POGC Fast   10.4   9.4  15.6
POGC Slow   20.3   14.2 25.5
G1   Fast   5.0    4.4  5.2
G1   Slow   11.1   6.8  13.0

If we use the CSV file, we can just do set datafile separator ";" (the sample CSV is semicolon-separated). If the first line is a header line, we can have gnuplot skip it with set key autotitle columnhead. With these two commands, there shouldn't be any difference in the remaining commands.

Suppose that we have two variables, types and methods, containing the values of all possible types and methods

types = "POGC G1"
methods = "Fast Slow"

We first place the x-axis labels at the center of each type's set of boxes. We add one extra box position to each group to leave a space between groups. The first tics command effectively "clears" all tics so that we can add the needed ones one by one:

set xtics ()
set for[i=1:words(types)] xtics add (word(types,i) (1+words(methods))/2.0+(i-1)*(words(methods)+1))

Now, we will set the boxwidth explicitly with set boxwidth 0.9. We use a value slightly less than 1 to allow a gap between each box.

Next, we will need a couple of functions. One will get the index in one of these list variables, the other will determine the x-coordinate to place a box at.

wordix(list,word) = sum[i=1:words(list)] (word(list,i) eq word)?i:0
xval(ty,me) = (wordix(types,ty)-1)*(words(methods)+1)+wordix(methods,me)
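To check the positions these two functions produce, here is a small Python mirror of the same arithmetic. It is only a sketch for verifying the numbers by hand: the Python function names, and the group_center helper that averages a group's first and last box positions, are assumptions added for illustration.

```python
def wordix(words, word):
    """1-based index of word in the list, or 0 if absent (mirrors the gnuplot sum)."""
    return sum(i for i, w in enumerate(words, start=1) if w == word)


def xval(types, methods, ty, me):
    """x position of the box for type ty and method me.

    Each group occupies len(methods) box positions plus one empty gap position.
    """
    return (wordix(types, ty) - 1) * (len(methods) + 1) + wordix(methods, me)


def group_center(types, methods, ty):
    """Hypothetical helper: midpoint between a group's first and last box."""
    first = xval(types, methods, ty, methods[0])
    last = xval(types, methods, ty, methods[-1])
    return (first + last) / 2.0


types = "POGC G1".split()
methods = "Fast Slow".split()

for ty in types:
    for me in methods:
        print(ty, me, "-> x =", xval(types, methods, ty, me))
    print(ty, "label at x =", group_center(types, methods, ty))
```

With the sample data this puts the POGC boxes at x = 1 and 2, leaves x = 3 empty as the gap, and puts the G1 boxes at x = 4 and 5, so the labels belong at 1.5 and 4.5. With 2 types and 7 methods, the groups would span 1 to 7 and 9 to 15.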

Because the box style tends to truncate the bottom of boxes, we will explicitly set our range with set yrange[0:*].

For the boxes, we need to iterate over each method, plotting them one at a time, to make sure that they use different styles and each get an entry in the key. This requires a conditional check to decide which boxes to plot. In the condition we select the third column if the row belongs to the current method, or the invalid value 1/0 if it doesn't, which causes gnuplot to skip the box. We use the vectors style to draw the range lines. We can draw these all at once, because they are all styled the same. Now we can plot with the following command (see note 1 at the end):

plot for[z=1:words(methods)] "data.txt" u (xval(strcol(1),strcol(2))):(strcol(2) eq word(methods,z)?$3:1/0) with boxes lt z t word(methods,z), \
     "" u (xval(strcol(1),strcol(2))):4:(0):($5-$4) with vectors lc black nohead not

to produce

[Image: resulting plot]


As for setting our initial types and methods variables, we either have to set them in the script or use external programs. We will assume that the data is in the semicolon-delimited CSV format with a header row and is named data.txt.

If python3 is available, define a function (using Windows shell quoting)

getcolumnvalues(x) = sprintf('python -c "data=set([x.split(\";\")[%d] for x in open(\"data.txt\",\"r\")][1:]);print(*sorted(data))"',x-1)

or, if python3 isn't available but standard Unix programs (awk, sort, uniq, and paste) are, we can define this as (again with Windows shell quoting)

getcolumnvalues(x) = sprintf('awk -F; "(NR>1) {print $%d;}" data.txt | sort | uniq | paste -s -d" "',x)

Now, we can set our variables like

types = system(getcolumnvalues(1))
methods = system(getcolumnvalues(2))
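The Python one-liner is dense, so here is an expanded sketch of what it computes, run against an inline copy of the sample data instead of data.txt. The function name column_values and its keep_order option are assumptions added for illustration. The option is there because sorting, as the one-liner does, yields "G1 POGC" rather than the hand-written "POGC G1", which changes the order of the groups in the plot.

```python
import csv
import io

# Inline copy of the semicolon-separated sample (header row included).
SAMPLE = """Type;Method;Result;Min;Max
POGC;Fast;10.4;9.4;15.6
POGC;Slow;20.3;14.2;25.5
G1;Fast;5.0;4.4;5.2
G1;Slow;11.1;6.8;13.0
"""


def column_values(text, column, sep=";", keep_order=False):
    """Distinct values of a 1-based column, skipping the header row.

    Hypothetical helper: sorted by default (like the one-liner), or in
    order of first appearance when keep_order is True.
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    next(reader)  # skip the header row
    values = [row[column - 1] for row in reader if row]
    if keep_order:
        return list(dict.fromkeys(values))  # dicts preserve insertion order
    return sorted(set(values))


# Space-separated output, ready to be used as a gnuplot word list.
print(*column_values(SAMPLE, 1))
print(*column_values(SAMPLE, 1, keep_order=True))
print(*column_values(SAMPLE, 2))
```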

1 I normally like to use i as my iteration variable, but notice that the wordix function uses that same variable for iteration. As we call that function during each iteration (through the xval function), we need to use a different variable for the plot iteration. This is an easy mistake to miss (I spent about 15 minutes while typing this up trying to figure out why it wasn't working because of that). In cases like this, it is important to remember that gnuplot, while having some powerful programming structures, does not have the scoping rules that would protect us in most languages. All variables are "global" and we must be careful of names.

Upvotes: 3
