Reputation: 107
So, I have an output that looks like this:
samples pops condition 1 condition 2 condition 3
A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
However, for the next analysis I need the input to look like this:
samples pops condition 1 condition 1 condition 2 condition 2 condition 3 condition 3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
So it is not simply a matter of turning every other row into a new column: within each condition column, the value from a sample's second row has to become a new column next to the first one, so that each sample ends up with two columns per condition instead of two rows. The example has 2 samples and 3 conditions, but IRL I have over 100 samples and over 1000 conditions. Any thoughts? I am confident it can be done with awk, but I just cannot figure it out.
Upvotes: 2
Views: 220
Reputation: 18697
A simple solution (no output headers) with GNU datamash (which is a nice tool for "command-line statistical operations" on textual files):
$ grep -v ^$ file | datamash -W -g1 --header-in first 2 collapse 3-5 | tr ',' ' ' | column -t
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
First, skip all blank lines with grep, then with datamash group lines according to the first field (-g1), using whitespace(s) as field separators (-W), collapsing multiple rows in a group for fields 3, 4 and 5. Collapsed values are comma separated, that's why we have to break them with tr.
For a different number of columns, just adapt the range for the collapse operation (e.g. collapse 3-1000). And due to the grouping operation, any number of samples per group is already supported.
Upvotes: 2
Reputation: 754110
Taking the assertion 'the data is perfect' at face value and disregarding years of experience which indicates that data is seldom if ever perfect, then:
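awk '# Header: print each condition name twice.
     NR == 1 { printf "%s %s %s %s %s %s %s %s\n",
                      $1, $2, $3, $3, $4, $4, $5, $5; next }
     # Line 2 is the blank line after the header: skip it.
     NR == 2 { next }
     # Odd lines hold the first row of a sample: save its conditions.
     NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
     # Even lines hold the second row: print saved and current values interleaved.
     NR % 2 == 0 { printf "%s %d %d %d %d %d %d %d\n",
                          $1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"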
Given the input file (note the blank line after the header, which the NR == 2 rule skips):
samples pops condition_1 condition_2 condition_3

A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
the script produces the output:
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
This code is more mechanical than interesting. If you have 10 columns in each line, you'd approach it differently. You'd probably use loops to save and print the data. If you want a blank line between the headings and the data, you can easily add one (use NR == 2 { print; next }, or use \n\n in place of \n in the first printf statement). You can arrange for the output fields to be separated by tabs if you wish (they're separated by single spaces in the code as shown).
The code does not depend on tabs separating the data fields; it only depends on there being no white space within a field.
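As a rough sketch of the tab-separated variant: the separator can be passed in as a variable and the output line built up piece by piece. The function name pair_rows_tsv and the variable names sep and have are made up for this example. It also differs from the code above in one respect, which is an assumption on my part: it skips any blank line rather than only line 2, and pairs data rows in the order they arrive.

```shell
# Hypothetical helper: pair consecutive data rows and emit tab-separated output.
pair_rows_tsv() {
  awk -v sep='\t' '
    # Header: print each condition name twice.
    NR == 1 { line = $1 sep $2
              for (i = 3; i <= NF; i++) line = line sep $i sep $i
              print line; next }
    # Skip blank lines wherever they occur.
    NF == 0 { next }
    # First row of a pair: save its condition values.
    !have   { for (i = 3; i <= NF; i++) c[i] = $i; have = 1; next }
    # Second row of a pair: interleave saved and current values.
            { line = $1 sep $2
              for (i = 3; i <= NF; i++) line = line sep c[i] sep $i
              print line; have = 0 }' "$@"
}

tsv=$(printf '%s\n' 'samples pops c1 c2' '' 'A 15 1 3' 'A 15 2 4' | pair_rows_tsv)
printf '%s\n' "$tsv"
```

Building the line as a string avoids having to repeat the separator in every printf format.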
When there are many condition columns, you need to use arrays and loops to capture and print the data, like this:
awk 'NR == 1 { printf "%s %s", $1, $2
               for (i = 3; i <= NF; i++) printf " %s %s", $i, $i
               print ""
               next
             }
     NR == 2 { next }
     NR % 2 == 1 { for (i = 3; i <= NF; i++) c[i] = $i }
     NR % 2 == 0 { printf "%s %d", $1, $2
                   for (i = 3; i <= NF; i++) printf " %d %d", c[i], $i
                   print ""
                 }' "$@"
When run on the same data as before, it produces the same output as before, but the loops would allow it to read 1000 conditions per input line and generate 2000 conditions per output line. The only possible issue is whether your version of Awk handles such long input lines in the first place. If need be, upgrade to GNU Awk.
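Since both scripts take 'the data is perfect' on trust, it may be worth checking the pairing before running them on the real file. The following is a minimal sketch, not part of the scripts above. The function name check_pairs is made up, and it assumes the header is on line 1, blank lines can be ignored, and data rows are meant to come in consecutive pairs with the same sample name in field 1.

```shell
# Hypothetical helper: report data rows that do not pair up by sample name.
check_pairs() {
  awk 'NR == 1 || NF == 0 { next }       # ignore the header and blank lines
       { n++ }                           # count data rows
       n % 2 == 1 { first = $1; next }   # first row of a pair: remember the sample
       $1 != first { printf "line %d: %s does not match %s\n", NR, $1, first
                     bad = 1 }
       END { if (n % 2) {
               print "odd number of data rows"
               bad = 1
             }
             exit bad }' "$@"
}

# Sample B has no partner row, and C appears where the second B was expected.
status=0
report=$(printf '%s\n' 'samples pops c1 c2' 'A 15 1 3' 'A 15 2 4' 'B 15 2 1' 'C 15 2 1' |
         check_pairs) || status=$?
echo "$report (exit status $status)"
```

A non-zero exit status makes it easy to stop a pipeline before the reshaping step if the input is not paired as expected.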
Upvotes: 2
Reputation: 67507
awk to the rescue!
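awk '# Key for the current line: sample name plus pops.
     {k=$1 FS $2}
     # Treat the header as its own partner so each column name is doubled.
     NR==1 {p0=$0; pk=k}
     # Same key as the saved line: interleave saved and current values, then print.
     pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
     # New key: save this line and wait for its partner.
     pk!=k {p0=$0; pk=$1 FS $2}' file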
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
This works for any number of columns and records, as long as the records are all well-formed (same number of columns) and grouped (rows with the same key are adjacent).
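If the rows are not already grouped, one possible preprocessing step is to sort the data rows by key while leaving the header in place. This is a sketch under assumptions: the function name sort_keep_header is made up, the input arrives on standard input with the header on line 1, and your sort supports the -s (stable) option, as GNU and BSD sort do, so that rows with the same key keep their original order.

```shell
# Hypothetical helper: keep the header first, drop blank lines,
# and stable-sort the remaining rows by fields 1 and 2.
sort_keep_header() {
  IFS= read -r header || return 1
  printf '%s\n' "$header"
  grep -v '^[[:space:]]*$' | sort -s -k1,1 -k2,2
}

sorted=$(printf '%s\n' 'samples pops c1 c2' 'B 15 1 2' 'A 15 3 4' 'B 15 5 6' 'A 15 7 8' |
         sort_keep_header)
echo "$sorted"
```

The sorted output can then be piped into the awk command above by reading standard input instead of file.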
Upvotes: 1