Silvia Justi

Reputation: 107

Making every other row into a new column

So, I have an output that looks like this:

samples  pops  condition 1  condition 2  condition 3

A10051   15    1            3            4
A10051   15    2            4            4
A10052   15    2            1            4
A10052   15    2            1            4

However, for the next analysis I need the input to look like this:

samples  pops  condition 1  condition 1  condition 2  condition 2  condition 3  condition 3

A10051   15    1            2            3            4            4            4
A10052   15    2            2            1            1            4            4

So it is not just a matter of turning every other row into a new column. Within each condition column, the value from a sample's second row has to move into a new column under that same condition, so that each sample ends up with two columns per condition instead of two rows per sample. The example has 2 samples and 3 conditions, but my real data has over 100 samples and over 1000 conditions. Any thoughts? I am confident it can be done with awk, but I just cannot figure it out.

Upvotes: 2

Views: 220

Answers (3)

randomir

Reputation: 18697

A simple solution (no output headers) with GNU datamash (which is a nice tool for "command-line statistical operations" on textual files):

$ grep -v ^$ file | datamash -W -g1 --header-in first 2 collapse 3-5 | tr ',' ' ' | column -t
A10051  15  1  2  3  4  4  4
A10052  15  2  2  1  1  4  4

First, grep drops the blank lines. Then datamash splits fields on runs of whitespace (-W), treats the first line as a header and leaves it out of the output (--header-in), groups lines by the first field (-g1), keeps the first value of field 2 in each group (first 2), and collapses the rows of each group for fields 3, 4 and 5 (collapse 3-5). Collapsed values are comma-separated, which is why tr is needed to split them apart.

For a different number of columns, just adapt the range of the collapse operation (e.g. collapse 3-1000). And because the rows are grouped by sample, any number of rows per sample is already supported.
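Rather than hard-coding the upper bound of the range, it could be derived from the header. The following is only a sketch: the temporary file and its contents are a made-up stand-in for the real data, the header names are assumed to contain no whitespace (condition_1 rather than condition 1), and the pipeline itself only runs if datamash is installed. Here tr turns the commas into tabs, so the output stays tab-separated without needing column.

```shell
# Hypothetical sample file standing in for the real data.
file=$(mktemp)
printf '%s\n' \
  'samples pops condition_1 condition_2 condition_3' \
  '' \
  'A10051 15 1 3 4' \
  'A10051 15 2 4 4' \
  'A10052 15 2 1 4' \
  'A10052 15 2 1 4' > "$file"

# Field count of the first non-blank line (the header) gives the last column.
last=$(awk 'NF { print NF; exit }' "$file")
range="3-$last"
echo "collapse range: $range"

# Same pipeline as above, with the range filled in; skipped without datamash.
if command -v datamash >/dev/null 2>&1; then
  grep -v '^$' "$file" |
    datamash -W -g1 --header-in first 2 collapse "$range" |
    tr ',' '\t'
fi

rm -f "$file"
```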

Upvotes: 2

Jonathan Leffler

Reputation: 754110

3 condition columns

Taking the assertion 'the data is perfect' at face value and disregarding years of experience which indicates that data is seldom if ever perfect, then:

awk 'NR == 1 { printf "%s  %s  %s  %s  %s  %s  %s  %s\n",
                      $1, $2, $3, $3, $4, $4, $5, $5; next }
     NR == 2 { next }
     NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
     NR % 2 == 0 { printf "%s  %d  %d  %d  %d  %d  %d  %d\n",
                          $1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"

Given the input file:

samples pops    condition_1     condition_2 condition_3

A10051  15  1   3   4
A10051  15  2   4   4
A10052  15  2   1   4
A10052  15  2   1   4

the script produces the output:

samples  pops  condition_1  condition_1  condition_2  condition_2  condition_3  condition_3
A10051  15  1  2  3  4  4  4
A10052  15  2  2  1  1  4  4

This code is more mechanical than interesting. If you have 10 columns in each line, you'd approach it differently. You'd probably use loops to save and print the data. If you want a blank line between the headings and the data, you can easily add one (NR == 2 { print; next } or use \n\n in place of \n in the first printf function). You can arrange for the output fields to be separated by tabs if you wish (they're separated by double spaces in this code).

The code does not depend on tabs separating the data fields; it only depends on there being no white space within a field.

Many condition columns

When there are many condition columns, you need to use arrays and loops to capture and print the data, like this:

awk 'NR == 1 { printf "%s  %s", $1, $2
               for (i = 3; i <= NF; i++) printf "  %s  %s", $i, $i
               print ""
               next
             }
     NR == 2 { next }
     NR % 2 == 1 { for (i = 3; i <= NF; i++) c[i] = $i }
     NR % 2 == 0 { printf "%s  %d", $1, $2;
                   for (i = 3; i <= NF; i++) printf "  %d  %d", c[i], $i
                   print ""
                 }' "$@"

When run on the same data as before, it produces the same output as before, but the loops would allow it to read 1000 conditions per input line and generate 2000 conditions per output line. The only possible issue is whether your version of Awk handles such long input lines in the first place. If need be, upgrade to GNU Awk.
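If the data turns out not to be perfect, a more defensive variant may help. The sketch below is not the script above but a reworking of it under a few assumptions: blank lines may appear anywhere and are ignored, the first non-blank line is the header, and a row only counts as the second of a pair when its sample ID and field count match the row before it. The function name pair_rows is made up for the example, and values are printed with %s rather than %d so that non-integer values pass through unchanged.

```shell
pair_rows() {
  awk '
    !NF { next }                                # ignore blank lines anywhere
    !seen_header++ {                            # first non-blank line: header
      printf "%s  %s", $1, $2
      for (i = 3; i <= NF; i++) printf "  %s  %s", $i, $i
      print ""
      next
    }
    have_first && $1 == id && NF == nf {        # matching second row: merge
      printf "%s  %s", $1, $2
      for (i = 3; i <= NF; i++) printf "  %s  %s", c[i], $i
      print ""
      have_first = 0
      next
    }
    {                                           # otherwise: start a new pair
      if (have_first) {                         # previous row had no partner
        printf "unpaired row for %s before line %d\n", id, NR > "/dev/stderr"
        bad = 1
      }
      id = $1; nf = NF
      for (i = 3; i <= NF; i++) c[i] = $i
      have_first = 1
    }
    END {
      if (have_first) {
        printf "unpaired final row for %s\n", id > "/dev/stderr"
        bad = 1
      }
      exit (bad ? 1 : 0)                        # non-zero status on any problem
    }' "$@"
}

# Usage with the sample data from above, read from standard input.
out=$(printf '%s\n' \
  'samples pops condition_1 condition_2 condition_3' \
  '' \
  'A10051 15 1 3 4' \
  'A10051 15 2 4 4' \
  'A10052 15 2 1 4' \
  'A10052 15 2 1 4' | pair_rows)
printf '%s\n' "$out"
```

An unpaired row is reported on standard error and skipped, and the exit status becomes non-zero, so a calling script can tell that something was dropped.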

Upvotes: 2

karakfa

Reputation: 67507

awk to the rescue!

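awk '{k=$1 FS $2}                  # key: sample and pops
     NR==1 {p0=$0; pk=k}           # first line (header) is paired with itself
     # same key as the saved line: interleave its values and print
     pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
     # new key: save the line until its partner arrives
     pk!=k {p0=$0; pk=$1 FS $2}' file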

samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4

This works for any number of columns and records, as long as the records are all well-formed (same number of columns) and grouped (rows with the same key are adjacent). Each row is paired with the first row of its group, so it assumes two rows per sample.
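If the rows are not grouped, or a sample has more than two rows, one possible alternative is to collect everything in memory and print at the end. This is a sketch under assumptions, not a drop-in replacement: the function name merge_by_key and the "NA" padding for samples with fewer rows than the largest group are choices made for the example, blank lines are skipped, and the first non-blank line is taken to be the header.

```shell
merge_by_key() {
  awk '
    !NF { next }                                  # skip blank lines
    !seen_header++ {                              # first non-blank line: header
      hnf = NF
      for (i = 1; i <= NF; i++) h[i] = $i
      next
    }
    {
      k = $1 FS $2                                # key: sample and pops
      if (!(k in n)) order[++nkeys] = k           # remember first-seen order
      r = ++n[k]                                  # row number within this key
      if (r > maxr) maxr = r
      for (i = 3; i <= NF; i++) v[k, i, r] = $i
    }
    END {
      # Header: each condition name repeated once per row of the largest group.
      line = h[1] FS h[2]
      for (i = 3; i <= hnf; i++)
        for (r = 1; r <= maxr; r++) line = line FS h[i]
      print line
      # Data: for each key, the values of every row, condition by condition.
      for (j = 1; j <= nkeys; j++) {
        k = order[j]
        line = k
        for (i = 3; i <= hnf; i++)
          for (r = 1; r <= maxr; r++)
            line = line FS (((k, i, r) in v) ? v[k, i, r] : "NA")
        print line
      }
    }' "$@"
}

# Usage: the same sample data, but with the rows of the two samples interleaved.
out=$(printf '%s\n' \
  'samples pops condition_1 condition_2 condition_3' \
  'A10051 15 1 3 4' \
  'A10052 15 2 1 4' \
  'A10051 15 2 4 4' \
  'A10052 15 2 1 4' | merge_by_key)
printf '%s\n' "$out"
```

The trade-off is memory: every value is held until the end of input, which should be fine for a few hundred samples with a couple of thousand conditions.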

Upvotes: 1
