bapors
bapors

Reputation: 909

Grouping the data into categories based on a column

I have a tab delimited file which have 2 columns as:

new.txt
    1.01   yes
    2.00   no
    0.93   no
    1.2223 yes
    1.7211 no

I want to modify the contents of it as if there are two categories as:

new_categorized.txt
yes    no
1.01   2.00
1.2223 0.93
       1.7211

I have found a similar question with an answer in R (here) ,however I need to do it with bash or awk.. I would appreciate your help.

Upvotes: 1

Views: 206

Answers (3)

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

GNU awk solution:

awk '{ a[$2][($2=="yes"? ++y : ++n)]=$1 }
     END{ 
         max=(y > n? y:n); 
         print "yes","no";
         for(i=1; i<=max; i++) print a["yes"][i], a["no"][i] 
     }' OFS='\t' file | column -tn

The output:

yes     no
1.01    2.00
1.2223  0.93
        1.7211

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203645

$ cat tst.awk
BEGIN { FS=OFS="\t" }
!($2 in label2colNr) {
    label2colNr[$2] = ++numCols
    colNr2label[numCols] = $2
}
{
    colNr = label2colNr[$2]
    val[++numRows[colNr],colNr] = $1
    maxRows = (numRows[colNr] > maxRows ? numRows[colNr] : maxRows)
}
END {
    for (colNr=1; colNr <= numCols; colNr++) {
        printf "%s%s", colNr2label[colNr], (colNr<numCols ? OFS : ORS)
    }

    for (rowNr=1; rowNr <= maxRows; rowNr++) {
        for (colNr=1; colNr <= numCols; colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
yes     no
1.01    2.00
1.2223  0.93
        1.7211

The above will work with any awk in any shell on any UNIX system no matter how many categories you have in the 2nd field and no matter what their values are.

Upvotes: 3

Cyrus
Cyrus

Reputation: 88646

With bash, GNU grep and paste:

echo -e "yes\tno"
paste <(grep -Po '^\t\K.*(?=\tyes)' new.txt) <(grep -Po '^\t\K.*(?=\tno)' new.txt)

Output:

yes     no
1.01    2.00
1.2223  0.93
        1.7211

Upvotes: 2

Related Questions