A. Leckenby
A. Leckenby

Reputation: 31

Display unique values in each column using awk

I'm relatively new to using awk/grep etc and want to filter some data. I have a large spreadsheet which I want to display the unique values column by column. For example I want to change this:

DS571187    DS571220    DS571200    DS571194  
contig1     contig3     contig4     contig7  
contig2     contig3     contig4     contig7  
contig1     contig4     contig6     contig8  
contig1     contig5     contig6     contig9  
contig2     contig4     contig6     contig9  
contig2         
contig2 

to something that looks like this:

DS571187    DS571220    DS571200    DS571194
contig1     contig3     contig4     contig7
contig2     contig4     contig6     contig8
            contig5                 contig9     

Basically I'm trying to sort each column as its own list and get the unique values this way. Any help would be appreciated.

Amber

Upvotes: 3

Views: 1501

Answers (3)

karakfa
karakfa

Reputation: 67507

a different approach, not necessarily efficient but easier to understand. Last two lines are for pretty printing.

$ function f() { cut -d$'\t' -f$1 file1 | sed '/^$/d' | sort -u; }; 
  paste -d$'\t' <(f 1) <(f 2) <(f 3) <(f 4) | 
  sed 's/\t/ \t/g' | 
  column -ts$'\t'

DS571187   DS571220   DS571200   DS571194
contig1    contig3    contig4    contig7
contig2    contig4    contig6    contig8
           contig5               contig9

Upvotes: 0

James Brown
James Brown

Reputation: 37404

Gawk probably needed, tab expected as delimiter, any one character delimiter works (-F"\t" below):

$ cat > cs.awk
NR==1 {
    nf=NF
    $1=$1
    print
}
NR>1 {
    for(i=1;i<=NF;i++)
        if($i!="")
            a[i][$i]++
}
END {
    for(i=1;i<=nf;i++)
        n[i]=asorti(a[i])
    j=asort(n)
    for(i=1;i<=n[j];i++)
        for(k=1;k<=nf;k++)
            printf "%-8s%s", a[k][i], (k<nf?OFS:ORS)
}
$ awk -F"\t" -f cs.awk cs_by_ed.txt
DS571187 DS571220 DS571200 DS571194
contig1  contig3  contig4  contig7
contig2  contig4           contig8
         contig5           contig9

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203522

Assuming your input file is tab-separated as it appears to be:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (colNr=1;colNr<=NF;colNr++) {
        if (!seen[colNr,$colNr]++) {
            val[++colRowNr[colNr],colNr] = $colNr
            numRows = (colRowNr[colNr] > numRows ? colRowNr[colNr] : numRows)
        }
    }
    numCols = (NF > numCols ? NF : numCols)
}
END {
    for (rowNr=1;rowNr<=numRows;rowNr++) {
        for (colNr=1;colNr<=numCols;colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file | column -s$'\t' -t
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig4   contig6   contig8
          contig5             contig9

The call to column is just to make the alignment look pretty on the site.

If it's not tab-separated then to to this concisely and robustly you need GNU awk for FIELDWIDTHS to identify possibly empty fields in mid-line like this input (which you should test other potential solutions against as later input columns being shorter than earlier ones I expect can happen in your real data and makes this a harder problem to solve):

$ column -s$'\t' -t file
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig3             contig7
contig1   contig4             contig8
          contig5             contig9
                              contig9

$ awk -f tst.awk file | column -s$'\t' -t
DS571187  DS571220  DS571200  DS571194
contig1   contig3   contig4   contig7
contig2   contig4             contig8
          contig5             contig9

Upvotes: 2

Related Questions