Reputation: 31
I'm relatively new to using awk/grep etc and want to filter some data. I have a large spreadsheet which I want to display the unique values column by column. For example I want to change this:
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig3 contig4 contig7
contig1 contig4 contig6 contig8
contig1 contig5 contig6 contig9
contig2 contig4 contig6 contig9
contig2
contig2
to something that looks like this:
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig4 contig6 contig8
contig5 contig9
Basically I'm trying to sort each column as its own list and get the unique values this way. Any help would be appreciated.
Amber
Upvotes: 3
Views: 1501
Reputation: 67507
a different approach, not necessarily efficient but easier to understand. Last two lines are for pretty printing.
$ function f() { cut -d$'\t' -f$1 file1 | sed '/^$/d' | sort -u; };
paste -d$'\t' <(f 1) <(f 2) <(f 3) <(f 4) |
sed 's/\t/ \t/g' |
column -ts$'\t'
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig4 contig6 contig8
contig5 contig9
Upvotes: 0
Reputation: 37404
Gawk probably needed, tab expected as delimiter, any one character delimiter works (-F"\t"
below):
$ cat > cs.awk
NR==1 {
nf=NF
$1=$1
print
}
NR>1 {
for(i=1;i<=NF;i++)
if($i!="")
a[i][$i]++
}
END {
for(i=1;i<=nf;i++)
n[i]=asorti(a[i])
j=asort(n)
for(i=1;i<=n[j];i++)
for(k=1;k<=nf;k++)
printf "%-8s%s", a[k][i], (k<nf?OFS:ORS)
}
$ awk -F"\t" -f cs.awk cs_by_ed.txt
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig4 contig8
contig5 contig9
Upvotes: 0
Reputation: 203522
Assuming your input file is tab-separated as it appears to be:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
for (colNr=1;colNr<=NF;colNr++) {
if (!seen[colNr,$colNr]++) {
val[++colRowNr[colNr],colNr] = $colNr
numRows = (colRowNr[colNr] > numRows ? colRowNr[colNr] : numRows)
}
}
numCols = (NF > numCols ? NF : numCols)
}
END {
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file | column -s$'\t' -t
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig4 contig6 contig8
contig5 contig9
The call to column
is just to make the alignment look pretty on the site.
If it's not tab-separated then to to this concisely and robustly you need GNU awk for FIELDWIDTHS to identify possibly empty fields in mid-line like this input (which you should test other potential solutions against as later input columns being shorter than earlier ones I expect can happen in your real data and makes this a harder problem to solve):
$ column -s$'\t' -t file
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig3 contig7
contig1 contig4 contig8
contig5 contig9
contig9
$ awk -f tst.awk file | column -s$'\t' -t
DS571187 DS571220 DS571200 DS571194
contig1 contig3 contig4 contig7
contig2 contig4 contig8
contig5 contig9
Upvotes: 2