gwarr

Reputation: 51

Remove duplicate columns from a file with bash tools

I have a large file with about 200,000 columns and about 5000 rows. Here is a short example of the file, with columns 1 and 5 duplicated.

Abf Bgj Csd Daa Abf Efg ...  
0   1   2   1   0   1.1   
2   0.1 1.2 0.3 2   1    
...  

Here is an example of the result I need. Column 5 in the original file has been deleted.

Abf Bgj Csd Daa Efg ...  
0   1   2   1   1.1    
2   0.1 1.2 0.3 1      
...  

Some of the columns are duplicated several times. I need to remove the duplicates from the data (keeping the first instance) using bash tools. I can't sort the data because I need to preserve the column order.

Upvotes: 1

Views: 187

Answers (2)

MiniMax

Reputation: 1093

You can use the datamash program:

datamash -W transpose < input.txt | datamash rmdup 1 | datamash transpose

GNU datamash is a command-line program that performs basic numeric, textual and statistical operations on textual input files.

Explanation:

  1. datamash -W transpose < input.txt
    • transpose - swap rows and columns, so the rows become columns and the columns become rows.
    • -W - use whitespace (one or more spaces and/or tabs) as the field delimiter.
  2. datamash rmdup 1 - remove duplicate lines, keyed on the value in the first column, keeping the first occurrence. Because the data is transposed at this point, each line is one of the original columns, so this removes the duplicate columns.
  3. datamash transpose - swap rows and columns back.

input

Abf Bgj Csd Daa Abf Efg
0   1   2   1   0   1.1   
2   0.1 1.2 0.3 2   1

output

Abf Bgj Csd Daa Efg
0   1   2   1   1.1
2   0.1 1.2 0.3 1
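If datamash is not installed, the same transpose / dedupe / transpose idea can be sketched with plain awk. This is a sketch, not part of the answer above: `transpose` is a hypothetical helper defined here, the sample data is inlined, and the middle step uses awk's `!seen[$1]++` idiom in place of `rmdup 1`.

```shell
#!/bin/sh
# Hypothetical helper: transpose whitespace-separated rows and columns.
# Buffers the whole input in memory, like any transpose must.
transpose() {
  awk '{ for (i = 1; i <= NF; i++) a[i, NR] = $i; nf = (NF > nf ? NF : nf) }
       END { for (i = 1; i <= nf; i++)
               for (j = 1; j <= NR; j++)
                 printf "%s%s", a[i, j], (j < NR ? OFS : ORS) }'
}

# transpose, drop lines whose first field was already seen, transpose back
printf 'Abf Bgj Csd Daa Abf Efg\n0 1 2 1 0 1.1\n2 0.1 1.2 0.3 2 1\n' |
  transpose | awk '!seen[$1]++' | transpose
```

Note that both this sketch and the datamash pipeline hold the whole table in memory twice over, which matters at 200,000 columns by 5,000 rows.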

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203532

$ cat tst.awk
NR==1 {                            # header row: find each name's first position
    for (i=1;i<=NF;i++) {
        if (!seen[$i]++) {         # first time this column name appears
            f[++nf]=i              # remember its field number
        }
    }
}
{
    for (i=1;i<=nf;i++) {          # every row, header included: print kept fields
        printf "%s%s", $(f[i]), (i<nf?OFS:ORS)
    }
}

$ awk -f tst.awk file | column -t
Abf  Bgj  Csd  Daa  Efg
0    1    2    1    1.1
2    0.1  1.2  0.3  1
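The same script can also be run inline, with no separate file; the sample data is inlined here via printf for illustration:

```shell
# Inline form of tst.awk: on the header row, f[] collects the field number of
# each name's first occurrence; every row then prints only those fields.
printf 'Abf Bgj Csd Daa Abf Efg\n0 1 2 1 0 1.1\n2 0.1 1.2 0.3 2 1\n' |
  awk 'NR==1 { for (i=1; i<=NF; i++) if (!seen[$i]++) f[++nf]=i }
       { for (i=1; i<=nf; i++) printf "%s%s", $(f[i]), (i<nf ? OFS : ORS) }'
```

This makes a single pass over the file and only ever holds one row in memory, which is why it scales well to very wide input.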

Upvotes: 5
