Reputation: 51
I have a large file with about 200,000 columns and about 5000 rows. Here is a short example of the file, where column 5 is a duplicate of column 1.
Abf Bgj Csd Daa Abf Efg ...
0 1 2 1 0 1.1
2 0.1 1.2 0.3 2 1
...
Here is an example of the result I need. Column 5 in the original file has been deleted.
Abf Bgj Csd Daa Efg ...
0 1 2 1 1.1
2 0.1 1.2 0.3 1
...
Some of the columns are duplicated several times. I need to remove the duplicates from the data (keeping the first instance) using bash tools. I can't sort the data because I need to keep the column order.
Upvotes: 1
Views: 187
Reputation: 1093
You can use the datamash program:
datamash -W transpose < input.txt | datamash rmdup 1 | datamash transpose
GNU datamash is a command-line program which performs basic numeric, textual, and statistical operations on textual input data files.
Explanation:
datamash -W transpose < input.txt - swap rows and columns of the input, treating runs of whitespace as the field delimiter
datamash rmdup 1 - remove duplicate lines by the value of the first column
datamash transpose - swap rows and columns back
input
Abf Bgj Csd Daa Abf Efg
0 1 2 1 0 1.1
2 0.1 1.2 0.3 2 1
output
Abf Bgj Csd Daa Efg
0 1 2 1 1.1
2 0.1 1.2 0.3 1
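To see why rmdup 1 does the right thing here, note that the first transpose turns each column into a row, so the intermediate data looks roughly like this (datamash separates output fields with tabs by default; the spacing below is just for readability):
Abf 0 2
Bgj 1 0.1
Csd 2 1.2
Daa 1 0.3
Abf 0 2
Efg 1.1 1
rmdup 1 then drops the second Abf line because its first field has already been seen, and the final transpose restores the original row/column orientation, giving the output above.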
Upvotes: 0
Reputation: 203532
$ cat tst.awk
NR==1 {
    # On the header line, record the field number of the first
    # occurrence of each column name, in order of appearance.
    for (i=1;i<=NF;i++) {
        if (!seen[$i]++) {
            f[++nf]=i
        }
    }
}
{
    # For every line (header included), print only the retained fields,
    # separated by OFS and terminated by ORS.
    for (i=1;i<=nf;i++) {
        printf "%s%s", $(f[i]), (i<nf?OFS:ORS)
    }
}
$ awk -f tst.awk file | column -t
Abf Bgj Csd Daa Efg
0 1 2 1 1.1
2 0.1 1.2 0.3 1
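If you prefer not to keep a separate script file, the same logic can be run as a one-liner (a sketch equivalent to tst.awk above; "file" stands for your input file name):
awk 'NR==1{for(i=1;i<=NF;i++) if(!seen[$i]++) f[++nf]=i} {for(i=1;i<=nf;i++) printf "%s%s", $(f[i]), (i<nf?OFS:ORS)}' file | column -t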
Upvotes: 5