Reputation: 83
I am working on a large csv file (millions of rows and 80 thousand columns). I want to extract and save in a new file all rows and only those columns that are listed in an external text file. For instance:
id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10
sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB
sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB
sampl3,AA,BB,AB,BB,BB,AA,AA,BB,BB,BB
sampl4,AA,BB,AA,BB,AB,AA,BB,BB,BB,BB
sampl5,AA,BB,AB,BB,AB,AA,AA,BB,BB,BB
sampl6,AA,BB,AB,BB,BB,AA,AB,BB,BB,BB
sampl7,AA,BB,BB,AB,AB,AA,AB,BB,BB,BB
snp3
snp6
snp7
snp10
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
Is there an efficient approach to do that using awk?
Upvotes: 0
Views: 107
Reputation: 204548
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
list["id"]
list[$0]
next
}
FNR==1 {
for (i=1; i<=NF; i++) {
if ($i in list) {
f[++nf] = i
}
}
}
{
for (i=1; i<=nf; i++) {
printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
}
}
$ awk -f tst.awk list file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
Upvotes: 0
Reputation: 991
I would recommand using csvkit. Csvkit it build for that job, and work properly if some of data are strings in double quote contaning ','.
Install :
sudo apt python3-csvkit
Use
csvcut source.csv -c $(cat cols.txt | tr '\n' ',' | sed 's/,$//')
The -c option take the names of the columns, tr is used to replace the character '\n' by a ','. And since, we don't want our arguments to finish by a ',' we use sed to remove it.
Upvotes: 1
Reputation: 67567
a non-awk solution
$ cut -d, -f1,$(grep -Ff columns <(sed 1q file | tr ',' '\n' | nl -w1) | cut -f1 | paste -sd,) file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
or
awk
to the rescue!
$ awk 'NR==FNR {cols[$1]; next}
FNR==1 {for(i=2;i<=NF;i++) if($i in cols) colin[i]}
{line=$1;
for(i=1;i<=NF;i++) if(i in colin) line=line FS $i;
print line}' columns FS=, file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
Upvotes: 2