July
July

Reputation: 83

Unix: filtering CSV columns according to external file

I am working on a large csv file (millions of rows and 80 thousand columns). I want to extract and save in a new file all rows and only those columns that are listed in an external text file. For instance:

Source data file

id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10
sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB
sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB
sampl3,AA,BB,AB,BB,BB,AA,AA,BB,BB,BB
sampl4,AA,BB,AA,BB,AB,AA,BB,BB,BB,BB
sampl5,AA,BB,AB,BB,AB,AA,AA,BB,BB,BB
sampl6,AA,BB,AB,BB,BB,AA,AB,BB,BB,BB
sampl7,AA,BB,BB,AB,AB,AA,AB,BB,BB,BB

External file with list of columns to keep-

snp3
snp6
snp7
snp10

Resulting (new) file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

Is there an efficient approach to do that using awk?

Upvotes: 0

Views: 107

Answers (3)

Ed Morton
Ed Morton

Reputation: 204548

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    list["id"]
    list[$0]
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in list) {
            f[++nf] = i
        }
    }
}
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}

$ awk -f tst.awk list file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

Upvotes: 0

Olivier Lasne
Olivier Lasne

Reputation: 991

I would recommand using csvkit. Csvkit it build for that job, and work properly if some of data are strings in double quote contaning ','.

Install :

sudo apt python3-csvkit

Use

 csvcut source.csv -c  $(cat cols.txt | tr '\n' ',' | sed 's/,$//')

The -c option take the names of the columns, tr is used to replace the character '\n' by a ','. And since, we don't want our arguments to finish by a ',' we use sed to remove it.

Upvotes: 1

karakfa
karakfa

Reputation: 67567

a non-awk solution

$ cut -d, -f1,$(grep -Ff columns <(sed 1q file | tr ',' '\n' | nl -w1) | cut -f1 | paste -sd,) file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

or

awk to the rescue!

$ awk 'NR==FNR {cols[$1]; next}
       FNR==1  {for(i=2;i<=NF;i++) if($i in cols) colin[i]}
               {line=$1;
                for(i=1;i<=NF;i++) if(i in colin) line=line FS $i; 
                print line}' columns FS=, file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

Upvotes: 2

Related Questions