Lena

Reputation: 87

How to extract 45,000 columns from a space-separated file with 500,000 columns, according to a file listing column names, using awk

a.txt contains 500,000 columns and 2,000 rows. The example below shows only the first 9 columns of this file. The first row is a header.

chromosome SNPID rsid position alleleA alleleB 2409086 3514581 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0 0

b.txt contains 45,000 lines, one column name per line. I want to extract the columns from a.txt that are named in b.txt.

chromosome
SNPID
rsid
position
alleleA
alleleB
2409086
3635346

c.txt is my expected outcome: a space-separated table with 45,000 columns and 2,000 rows.

chromosome SNPID rsid position alleleA alleleB 2409086 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0

I tried to use cut to solve this problem, but it fails with "Argument list too long" (since I need to extract 45,000 columns). I know awk may be able to solve this, but I am not familiar with awk and did not find any existing answer about it. Does anybody have a solution?

cut -f 1,$(
    head -n1 a.txt |
    tr ' ' '\n' |
    grep -nf b.txt |
    sed ':a;$!N;s/:[^\n]*\n/,/;ta;s/:.*//'
) a.txt > c.txt
-bash: /usr/bin/cut: Argument list too long

Update: Thank you guys for the great solutions.

Upvotes: 6

Views: 179

Answers (5)

Fravadona

Reputation: 17178

You can use Miller (available here for many OSs) for this task.

Unfortunately, b.txt is not in the format that Miller expects, so you'll first have to transpose it to a single-line CSV, for example with:

edit: thanks @glennjackman for the simpler solution:

paste -s -d ',' b.txt > b.CSV

Now you just have to pass it to Miller as "template" file:

mlr --csvlite --ifs=' ' template -t b.CSV a.txt

remark: I tried to provide the template file as a process substitution <(...) instead of a regular file but it doesn't work.
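
To make the "single-line CSV" step concrete, here is the transposition alone, run on the first few names from the question's b.txt (the abbreviated sample data here is my own):

```shell
# first three names from b.txt, one per line
printf 'chromosome\nSNPID\nrsid\n' > b.txt

# join all lines into one comma-separated header line
paste -s -d ',' b.txt > b.CSV
cat b.CSV    # chromosome,SNPID,rsid
```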

Upvotes: 1

RARE Kpop Manifesto

Reputation: 2855

you don't need loops or python or anything messy :

As a demo, I created a random file and selected the column names ending with lowercase letters. The custom awk command is generated on the fly by reading only the column-name file, so each of the two input files needs just a single pass.

 echo
 gdate
 echo

 f='testtest_colnames.txt'

 gcat -n "${f}"

 ___="$( mawk '  BEGIN   { printf("{ print ") 
              }  END     { printf(" }") 
              } /[a-z]$/ { printf("%.*s $%.u", !!_++, ",", NR) }' "${f}" )"

  gawk -p- -be "${___}" /dev/null
Tue Dec  6 06:26:17 EST 2022


     1  QRSTUV
     2  ABCDEF
     3  OPQRST

     4  cdefgh
     5  IJKLMN
     6  ghijkl

     7  efghij
     8  uvwxyz
     9  qrstuv

    10  wxyzAB
    11  stuvwx
    12  wxyz

    13  KLMNOP
    14  yzABCD
    15  ijklmn

    16  MNOPQR
    17  abcdef
    18  UVWXYZ

    19  YZabcd
    20  opqrst
    21  klmnop

    22  mnopqr
    23  STUVWX
    24  GHIJKL

    25  EFGHIJ
    26  WXYZab
    27  CDEFGH
    # gawk profile, created Tue Dec  6 06:26:17 2022

    # Rule(s)

    {
        print $4, $6, $7, $8, $9, $11, $12, $15, $17, $19, $20, $21, $22, $26
    }

Upvotes: 0

Daweo

Reputation: 36590

I suggest trying GNU datamash if you are allowed to install and use tools other than awk, though I am not sure how it would deal with such a big file with so many columns. It has a cut operation, named after the cut command, but unlike cut it understands headers and keeps the requested order. As a simple example, let file.txt contain

alleleA alleleB alleleC
A C G
T A C
G T A

then

datamash --field-separator=' ' --headers cut "alleleC,alleleA" < file.txt

gives output

cut(alleleC) cut(alleleA)
G A
C T
A G

Observe that the order you provided was applied, and that cut appears in the output headers because that was the action performed. If this is not acceptable, you might remove cut( and ) using e.g. sed, as long as there are no brackets in the column names.
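
One way to do that header cleanup (a sed sketch of mine, assuming no parentheses appear in the column names) — here I pipe the sample output through sed directly so the rewrite is easy to see:

```shell
# restrict the substitution to the header line (line 1) and strip each
# cut(...) wrapper, keeping only the captured column name
printf 'cut(alleleC) cut(alleleA)\nG A\nC T\nA G\n' |
  sed '1s/cut(\([^)]*\))/\1/g'
```

which prints the same table with a plain `alleleC alleleA` header.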

(tested in GNU datamash 1.7)

Upvotes: 0

Ed Morton

Reputation: 203899

Using any awk:

$ cat tst.awk
NR == FNR {
    out2tag[++numOutFlds] = $1
    next
}
FNR==1 {
    for ( inFldNr=1; inFldNr<=NF; inFldNr++ ) {
        tag2in[$inFldNr] = inFldNr
    }
}
{
    for ( outFldNr=1; outFldNr<=numOutFlds; outFldNr++ ) {
        tag     = out2tag[outFldNr]
        inFldNr = tag2in[tag]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}

$ awk -f tst.awk b.txt a.txt
chromosome SNPID rsid position alleleA alleleB 2409086 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0

Three things to note with this approach:

  1. It only loops as many times per input line as the number of fields you want to output, 45,000 in this case (as opposed to looping through all 500,000 of the input fields and discarding 455,000 of them). This would be a noticeable performance improvement over @GlennJackman's first solution but the same as their 2nd solution.
  2. The output fields will be printed in the same order as the lines in b.txt, so you don't have to produce output in the same order as the input fields in a.txt. This is the main functional difference between the above and @GlennJackman's 2nd solution; theirs will also be very slightly faster as mine requires 1 extra hash lookup per field.
  3. It won't produce an extra blank char at the end of each output line (a common issue with such solutions).
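
One thing the script assumes is that every name in b.txt actually exists in a.txt's header; otherwise tag2in[tag] is empty, and $inFldNr degrades to $0 (an empty subscript coerces to 0). A small fail-fast guard (my addition, not part of the answer; guard.awk is my file name) can catch that, shown here on abbreviated sample data:

```shell
# sample inputs, abbreviated from the question
printf 'chromosome SNPID rsid\n1 1:55487346_C_G rs12117661\n' > a.txt
printf 'rsid\nchromosome\n' > b.txt

cat > guard.awk <<'EOF'
NR == FNR { out2tag[++numOutFlds] = $1; next }
FNR == 1 {
    for ( inFldNr=1; inFldNr<=NF; inFldNr++ ) tag2in[$inFldNr] = inFldNr
    # fail fast on any requested name missing from the header
    for ( outFldNr=1; outFldNr<=numOutFlds; outFldNr++ )
        if ( !(out2tag[outFldNr] in tag2in) ) {
            printf "missing column: %s\n", out2tag[outFldNr] > "/dev/stderr"
            exit 1
        }
}
{
    for ( outFldNr=1; outFldNr<=numOutFlds; outFldNr++ )
        printf "%s%s", $(tag2in[out2tag[outFldNr]]), (outFldNr<numOutFlds ? OFS : ORS)
}
EOF

awk -f guard.awk b.txt a.txt
```

On these inputs it reorders the columns as requested; with a b.txt naming an unknown column it exits non-zero before printing anything.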

Upvotes: 2

glenn jackman

Reputation: 247002

Suppose this is filter.awk:

NR == FNR { # reading the first file
    wanted[$1] = 1
    next
}

FNR == 1 {
    for (i=1; i<=NF; i++) {
        header[i] = $i
    }
}

{
    for (i=1; i<=NF; i++) {
        if (header[i] in wanted) {
            printf "%s ", $i
        }
    }
    print ""
}

Then, given your sample a.txt and

$ cat b.txt
chromosome
rsid
2409086

we get

$ awk -f filter.awk b.txt a.txt
chromosome rsid 2409086
1 rs12117661 1
1 rs11588151 1
1 rs34232196 1
1 rs4500361 1
1 rs4927191 0.894118
1 rs200159426 0.894118
...

This will be a bit quicker: it doesn't have to iterate over all the columns for each record.

NR == FNR { # reading the first file
    wanted[$1] = 1
    next
}

FNR == 1 {
    n = 0
    for (i=1; i<=NF; i++) {
        if ($i in wanted) {
            cols_to_print[++n] = i
        }
    }
}

{
    for (i=1; i<=n; i++) printf "%s ", $(cols_to_print[i])
    print ""
}

Upvotes: 4
