aan
aan

Reputation: 89

match pattern and print corresponding columns from a file using awk or grep

I have a input file with repetitive headers (below):

A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5

I want to print all columns with same header in one file. e.g for above file there should be three output files; 1 for A1BG with 2 columns; 2nd for A1CF with 2 columns; 3rd for A2ML1 with 1 column. I there any way to do it using one-liners by awk or grep?

I tried following one-liner:

awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt

but this searches the pattern in only one column (1 in this case). I want to look through all the header names and print all the corresponding columns which have A1BG in their header.

Upvotes: 0

Views: 1657

Answers (6)

aan
aan

Reputation: 89

Following worked for me:

code for step1.awk:

NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc" for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " \"\t\" $" i) } } NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn]) exit }

Then run one liner which uses above awk script:

awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt

This outputs tab delimited .txt files having all the columns with same header in one file. (separate files for each type of header)

Thanks Lars Fischer and others.

Cheers

Upvotes: 0

Lars Fischer
Lars Fischer

Reputation: 10149

Since you wrote in one of the comments to my other answer that you have 20000 columns, lets consider a two step approach to ease debugging to find out which of the steps breaks.

step1.awk

  NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
            for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " $" i) } }
  NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn]) 
           exit
        }

Step1 should give us a list of files together with their columns:

> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30

In my test data Mpap_1 is the header in column 1,2,3,5,13,19,25. Lets hope that this first step works with your large set of columns. (To be frank: I dont know if awk can deal with $20000.)

Step 2: lets create one of those famous one liners:

> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print "  print " $2, "> \""  $1 "\""  }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile 

The first part is our step 1, the second part builds on-the-fly a second awk script, with lines like this: print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1". This second awk script is piped to the third part, which read the script from stdin (-f -) and applies the script to your input file.

In case something does not work: watch the output of each part of step2, you can execute the parts from the left up to (but not including) each of the | symbols and see what is going on, e.g.:

  • awk -f step1.awk yourfile
  • awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'

Upvotes: 0

Lars Fischer
Lars Fischer

Reputation: 10149

I cannot help you with a 1-liner but here is a 10-liner for GNU awk:

script.awk

  NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
            for( i=1; i<=NF; i++ ) { f2c[$i] = (i==1)? i : f2c[$i] " " i } }
        { for( n in f2c ) { 
              split( f2c[n], fls, " ")
              tmp = ""
              for( f in fls ) tmp = (f ==1) ? $fls[f] : tmp "\t" $fls[f]
              print tmp > n
          }
        }

Use it like this: awk -f script.awk your_file

In the first action: it determines filenames from the columns in the first record (NR == 1).

In the second action: for each record: for each output file: its columns (as defined in the first record) are collected into tmp and written to the output file.

The use of PROCINFO requires GNU awk, see Ed Mortons comments for alternatives.

Example run and ouput:

> awk -f mpapccfaf.awk mpapccfaf.csv 
> cat A1BG 
A1BG    A1BG
aa      bb
1       2

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 203522

Here y'go, a one-liner as requested:

awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="@ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file

The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.

For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:

$ cat tst.awk
NR==1 {
    for (i=1; i<=NF; i++) {
        names2fldNrs[$i][i]
    }
}
{
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (name in names2fldNrs) {
        c = 0
        for (fldNr in names2fldNrs[name]) {
            printf "%s%s", (c++ ? OFS : ""), $fldNr > name
        }
        print "" > name
    }
}

$ awk -f tst.awk file

$ cat A1BG
A1BG A1BG
aa bb
1 2

$ cat A1CF
A1CF A1CF
cc dd
3 4

$ cat A2ML1
A2ML1
ee

Upvotes: 0

pakistanprogrammerclub
pakistanprogrammerclub

Reputation: 827

awk solution should be pretty fast - output files are tab-delimited and named cols.A1BG cols.A1CF etc

awk '
# fill cols columns map to header and tab map to track tab state per header
NR==1 {
  for(i=1; i<=NF; ++i) {
    cols[i]=$i
    tab[$i]=0
  }
}
{
# reset tab state for every header
  for(h in tab) tab[h]=0
# write tab-delimited column to its cols.header file
  for(i=1; i<=NF; ++i) {
    hdr=cols[i]
    of="cols." hdr
    if(tab[hdr]) {
      printf("\t") >of
    } else
      tab[hdr]=1
    printf("%s", $i) >of
  }
# newline for every header file
  for(h in tab) {
    of="cols." h
    printf("\n") >of
  }
}
'

This is the output from both of my awk solutions:

$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG    A1BG
aa      bb
1       2

==> cols.A1CF <==
A1CF    A1CF
cc      dd
3       4

==> cols.A2ML1 <==
A2ML1
ee
5

Upvotes: 1

pakistanprogrammerclub
pakistanprogrammerclub

Reputation: 827

This awk solution takes the same approach as Lars but uses gawk 4.0 2D arrays

awk '
# fill cols map of header to its list of columns
NR==1 {
  for(i=1; i<=NF; ++i) {
    if(!($i in cols))
      j=0
    cols[$i][j++]=i
  }
}
{
# write tab-delimited columns for each header to its cols.header file
  for(h in cols) {
    of="cols."h
    for(i=0; i < length(cols[h]); ++i) {
      if(i > 0) printf("\t") >of
      printf("%s", $cols[h][i]) >of
    }
    printf("\n") >of
  }
}
'

Upvotes: 1

Related Questions