Reputation: 89
I have a input file with repetitive headers (below):
A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5
I want to print all columns with same header in one file. e.g for above file there should be three output files; 1 for A1BG with 2 columns; 2nd for A1CF with 2 columns; 3rd for A2ML1 with 1 column. I there any way to do it using one-liners by awk or grep?
I tried following one-liner:
awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt
but this searches the pattern in only one column (1 in this case). I want to look through all the header names and print all the corresponding columns which have A1BG in their header.
Upvotes: 0
Views: 1657
Reputation: 89
Following worked for me:
code for step1.awk:
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc" for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " \"\t\" $" i) } } NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn]) exit }
Then run one liner which uses above awk script:
awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt
This outputs tab delimited .txt files having all the columns with same header in one file. (separate files for each type of header)
Thanks Lars Fischer and others.
Cheers
Upvotes: 0
Reputation: 10149
Since you wrote in one of the comments to my other answer that you have 20000 columns, lets consider a two step approach to ease debugging to find out which of the steps breaks.
step1.awk
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Step1 should give us a list of files together with their columns:
> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30
In my test data Mpap_1 is the header in column 1,2,3,5,13,19,25. Lets hope that this first step works with your large set of columns. (To be frank: I dont know if awk can deal with $20000.)
Step 2: lets create one of those famous one liners:
> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile
The first part is our step 1, the second part builds on-the-fly a second awk script, with lines like this: print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1"
. This second awk script is piped to the third part, which read the script from stdin (-f -
) and applies the script to your input file.
In case something does not work: watch the output of each part of step2, you can execute the parts from the left up to (but not including) each of the |
symbols and see what is going on, e.g.:
awk -f step1.awk yourfile
awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'
Upvotes: 0
Reputation: 10149
I cannot help you with a 1-liner but here is a 10-liner for GNU awk:
script.awk
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (i==1)? i : f2c[$i] " " i } }
{ for( n in f2c ) {
split( f2c[n], fls, " ")
tmp = ""
for( f in fls ) tmp = (f ==1) ? $fls[f] : tmp "\t" $fls[f]
print tmp > n
}
}
Use it like this: awk -f script.awk your_file
In the first action: it determines filenames from the columns in the first record (NR == 1
).
In the second action: for each record: for each output file: its columns (as defined in the first record) are collected into tmp
and written to the output file.
The use of PROCINFO requires GNU awk, see Ed Mortons comments for alternatives.
Example run and ouput:
> awk -f mpapccfaf.awk mpapccfaf.csv
> cat A1BG
A1BG A1BG
aa bb
1 2
Upvotes: 1
Reputation: 203522
Here y'go, a one-liner as requested:
awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="@ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file
The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.
For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:
$ cat tst.awk
NR==1 {
for (i=1; i<=NF; i++) {
names2fldNrs[$i][i]
}
}
{
PROCINFO["sorted_in"] = "@ind_num_asc"
for (name in names2fldNrs) {
c = 0
for (fldNr in names2fldNrs[name]) {
printf "%s%s", (c++ ? OFS : ""), $fldNr > name
}
print "" > name
}
}
$ awk -f tst.awk file
$ cat A1BG
A1BG A1BG
aa bb
1 2
$ cat A1CF
A1CF A1CF
cc dd
3 4
$ cat A2ML1
A2ML1
ee
Upvotes: 0
Reputation: 827
awk solution should be pretty fast - output files are tab-delimited and named cols.A1BG cols.A1CF etc
awk '
# fill cols columns map to header and tab map to track tab state per header
NR==1 {
for(i=1; i<=NF; ++i) {
cols[i]=$i
tab[$i]=0
}
}
{
# reset tab state for every header
for(h in tab) tab[h]=0
# write tab-delimited column to its cols.header file
for(i=1; i<=NF; ++i) {
hdr=cols[i]
of="cols." hdr
if(tab[hdr]) {
printf("\t") >of
} else
tab[hdr]=1
printf("%s", $i) >of
}
# newline for every header file
for(h in tab) {
of="cols." h
printf("\n") >of
}
}
'
This is the output from both of my awk solutions:
$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG A1BG
aa bb
1 2
==> cols.A1CF <==
A1CF A1CF
cc dd
3 4
==> cols.A2ML1 <==
A2ML1
ee
5
Upvotes: 1
Reputation: 827
This awk solution takes the same approach as Lars but uses gawk 4.0 2D arrays
awk '
# fill cols map of header to its list of columns
NR==1 {
for(i=1; i<=NF; ++i) {
if(!($i in cols))
j=0
cols[$i][j++]=i
}
}
{
# write tab-delimited columns for each header to its cols.header file
for(h in cols) {
of="cols."h
for(i=0; i < length(cols[h]); ++i) {
if(i > 0) printf("\t") >of
printf("%s", $cols[h][i]) >of
}
printf("\n") >of
}
}
'
Upvotes: 1