user3631848
user3631848

Reputation: 483

How to split column by matching header?

I'm thinking if there is a way to split the column by matching the header ?

The data looks like this

         ID_1   ID_2   ID_3   ID_6   ID_15
value1   0      2      4      7      6
value2   0      4      4      3      8
value3   2      2      3      7      8

I would like to get the columns only on ID_3 & ID_15

ID_3   ID_15
4      6
4      8
3      8

awk can simply separate it if I know the order of the column However, I have a very huge table and only have a list of ID in hands.
Can I still use awk or there is an easier way in linux ?

Upvotes: 0

Views: 357

Answers (4)

mr.spuratic
mr.spuratic

Reputation: 10107

The input format isn't well defined, but there are a few simple ways, awk, perl and sqlite.

(FNR==1) {
    nocol=split(col,ocols,/,/)    # cols contains named columns
    ncols=split("vals " $0,cols)  # header line
    for (nn=1; nn<=ncols; nn++) colmap[cols[nn]]=nn  # map names

    OFS="\t"                      # to align output
    for (nn=1; nn<=nocol; nn++) printf("%s%s",ocols[nn],OFS)
    printf("\n")                  # output header line 
}
(FNR>1) { # read data
    for (nn=1; nn<=nocol; nn++)  {
        if (nn>1) printf(OFS)     # pad
        if (ocols[nn] in colmap) { printf("%s",$(colmap[ocols[nn]])) }
        else                     { printf "--" } # named column not in data
    }  
    printf("\n") # wrap line
}

$ nawk -f mycols.awk -v col=ID_3,ID_15 data
ID_3    ID_15   
4       6
4       8
3       8

Perl, just a variation on the above with some perl idioms to confuse/entertain:

use strict;
use warnings;

our @ocols=split(/,/,$ENV{cols}); # cols contains named columns
our $nocol=scalar(@ocols);
our ($nn,%colmap);
$,="\t";                          # OFS equiv

# while (<>) {...} implicit with perl -an
if ($. == 1) {  # FNR equiv
    %colmap = map { $F[$_] => $_+1 } 0..$#F ; # create name map hash
    $colmap{vals}=0;                          # name anon 1st col
    print @ocols,"\n";                        # output header
} else {
    for ($nn = 0; $nn < $nocol; $nn++) {
       print "\t" if ($nn>0);
       if (exists($colmap{$ocols[$nn]})) { printf("%s",$F[$colmap{$ocols[$nn]}]) }
       else                              { printf("--") } # named column not in data 
    }
    printf("\n")
}

$ cols="ID_3,ID_15" perl -an mycols.pl < data

That uses an environment variable to skip effort parsing the command line. It needs the perl options -an which set up field-splitting and an input read loop (much like awk does).


And with sqlite (I used v3.11, v3.8 or later is required for useful .import I believe). This uses an in-memory temporary database (name a file if too large for memory, or for a persistent copy of the parsed data), and automatically creates a table based on the first line. The advantages here are that you might not need any scripting at all, and you can perform multiple queries on your data with just one parse overhead.

You can skip this next step if you have a single hard-tab delimiting the columns, in which case replace .mode csv with .mode tab in the sqlite example below. Otherwise, to convert your data to a suitable CSV-ish format:

nawk -v OFS="," '(FNR==1){$0="vals " $0} {$1=$1;print} < data > data.csv

This adds a dummy first column "vals" to the first line, then prints each line as comma-separated, it does this by a seemingly pointless assignment to $1, but this causes $0 to be recomputed replacing FS (space/tab) with OFS (comma).

$ sqlite3
sqlite> .mode csv 
sqlite> .import data.csv mytable
sqlite> .schema mytable
CREATE TABLE mytable(
  "vals" TEXT,
  "ID_1" TEXT,
  "ID_2" TEXT,
  "ID_3" TEXT,
  "ID_6" TEXT,
  "ID_15" TEXT
);
sqlite> select ID_3,ID_15 from mytable;
ID_3,ID_15
4,6
4,8
3,8
sqlite> .mode column
sqlite> select ID_3,ID_15 from mytable;
ID_3        ID_15     
----------  ----------
4           6         
4           8         
3           8         

Use .once or .output to send output to a file (sqlite docs). Use .headers on or .headers off as required. sqlite is quite happy to create an unnamed column, so you don't have to add a name to the first column of the header line, but you do need to make sure the number of columns is the same for all input lines and formats.

If you get "expected X columns but found Y" errors during the .import then you'll need to clean up the data format a little for this.

Upvotes: 1

Tom Fenech
Tom Fenech

Reputation: 74615

You could go for something like this:

BEGIN { 
    keys["ID_3"]
    keys["ID_15"] 
}

NR == 1 { 
    for (i = 1; i <= NF; ++i) 
        if ($i in keys) cols[++n] = i 
}

{ 
    for (i = 1; i <= n; ++i) 
        printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) 
}

Save the script to a file and run it like awk -f script.awk file.

Alternatively, as a "one-liner":

awk 'BEGIN { keys["ID_3"]; keys["ID_15"] } 
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file

Before the file is processed, keys are set in the keys array, corresponding to the column headings of interest.

On the first line, record all the column numbers that contain one of the keys in the cols array.

Loop through each of the cols and print them out, followed by either the output field separator OFS or the output record separator ORS, depending on whether it's the last one. $(cols[i]+(NR>1)) handles the fact that rows after the first have an extra field at the start, because NR>1 will be true (1) for those lines and false (0) for the first line.

Upvotes: 1

sumit gupta
sumit gupta

Reputation: 3

Try below script:

 #!/bin/sh

file="$1"; shift

awk -v cols="$*" '
BEGIN{
split(cols,C)
OFS=FS="\t"
getline
split($0,H)
for(c in C){
    for(h in H){
        if(C[c]==H[h])F[i++]=h
    }
}
}
{ l="";for(f in F){l=l $F[f] OFS}print l }

' "$file"

In command line type:

[sumit.gupta@rpm01 ~]$ test.sh filename ID_3 ID_5

Upvotes: 0

jas
jas

Reputation: 10865

$ cat c.awk
NR == 1 {
    for (i=1; i<=NF; ++i) {
        if ($i == "ID_3") col_3 = (i + 1)
        if ($i == "ID_15") col_15 = (i + 1)
    }
    print "ID_3", "ID_15"
}

NR > 1 { print $col_3, $col_15 }


$ awk -f c.awk c.txt
ID_3 ID_15
4 6
4 8
3 8

Upvotes: 1

Related Questions