jorgehumberto

Reputation: 1097

Extract columns from a tab-separated file

I have a file (data.rdb) with the following format:

col1    col2    col3    col4    col5    col6    col7
aaa1    bbb1    ccc1    ddd1    eee1    fff1    ggg1
aaa2    bbb2    ccc2    ddd2    eee2    fff2    ggg2
aaa3    bbb3    ccc3    ddd3    eee3    fff3    ggg3

Some properties of the data:

I need to extract some of the columns by name using bash, e.g., col1, col3 and col6. The columns to select come from a shell variable defined as COLUMN_LIST=$@, where $@ holds the parameters passed to my shell script; the number and names of the parameters may change each time I call the script.

The script needs to be in bash; it cannot be Python or similar.

Any ideas? I thought about using awk/gawk, but I do not know how to select by column name, and the column order might change from file to file.
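
For reference, the wrapper I have in mind looks roughly like this (a sketch only; the script name is a placeholder, and the extraction itself is the part I am missing):

# usage: ./extract_cols.sh col1 col3 col6
COLUMN_LIST="$*"    # same idea as COLUMN_LIST=$@: one space-separated string of the parameters
echo "columns requested: $COLUMN_LIST"
# ...extraction of $COLUMN_LIST from data.rdb goes here...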

Thanks Jorge

UPDATE

For some reason, none of these solutions works on my real data files (i.e., I get no output at all), so I am posting a subset of one of them:

date    star    jdb texp
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  HD217987 2400000.23551544   900.
2013-11-22  TOI-134  2400000.23551544   900.
2013-11-22  tauCet   2400000.23551544   60. 
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.

In this case, I am interested in the columns star, jdb and texp.

UPDATE 2

I've used @EdMorton's code and this is the result:

date    star    jdb texp    date    star    jdb texp
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  HD217987 2400000.23551544   900.    2013-11-22  HD217987 2400000.23551544   900.
2013-11-22  TOI-134  2400000.23551544   900.    2013-11-22  TOI-134  2400000.23551544   900.
2013-11-22  tauCet   2400000.23551544   60.     2013-11-22  tauCet   2400000.23551544   60. 
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.

UPDATE 3

I ended up using EdMorton's version of awk -- mainly for flexibility in the output -- but with the modification that columns that do not exist in the file are printed as empty fields rather than placeholders:

BEGIN {
    numCols = split(column_list,cols)   # requested column names, space-separated
    OFS="\t"                            # tab-separated output; FS stays at the default,
                                        # so both spaces and tabs split input fields
}
{ sub(/\r$/,"") }                       # strip DOS line endings
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) { # map header names to field numbers
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal  = (colName in f ? $(f[colName]) : "")   # unknown columns become empty fields
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}
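
Something along these lines is how the program above gets its column_list in my script (a sketch; my_cols.awk is just the file I saved the program to):

COLUMN_LIST="star jdb texp"    # in the real script this comes from the parameters, COLUMN_LIST="$*"
awk -v column_list="$COLUMN_LIST" -f my_cols.awk data.rdb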

The main issue I had was that the header line was not tab-separated, so the column breakdown did not work. An easy way to spot tab vs. non-tab characters:

tr $'\t' '#' < data.rdb | head -2

which gave on one of my test files:

date    star    jdb texp
2013-11-22#epsInd#2400000.23551544#100.
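
One way to repair such a file is to retab only the header line, for example by collapsing runs of blanks in the first line into single tabs (a sketch; data_fixed.rdb is a placeholder name):

awk 'NR==1 { gsub(/[ \t]+/, "\t") } 1' data.rdb > data_fixed.rdb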

Upvotes: 2

Views: 1830

Answers (3)

Thor

Reputation: 47249

You could do it with coreutils. Assuming you have a file called cols containing the desired columns, e.g.:

col2
col3
col6

You can extract column numbers like this:

head -n1 infile | tr '\t' '\n' | grep -nf cols | cut -d: -f1 | paste -sd,

Output:

2,3,6

Pass this to cut, e.g.:

cut -f $(head -n1 infile | tr '\t' '\n' | grep -nf cols | cut -d: -f1 | paste -sd,) infile

Output:

col2    col3    col6
bbb1    ccc1    fff1
bbb2    ccc2    fff2
bbb3    ccc3    fff3
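
If the column names come from the script's parameters rather than from a cols file, the same pipeline can read them from a process substitution instead (a sketch; it assumes bash, and grep's -x option is added so that a name like col1 cannot also match col10; the rest is the pipeline above, unchanged):

# "$@" holds the requested column names, e.g. col2 col3 col6
cut -f "$(head -n1 infile | tr '\t' '\n' |
          grep -nxf <(printf '%s\n' "$@") |
          cut -d: -f1 | paste -sd,)" infile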

Upvotes: 0

anubhava

Reputation: 786359

"The column order might change from file to file."

You may use this awk approach: it takes the space-separated header column names as input and maps them to column numbers while processing the first record. Once the required column numbers are known, only those fields are printed for each line, header included.

awk -v cols='star jdb texp' 'BEGIN {
   FS=OFS="\t"
   n = split(cols, a, " ")
   for (i=1; i <= n; i++)          # remember the requested column names
      c[a[i]]
}
{
   sub(/\r$/, "")                  # strip DOS line endings
}
NR == 1 {
   for (i=1; i<=NF; i++)           # note the field numbers of the wanted headers
      if ($i in c)
         hdr[i]
}
{
   for (i=1; i<=NF; i++)           # collect the wanted fields of this line
      if (i in hdr)
         s = sprintf(s "%s%s", OFS, $i)
   sub(OFS, "", s)                 # drop the leading separator
   print s
   s = ""
} ' file | column -t

star      jdb               texp
epsInd    2400000.23551544  100.
epsInd    2400000.23551544  100.
epsInd    2400000.23551544  100.
HD217987  2400000.23551544  900.
TOI-134   2400000.23551544  900.
tauCet    2400000.23551544  60.
BD+01316  2400000.23551544  300.
BD+01316  2400000.23551544  300.
BD+01316  2400000.23551544  300.
BD+01316  2400000.23551544  300.

PS: column -t is added to align the output into a neat table.
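
To take the column names from a calling script's parameters instead of hard-coding them, the -v assignment can be built from "$*" (a sketch; pick_cols.awk is a hypothetical file holding the awk program above, and the data file is assumed to be the first argument):

# usage: ./extract.sh file star jdb texp
file=$1
shift
awk -v cols="$*" -f pick_cols.awk "$file" | column -t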

Upvotes: 3

Ed Morton

Reputation: 204731

The best way to deal with this is to create an array (f[] below) that maps the column header strings (i.e. the field names) to the field numbers while reading the header line, and then simply access the fields by name from then on.

Updated to protect against the caller asking for a column name that doesn't exist and against DOS line endings:

$ cat tst.awk
BEGIN {
    numCols = split(column_list,cols)
    FS=OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal  = (colName in f ? $(f[colName]) : (NR>1 ? "N/A" : colName))
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}

$ awk -v column_list='col1 col3 col6' -f tst.awk data.rdb
col1    col3    col6
aaa1    ccc1    fff1
aaa2    ccc2    fff2
aaa3    ccc3    fff3

$ awk -v column_list='col1 col3 col6 bob' -f tst.awk data.rdb
col1    col3    col6    bob
aaa1    ccc1    fff1    N/A
aaa2    ccc2    fff2    N/A
aaa3    ccc3    fff3    N/A

Note that with the above approach you can also change the order of the columns in the output if you like, rather than only printing them in their original order:

$ awk -v column_list='col5 col2 col4' -f tst.awk data.rdb
col5    col2    col4
eee1    bbb1    ddd1
eee2    bbb2    ddd2
eee3    bbb3    ddd3

Upvotes: 2
