Reputation: 1097
I have a file (data.rdb) with the following format:
col1 col2 col3 col4 col5 col6 col7
aaa1 bbb1 ccc1 ddd1 eee1 fff1 ggg1
aaa2 bbb2 ccc2 ddd2 eee2 fff2 ggg2
aaa3 bbb3 ccc3 ddd3 eee3 fff3 ggg3
Some properties of the data and what I need:
- I need to extract some of the columns by name using bash, e.g., col1, col3 and col6. The columns to select come from a shell variable defined as COLUMN_LIST=$@, where $@ holds the parameters passed to my shell script. The number and names of the parameters might change each time I call the script.
- The script needs to be in bash; it cannot be Python or similar.
Any ideas? I thought about using awk/gawk, but I do not know how to select by column name. The column order might change from file to file.
Thanks Jorge
UPDATE
For some reason, none of these solutions seem to work on my real data files (i.e., I get no output at all), so I am posting a subset of one of those files:
date star jdb texp
2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 HD217987 2400000.23551544 900.
2013-11-22 TOI-134 2400000.23551544 900.
2013-11-22 tauCet 2400000.23551544 60.
2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300.
In this case, I would be interested in columns star, jdb and texp.
UPDATE 2
I've used @EdMorton's code and this is the result:
date star jdb texp date star jdb texp
2013-11-22 epsInd 2400000.23551544 100. 2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100. 2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100. 2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 HD217987 2400000.23551544 900. 2013-11-22 HD217987 2400000.23551544 900.
2013-11-22 TOI-134 2400000.23551544 900. 2013-11-22 TOI-134 2400000.23551544 900.
2013-11-22 tauCet 2400000.23551544 60. 2013-11-22 tauCet 2400000.23551544 60.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
UPDATE 3
I ended up using EdMorton's version of awk -- mainly for the flexibility of the output -- but with a modification so that it does not output wrong columns:
BEGIN {
    numCols = split(column_list,cols)
    OFS = "\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal = (colName in f ? $(f[colName]) : "")
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}
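For completeness, a bash wrapper can feed the script's positional parameters into an awk program of this shape, matching the COLUMN_LIST=$@ requirement from the question. This is only a sketch: the select_cols function name is invented, the program is inlined rather than kept in a separate file, and the data file is assumed to be fully tab-separated.

```shell
# Hypothetical wrapper: select columns of a tab-separated file by name.
# Usage: select_cols FILE COL [COL...]
select_cols() {
    local file=$1; shift
    awk -v column_list="$*" '
        BEGIN { numCols = split(column_list, cols); FS = OFS = "\t" }
        { sub(/\r$/, "") }                               # strip DOS line endings
        NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i }  # map name -> field number
        {
            for (n = 1; n <= numCols; n++) {
                name = cols[n]
                printf "%s%s", (name in f ? $(f[name]) : ""), (n < numCols ? OFS : ORS)
            }
        }' "$file"
}
```

Called as select_cols data.rdb col1 col3 col6 inside a script, it prints those columns tab-separated; "$*" joins the remaining parameters with spaces, which is what split() expects.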
The main issue I had was that the header line was not tab-separated, so the column breakdown did not work. An easy way to spot tab vs. non-tab characters:
tr $'\t' '#' < data.rdb | head -2
which gave on one of my test files:
date star jdb texp
2013-11-22#epsInd#2400000.23551544#100.
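Another way to make the invisible characters explicit (assuming GNU coreutils cat) is cat -A, which renders each tab as ^I and marks the end of every line with $:

```shell
# Demo input: space-separated header line, tab-separated data row.
printf 'date star jdb texp\n2013-11-22\tepsInd\t2400000.23551544\t100.\n' | cat -A
# prints:
#   date star jdb texp$
#   2013-11-22^IepsInd^I2400000.23551544^I100.$
```

Only the data row shows ^I markers, which makes the header/data mismatch obvious at a glance.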
Upvotes: 2
Views: 1830
Reputation: 47249
You could do it with coreutils. Assuming you have a file called cols containing the desired columns, e.g.:
col2
col3
col6
You can extract column numbers like this:
head -n1 infile | tr '\t' '\n' | grep -nf cols | cut -d: -f1 | paste -sd,
Output:
2,3,6
Pass this to cut, e.g.:
cut -f $(head -n1 infile | tr '\t' '\n' | grep -nf cols | cut -d: -f1 | paste -sd,) infile
Output:
col2 col3 col6
bbb1 ccc1 fff1
bbb2 ccc2 fff2
bbb3 ccc3 fff3
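One caveat: grep -nf treats each line of cols as a regular expression matched anywhere in the header cell, so col1 would also match a hypothetical col10 column. Adding -x (whole-line match) and -F (fixed strings) makes the match exact; a self-contained sketch with invented demo files:

```shell
# Build a demo header and column list, then match the names exactly.
printf 'col1\tcol10\tcol2\n' > infile
printf 'col1\ncol2\n' > cols
head -n1 infile | tr '\t' '\n' | grep -nxFf cols | cut -d: -f1 | paste -sd,
# prints 1,3 -- with plain -nf, the col1 pattern would also have matched col10
```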
Upvotes: 0
Reputation: 786359
The column order might change from file to file.
You may use this awk approach: it takes space-separated header column names as input and converts them into column numbers by processing the first record. Once the required column numbers are known, it prints just those fields for every record, including the header.
awk -v cols='col1 col3 col6' 'BEGIN {
    FS = OFS = "\t"
    n = split(cols, a, " ")
    for (i=1; i<=n; i++)
        c[a[i]]
}
{
    sub(/\r$/, "")
}
NR == 1 {
    for (i=1; i<=NF; i++)
        if ($i in c)
            hdr[i]
}
{
    for (i=1; i<=NF; i++)
        if (i in hdr)
            s = sprintf(s "%s%s", OFS, $i)
    sub(OFS, "", s)
    print s
    s = ""
}' file | column -t
star jdb texp
epsInd 2400000.23551544 100.
epsInd 2400000.23551544 100.
epsInd 2400000.23551544 100.
HD217987 2400000.23551544 900.
TOI-134 2400000.23551544 900.
tauCet 2400000.23551544 60.
BD+01316 2400000.23551544 300.
BD+01316 2400000.23551544 300.
BD+01316 2400000.23551544 300.
BD+01316 2400000.23551544 300.
PS: Added column -t to format the output as an aligned table.
Upvotes: 3
Reputation: 204731
The best way to deal with this is to create an array (f[] below) that maps the column header strings (i.e., the field names) to the field numbers when reading the header line, and then just access the fields by name from then on.
Updated to protect against the caller asking for a column name that doesn't exist and against DOS line endings:
$ cat tst.awk
BEGIN {
    numCols = split(column_list,cols)
    FS = OFS = "\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal = (colName in f ? $(f[colName]) : (NR>1 ? "N/A" : colName))
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}
$ awk -v column_list='col1 col3 col6' -f tst.awk data.rdb
col1 col3 col6
aaa1 ccc1 fff1
aaa2 ccc2 fff2
aaa3 ccc3 fff3
$ awk -v column_list='col1 col3 col6 bob' -f tst.awk data.rdb
col1 col3 col6 bob
aaa1 ccc1 fff1 N/A
aaa2 ccc2 fff2 N/A
aaa3 ccc3 fff3 N/A
Note that with the above approach you can also change the order of the columns in the output if you like, not just print them in their original order:
$ awk -v column_list='col5 col2 col4' -f tst.awk data.rdb
col5 col2 col4
eee1 bbb1 ddd1
eee2 bbb2 ddd2
eee3 bbb3 ddd3
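Because the lookup is by name, the same field can even be printed more than once. A self-contained rerun of the idea, with the program inlined and a tiny sample file invented for illustration:

```shell
# Recreate a small tab-separated sample and ask for col3 twice.
printf 'col1\tcol2\tcol3\naaa1\tbbb1\tccc1\n' > data.rdb
awk -v column_list='col3 col3 col1' '
    BEGIN { numCols = split(column_list, cols); FS = OFS = "\t" }
    { sub(/\r$/,"") }
    NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i }
    {
        for (n = 1; n <= numCols; n++) {
            name = cols[n]
            printf "%s%s", (name in f ? $(f[name]) : (NR > 1 ? "N/A" : name)), (n < numCols ? OFS : ORS)
        }
    }' data.rdb
# prints (tab-separated):
#   col3    col3    col1
#   ccc1    ccc1    aaa1
```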
Upvotes: 2