Reputation: 79
I'm trying to extract specific columns from a dataset that has over 200 headers (columns). I'd like each output file to contain the first 5 columns (CHROM ... ALT) plus exactly one of the columns H001 through H231. I've only shown the header line because the data are quite large. Preferably, each output file would be named after its column, for example H001.txt (columns 1 to 5, plus only column H001). I'm new to bash scripting, and am a bit confused by how variables can be used. Thanks!
These are the headers in my file; the data itself can be anything and has been removed for clarity.
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H001 H002 H003 H004 H005 ... H231
The code I've tried will just copy the whole dataset instead of each column, and I'm struggling to find a way to name files after the columns.
#!/bin/bash
headers=$(head -n 1 myData.txt)
for i in $(seq 10 231); do
awk '{ print $1, $2, $3, $4, $5, $i }' FS='\t' myData.txt > "$i".txt
done
My desired output should look like this:
File H001.txt
CHROM POS ID REF ALT H001
File H002.txt
CHROM POS ID REF ALT H002
and so on down for each column to H231.
Upvotes: 0
Views: 320
Reputation: 212238
You want to move your redirections inside awk. For example,
awk '{for(i=10;i<=231;i++) { file=sprintf("H%03d.txt", i-9); print $1, $2, $3, $4, $5, $i >> file; close(file) }}' myData.txt
Column 10 holds H001, so the file number is i-9. Note that if your column count gets too high, you'll run into limits on the number of open files, so I'm closing the file on each iteration. You can probably omit the close(file) and use print ... > file if the column count is sufficiently small (inside awk, > truncates a file only when it is first opened, so later prints to the same open file still append).
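One thing to watch with >> is that it appends, so running the command a second time adds another copy of every row to files left over from the first run. Here is a minimal sketch that clears old outputs first. The file demo.txt and its single data row are made up for illustration, and the loop runs to NF rather than a hard-coded 231 so it works on the small sample:

```shell
#!/bin/bash
# Hypothetical sample: the 9 fixed columns plus two sample columns, one data row.
printf 'CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tH001\tH002\n' > demo.txt
printf '1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n' >> demo.txt

# Remove outputs from an earlier run, because >> would append to them.
rm -f H[0-9][0-9][0-9].txt

# One pass over the file: for every line, write columns 1-5 plus column i
# to the file for that column. Column 10 is H001, hence i - 9.
awk '{
    for (i = 10; i <= NF; i++) {
        file = sprintf("H%03d.txt", i - 9)
        print $1, $2, $3, $4, $5, $i >> file
        close(file)   # avoid hitting the open-file limit
    }
}' demo.txt
```

Because the block runs on every line, including the first, each output file starts with its own header row.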
To use the values in the header line in the filenames, you could do something like:
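awk 'NR==1 { split($0, hdr) }    # remember the column names
{ for (i=10; i<=NF; i++) {       # column 10 is the first H column
    file = hdr[i] ".txt"
    print $1, $2, $3, $4, $5, $i >> file   # runs on the header line too
    close(file)
  }
}' myData.txt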
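If you would rather keep the shell loop from the question, the reason it copied the whole dataset is that $i sits inside single quotes, so the shell never expands it. awk then sees its own uninitialised variable i, and $i becomes $0, the entire line. A minimal sketch of passing the shell value in with -v, assuming a tab-separated file (demo.txt and its contents are made up for illustration):

```shell
#!/bin/bash
# Hypothetical sample: the 9 fixed columns plus two sample columns, one data row.
printf 'CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tH001\tH002\n' > demo.txt
printf '1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n' >> demo.txt

# Count the columns from the header line.
ncols=$(head -n 1 demo.txt | awk -F'\t' '{ print NF }')

for i in $(seq 10 "$ncols"); do
    # Look up the header name of column i to use as the file name.
    name=$(head -n 1 demo.txt | cut -f "$i")
    # -v copies the shell value of i into the awk variable col.
    awk -F'\t' -v OFS='\t' -v col="$i" \
        '{ print $1, $2, $3, $4, $5, $col }' demo.txt > "$name.txt"
done
```

This reads the input once per column, so on a large file the single-pass awk versions will likely be much faster.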
Upvotes: 1