bactro

Reputation: 79

How to extract columns from data and make new named files

I'm trying to extract specific columns from a dataset that has over 200 columns. For each of the sample columns H001 through H231, I'd like a separate file containing the first 5 columns (CHROM ... ALT) plus that one sample column. I've only shown the header line of my file because the data are quite large. Preferably, each output file should be named after its sample column, for example H001.txt (columns 1 to 5, plus column H001). I'm new to bash scripting and a bit confused about how variables can be used. Thanks!

These are the headers in my file; the data rows can be anything and have been removed for clarity.

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  H001    H002    H003    H004    H005  ... H231

The code I've tried just copies the whole dataset into every output file instead of extracting one column per file, and I'm struggling to find a way to name the files after the columns.

#!/bin/bash


headers=$(head -n 1 myData.txt)
for i in $(seq 10 231); do
awk '{ print $1, $2, $3, $4, $5, $i }' FS='\t' myData.txt > "$i".txt
done

My desired output should look like this:

File H001.txt

CHROM  POS     ID      REF     ALT H001

File H002.txt

CHROM  POS     ID      REF     ALT H002

and so on down for each column to H231.

Upvotes: 0

Views: 320

Answers (1)

William Pursell

Reputation: 212238

You want to move your redirections inside awk. For example,

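awk -F'\t' -v OFS='\t' '{for(i=10;i<=NF;i++) { file=sprintf("H%03d.txt", i-9); print $1, $2, $3, $4, $5, $i >> file; close(file) }}' myData.txt

Here i-9 maps column 10 to H001.txt, and looping up to NF covers every remaining column (with the header shown, H231 is column 240, not 231). This assumes the file is tab-separated, as in your attempt. Note that if your column count gets too high, you'll run into limits on the number of files awk can hold open at once, so I'm closing the file on each iteration. Because >> appends, remove any old output files before rerunning. You can probably omit the close(file) and use print ... > file if the column count is sufficiently small.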
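As a rough sketch of that last variant (no close, and > instead of >>), something like the following should work when the number of sample columns stays below the open-file limit (ulimit -n shows the limit for the current shell). The file sample.tsv and its two sample columns are made up here so the snippet runs on its own; the tab separator and the assumption that H001 is column 10 come from the question.

```shell
#!/bin/bash
# Hypothetical two-sample input so the sketch is runnable; use myData.txt instead.
input=sample.tsv
printf 'CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tH001\tH002\n' > "$input"
printf '1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n' >> "$input"

# One pass over the input. Each output file is truncated the first time
# awk opens it and then stays open, so later rows are added to it.
awk -F'\t' -v OFS='\t' '{
    for (i = 10; i <= NF; i++) {
        file = sprintf("H%03d.txt", i - 9)   # column 10 -> H001.txt
        print $1, $2, $3, $4, $5, $i > file
    }
}' "$input"
```

With > there is no need to delete old output files first, since each file is emptied when awk first opens it.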

To use the values in the header line in the filenames, you could do something like:

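awk -F'\t' -v OFS='\t' 'NR==1{ split($0, hdr) }  # remember the column names
    { for(i=10;i<=NF;i++) {                      # header row included, so each file keeps its column names
        file = hdr[i] ".txt"
        print $1, $2, $3, $4, $5, $i >> file
        close(file)
    } }' myData.txt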
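On the question about variables: inside single quotes the shell does not expand $i, so awk sees its own, unset variable i, and $i becomes $0, the whole line. That is why every output file held the full dataset. If you would rather keep the shell loop, a minimal sketch is to pass the column number in with -v and read the file name from the header line. The names col, name, ncols and the file sample.tsv are illustrative only, and the input is assumed to be tab-separated.

```shell
#!/bin/bash
# Hypothetical two-sample input so the sketch is runnable; use myData.txt instead.
input=sample.tsv
printf 'CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tH001\tH002\n' > "$input"
printf '1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n' >> "$input"

# Number of tab-separated columns in the header line.
ncols=$(head -n 1 "$input" | awk -F'\t' '{ print NF }')

for i in $(seq 10 "$ncols"); do
    # Name of column i, taken from the header line (cut splits on tabs by default).
    name=$(head -n 1 "$input" | cut -f "$i")
    # -v copies the shell value of i into the awk variable col.
    awk -F'\t' -v OFS='\t' -v col="$i" \
        '{ print $1, $2, $3, $4, $5, $col }' "$input" > "$name.txt"
done
```

This rereads the input once per column, so on a large file the single-pass awk commands above will likely be much faster.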

Upvotes: 1
