How to split a big TSV file by unique column values and keep the header

Question

I have a TSV file called myfile.tsv. I want to split it by the unique values in the chr column, using awk/gawk/bash or any faster command-line tool, to get chr1.tsv (header + row 1), chr2.tsv (header + rows 2 and 3), chrX.tsv (header + row 4), chrY.tsv (header + rows 5 and 6) and chrM.tsv (header + last row).

myfile.tsv

  chr       value       region
  chr1      223         554
  chr2      433         444
  chr2      443         454
  chrX      445         444
  chrY      445         443
  chrY      435         243
  chrM      543         544

Tom Fenech · Accepted Answer

Here's a little script that does what you're looking for:

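NR == 1 {
    header = $0                 # remember the header line
    next                        # do not treat it as data
}

{
    outfile = $1 ".tsv"         # $1 is already "chr1", "chr2", ... in the sample
    if (!seen[$1]++) {          # true only the first time this value appears
        print header > outfile
    }
    print > outfile
}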

The first row is saved so that it can be used later. Every other line is printed to the file whose name matches the value of its first field. The header is written first if that value hasn't been seen yet.
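To try this out, here is a rough sketch that recreates the sample from the question and runs the same logic inline. It assumes the file really is tab-separated (hence `-F'\t'`) and that the first field already carries the `chr` prefix, as in the sample, so the file name is built as `$1 ".tsv"`. Adjust both if your data differs.

```shell
# Recreate the sample from the question as a real tab-separated file.
printf 'chr\tvalue\tregion\n' > myfile.tsv
printf 'chr1\t223\t554\nchr2\t433\t444\nchr2\t443\t454\n' >> myfile.tsv
printf 'chrX\t445\t444\nchrY\t445\t443\nchrY\t435\t243\nchrM\t543\t544\n' >> myfile.tsv

# Same logic as the script, passed inline. Assumption: $1 already
# holds "chr1", "chr2", ..., so no extra "chr" prefix is added.
awk -F'\t' '
NR == 1 { header = $0; next }
{
    outfile = $1 ".tsv"
    if (!seen[$1]++) print header > outfile
    print > outfile
}' myfile.tsv

# Inspect one of the results.
cat chr2.tsv
```

chr2.tsv should then contain the header line followed by the two chr2 rows, and the other four files are built the same way.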

NR is the record number, so NR == 1 is only true when the record number is one (i.e. the first line). In this block, the whole line $0 is saved to the variable header. next skips any other blocks and moves to the next line. This means that the second block (which would otherwise be run unconditionally on every record) is skipped.
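If the role of next isn't obvious, a small experiment may help. This is only an illustration on a two-line sample (the variable names `sample`, `without_next` and `with_next` are made up for the demo): it prints the record number and first field of every line that reaches the second block.

```shell
sample=$(printf 'chr\tvalue\tregion\nchr1\t223\t554\n')

# Without "next": the header line falls through to the second block too.
without_next=$(printf '%s\n' "$sample" |
    awk 'NR == 1 { header = $0 } { print NR, $1 }')

# With "next": the second block only ever sees the data lines.
with_next=$(printf '%s\n' "$sample" |
    awk 'NR == 1 { header = $0; next } { print NR, $1 }')

printf 'without next:\n%s\nwith next:\n%s\n' "$without_next" "$with_next"
```

Without next, the header itself would be treated as a data row, and the splitting script would end up writing a file named after the header's first field.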

For every other line in the file, the output filename is built using the value of the first field. The array seen keeps track of the values of $1. !seen[$1]++ is only true the first time a given value of $1 is seen, because seen[$1] starts out empty (which counts as zero) and is incremented every time it is checked. If the value of $1 has not yet been seen, the header is written to the output file.
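The !seen[$1]++ idiom can be watched in isolation. The snippet below is just a sketch on made-up input: it labels each line as the first occurrence or a repeat and prints the counter after the increment (the variable names `out` and `first` are only for the demo).

```shell
out=$(printf 'chr2\nchr2\nchrY\nchr2\nchrY\n' |
awk '{
    first = !seen[$1]++      # 1 only on the first occurrence of this key
    print $1, (first ? "first" : "repeat"), seen[$1]
}')

printf '%s\n' "$out"
```

The post-increment returns the old count, so the negation is true only while that old count is still zero.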

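Finally, every data line is written to its output file. Note that in awk, unlike in the shell, > only truncates a file the first time it is opened; later prints to the same filename append to it, so the header is not overwritten.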
