Reputation: 5719
I have a TSV file called myfile.tsv. I want to split it by the unique values in the chr column, using awk/gawk/bash or any faster command-line tool, to get chr1.tsv (header + row 1), chr2.tsv (header + rows 2 and 3), chrX.tsv (header + row 4), chrY.tsv (header + rows 5 and 6) and chrM.tsv (header + last row).
myfile.tsv
chr value region
chr1 223 554
chr2 433 444
chr2 443 454
chrX 445 444
chrY 445 443
chrY 435 243
chrM 543 544
Upvotes: 1
Views: 1261
Reputation: 74655
Here's a little script that does what you're looking for:
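# Save the header line so it can be written to each output file.
NR == 1 {
    header = $0
    next
}
{
    # $1 already holds the full name (e.g. chr1), so only the extension is added.
    outfile = $1 ".tsv"
    if (!seen[$1]++) {
        print header > outfile
    }
    print > outfile
}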
The first row is saved so that it can be reused as the header of every output file. Each remaining line is written to the file named after the value of its first field, and the header is written first if that value hasn't been seen yet.
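To try this end to end, the following is a minimal sketch that rebuilds the sample input from the question and runs the same logic inline. The scratch directory and the tab-separated layout are assumptions made for the demo; the output name is built from $1 alone because the sample values already start with chr.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input (printf reuses the format for each group of three values).
printf '%s\t%s\t%s\n' \
    chr value region \
    chr1 223 554 \
    chr2 433 444 \
    chr2 443 454 \
    chrX 445 444 \
    chrY 445 443 \
    chrY 435 243 \
    chrM 543 544 > myfile.tsv

# Split by the first column, writing the header once per output file.
awk '
NR == 1 { header = $0; next }
{
    outfile = $1 ".tsv"
    if (!seen[$1]++) print header > outfile
    print > outfile
}' myfile.tsv

# Inspect one of the results.
cat chr2.tsv
```

With this input, chr2.tsv should contain the header followed by the two chr2 rows, and five files are created in total.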
NR is the record number, so NR == 1 is only true when the record number is one (i.e. the first line). In this block, the whole line $0 is saved to the variable header. next skips any other blocks and moves to the next line. This means that the second block (which would otherwise be run unconditionally on every record) is skipped.
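The effect of next can be seen in isolation with a small sketch like the one below. The three input lines and the variable name out are made up for the illustration; the second block prints the record number to show which records reach it.

```shell
out=$(printf 'chr value region\nchr1 223 554\nchr2 433 444\n' |
awk '
NR == 1 { header = $0; next }    # first record: remember it and skip the block below
{ print NR ": " $0 }             # only reached by records 2, 3, ...
END { print "header was: " header }
')
printf '%s\n' "$out"
```

The first record never reaches the second block, so the numbered lines start at 2, and the saved header is still available in the END block.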
For every other line in the file, the output filename is built using the value of the first field. The array seen keeps track of the values of $1. !seen[$1]++ is only true the first time a given value of $1 is seen, as the value of seen[$1] is incremented every time it is checked. If the value of $1 has not yet been seen, the header is written to the output file. Every line is then written to its output file.
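The !seen[$1]++ idiom also works on its own as a pattern. As a rough sketch (the input is just the chr column from the sample data, and first_seen is a name chosen for the demo), using it with no action prints only the first line for each value:

```shell
first_seen=$(printf 'chr1\nchr2\nchr2\nchrX\nchrY\nchrY\nchrM\n' |
awk '!seen[$1]++')    # true only when seen[$1] was still 0 before the increment
printf '%s\n' "$first_seen"
```

The repeated chr2 and chrY lines are dropped because seen[$1] is already non-zero by the time they are checked, which is the same test that decides when the header gets written.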
Upvotes: 5