justaguy

Reputation: 3022

awk duplicates lines when skipping lines starting with # symbol

In the awk below, is there a way to process only the lines below the #CHROM header line, but still print every line in the output? The problem I am having is that when I skip all lines starting with #, they do print in the output, but the other lines, without the #, get duplicated. My data file has thousands of lines, but only the one format shown below is updated by the awk. Thank you :).

file tab-delimited

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
chr1    948797  .   C   .   0   PASS    DP=159;END=948845;MAX_DP=224;MIN_DP=95  GT:DP:MIN_DP:MAX_DP 0/0:159:95:224

awk

awk '!/^#/
BEGIN {FS = OFS = "\t"
}
NF == 10 {
split($8, a, /[=;]/)
$11 = $12 = $13 = $14 = $15 = $18 = "."
$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
$17 = "homref"
}
1' out > ref

current output tab-delimited

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
chr1    948797  .   C   .   0   PASS    DP=159;END=948845;MAX_DP=224;MIN_DP=95  GT:DP:MIN_DP:MAX_DP 0/0:159:95:224   --- duplicated line ---
chr1    948797  .   C   .   0   PASS    DP=159;END=948845;MAX_DP=224;MIN_DP=95  GT:DP:MIN_DP:MAX_DP 0/0:159:95:224  .   .   .   .   .   159 homref  .    --- this line is correct ---

desired output tab-delimited

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
chr1    948797  .   C   .   0   PASS    DP=159;END=948845;MAX_DP=224;MIN_DP=95  GT:DP:MIN_DP:MAX_DP 0/0:159:95:224  .   .   .   .   .   159 homref  .

Upvotes: 1

Views: 68

Answers (1)

Ed Morton

Reputation: 203502

Your first statement:

!/^#/

is a pattern with no action, so it says "print every line that does not start with #", and your last:

1

says "print every line". Hence the duplicate lines in the output.
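As a small illustration of that point, the sketch below feeds two made-up lines (`#header` and `data`, not taken from the real file) through just the two printing rules from the question's script, the bare `!/^#/` pattern and the trailing `1`. The variable name `dup` is only for this example.

```shell
# Two hypothetical input lines: one comment line and one data line.
# Rule 1: the bare pattern !/^#/ has no action, so awk prints each matching line.
# Rule 2: the constant pattern 1 is always true, so awk prints every line.
dup=$(printf '#header\ndata\n' | awk '!/^#/
1')

# The comment line appears once; the data line appears twice.
printf '%s\n' "$dup"
```

The data line satisfies both rules, so it is printed once by each of them.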

To modify only the lines that don't start with # while still printing every line, use:

!/^#/ { do stuff }
1
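Applied to the script from the question, that pattern might look like the sketch below. It is only one possible arrangement, not necessarily the exact script the asker ended up with. The file name `sample.vcf`, the shortened INFO and FORMAT values, and the `SAMPLE` column in the header are assumptions made so the example is self-contained.

```shell
# Build a small tab-delimited sample (hypothetical, shortened data):
# a meta line, a 10-column header, and one 10-field record.
printf '##meta\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n' > sample.vcf
printf 'chr1\t948797\t.\tC\t.\t0\tPASS\tDP=159;END=948845\tGT:DP\t0/0:159\n' >> sample.vcf

result=$(awk '
BEGIN { FS = OFS = "\t" }
# Only lines that do not start with # and have 10 fields are modified.
!/^#/ && NF == 10 {
    split($8, a, /[=;]/)                      # a[1] is the first INFO key, a[2] its value
    $11 = $12 = $13 = $14 = $15 = $18 = "."
    $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
    $17 = "homref"
}
# Print every line exactly once, modified or not.
1' sample.vcf)

printf '%s\n' "$result"
rm -f sample.vcf
```

Because `!/^#/` now guards an action instead of standing alone, it no longer prints anything itself; the final `1` is the only rule that produces output.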

Upvotes: 1
