rusalkaguy

Reputation: 143

Is there a tool or script to split a phased VCF into two separate haploid VCFs, one for each haplotype? (linux)

I have a phased VCF file generated by Longshot from a MinION sequencing run of diploid human DNA. I would like to split it into two haploid files, one for haplotype 1 and one for haplotype 2.

Do any of the VCF toolkits provide this function out of the box?

3 variants from my file:

##fileformat=VCFv4.2
##source=Longshot v0.4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    161499264   .   G   C   500.00  PASS    DP=55;AC=27,27  GT:GQ:PS:UG:UQ  0|1:500.00:161499264:0/1:147.24
chr1    161502368   .   A   G   500.00  PASS    DP=43;AC=4,38   GT:GQ:PS:UG:UQ  1/1:342.00:.:1/1:44.91
chr1    161504083   .   A   C   346.17  PASS    DP=39;AC=19,17  GT:GQ:PS:UG:UQ  1|0:346.17:161499264:0/1:147.24

Upvotes: 1

Views: 1209

Answers (2)

ekerde

Reputation: 48

I didn't find a tool, so I coded something (not pretty, but it works). It writes a single VCF in which each sample column is split into two haploid columns:

awk '{if ($1 ~ /^##/) print; \
else if ($1=="#CHROM") { ORS="\t"; for (i=1;i<10;i++) print $i; \
for (i=10;i<NF;i++) {print $i"_A\t"$i"_B"}; ORS="\n"; print $NF"_A\t"$NF"_B"} \
else {ORS="\t"; for (i=1;i<10;i++) print $i; \
for (i=10;i<NF;i++) print substr($i,1,1)"\t"substr($i,3,1); \
ORS="\n"; print substr($NF,1,1)"\t"substr($NF,3,1)} }' VCF_FILE

The first line prints the ## header lines unchanged.

On the third line, I duplicate the name of each individual (as NAME_A and NAME_B, but you can change the suffixes).

On the fifth line, I keep only the two GT alleles with substr(). This assumes single-character alleles (0-9 or .). If you want to keep the other fields too, you can use substr() again. For example, substr($i,1,1) substr($i,4) keeps the first allele followed by the remaining fields.
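
If you would rather end up with two separate haploid VCFs, here is a minimal Python sketch of the same idea. It is only a sketch under some assumptions of mine: the function names split_sample and split_vcf are made up for illustration, GT is assumed to be the first FORMAT field, and unphased heterozygous calls are set to "." on both haplotypes because they cannot be assigned to either one. It also does not rewrite the ##FORMAT header lines and ignores phase sets (PS), so alleles from different phase sets are not guaranteed to lie on the same physical haplotype.

```python
def split_sample(field):
    """Split one sample field, e.g. '0|1:500.00:...', into two haploid fields."""
    gt, sep, rest = field.partition(":")      # GT is assumed to come first
    phased = "|" in gt
    alleles = gt.split("|") if phased else gt.split("/")
    if len(alleles) != 2:
        # Not a diploid call (already haploid, or '.'): copy it unchanged.
        alleles = [gt, gt]
    elif not phased and alleles[0] != alleles[1]:
        # Unphased heterozygous call: haplotype unknown, mark as missing.
        alleles = [".", "."]
    # Re-attach the remaining subfields (GQ, PS, ...) to each allele.
    return alleles[0] + sep + rest, alleles[1] + sep + rest


def split_vcf(lines):
    """Return (hap1_lines, hap2_lines) for an iterable of VCF lines."""
    hap1, hap2 = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                          # skip blank lines
        if line.startswith("#"):
            # Meta and header lines are copied to both outputs as they are.
            hap1.append(line)
            hap2.append(line)
            continue
        cols = line.split("\t")
        pairs = [split_sample(s) for s in cols[9:]]
        hap1.append("\t".join(cols[:9] + [p[0] for p in pairs]))
        hap2.append("\t".join(cols[:9] + [p[1] for p in pairs]))
    return hap1, hap2
```

To use it, pass an open file to split_vcf() and write each returned list, joined with newlines, to its own output file. The input must be tab-separated, as a real VCF is.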

Upvotes: 0

Isin Altinkaya

Reputation: 459

To extract haplotypes from phased VCF files, you can use samplereplay from RTG Tools to generate the haplotype SDF file, then sdf2sam, sdf2fasta, or sdf2fastq to obtain the corresponding files for the phased haplotypes.

Edit: I hadn't noticed that you needed a haploid VCF file. The method above should still work if you first convert the result to SAM and then back to a VCF.

Upvotes: 0
