Reputation: 143
I have a phased VCF file generated by Longshot from a MinION sequencing run of diploid human DNA. I would like to split it into two haploid files, one for haplotype 1 and one for haplotype 2.
Do any of the VCF toolkits provide this function out of the box?
3 variants from my file:
##fileformat=VCFv4.2
##source=Longshot v0.4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 161499264 . G C 500.00 PASS DP=55;AC=27,27 GT:GQ:PS:UG:UQ 0|1:500.00:161499264:0/1:147.24
chr1 161502368 . A G 500.00 PASS DP=43;AC=4,38 GT:GQ:PS:UG:UQ 1/1:342.00:.:1/1:44.91
chr1 161504083 . A C 346.17 PASS DP=39;AC=19,17 GT:GQ:PS:UG:UQ 1|0:346.17:161499264:0/1:147.24
Upvotes: 1
Views: 1209
Reputation: 48
I didn't find a tool for this, so I wrote a short awk script (not pretty, but it works):
awk 'BEGIN { FS = OFS = "\t" }   # VCF columns are tab-separated
/^##/ { print; next }            # meta-information lines are copied unchanged
$1 == "#CHROM" {                 # column header: duplicate each sample name
    for (i = 10; i <= NF; i++) $i = $i "_A" OFS $i "_B"
    print; next
}
{                                # records: split each genotype into its two alleles
    for (i = 10; i <= NF; i++) $i = substr($i, 1, 1) OFS substr($i, 3, 1)
    print
}' VCF_FILE
The first rule prints the ## header lines unchanged.
The second rule duplicates each sample name in the #CHROM line (as NAME_A and NAME_B, but you can change the suffixes).
The last rule keeps only the two GT alleles of each sample, using substr(). Note that substr() positions start at 1 in awk, and that this assumes single-digit allele indices.
If you want to keep the other FORMAT fields as well, you can extend the substr() calls.
For example, substr($i,1,1) substr($i,4) keeps the first allele followed by the remaining fields.
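Since the question asks for two separate files, here is a minimal sketch of the same substr()/split idea that writes one haploid VCF per haplotype. It assumes a single-sample VCF with GT as the first FORMAT field. The file names example.vcf, hap1.vcf and hap2.vcf are placeholders, and skipping unphased heterozygous calls is my own choice, not something Longshot or any toolkit prescribes.

```shell
# Build a tiny tab-separated test input modelled on the records in the question.
printf '%s\n' '##fileformat=VCFv4.2' > example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n' >> example.vcf
printf 'chr1\t161499264\t.\tG\tC\t500.00\tPASS\tDP=55;AC=27,27\tGT:GQ:PS:UG:UQ\t0|1:500.00:161499264:0/1:147.24\n' >> example.vcf
printf 'chr1\t161502368\t.\tA\tG\t500.00\tPASS\tDP=43;AC=4,38\tGT:GQ:PS:UG:UQ\t1/1:342.00:.:1/1:44.91\n' >> example.vcf
printf 'chr1\t161504083\t.\tA\tC\t346.17\tPASS\tDP=39;AC=19,17\tGT:GQ:PS:UG:UQ\t1|0:346.17:161499264:0/1:147.24\n' >> example.vcf

awk -v h1=hap1.vcf -v h2=hap2.vcf 'BEGIN { FS = OFS = "\t" }
/^#/ { print > h1; print > h2; next }       # copy every header line to both outputs
{
    split($10, f, ":")                       # f[1] is the GT subfield
    if (split(f[1], a, "[|]") != 2) {        # genotype is not phased
        # keep unphased homozygous calls, skip ambiguous unphased hets
        if (split(f[1], a, "/") != 2 || a[1] != a[2]) next
    }
    rest = substr($10, length(f[1]) + 1)     # ":GQ:PS:..." tail, left unchanged
    $10 = a[1] rest; print > h1              # left allele goes to haplotype 1
    $10 = a[2] rest; print > h2              # right allele goes to haplotype 2
}' example.vcf
```

One caveat: the left/right order of alleles is only consistent within a phase set (the PS field), so "haplotype 1" in one phase set is not necessarily the same chromosome copy as "haplotype 1" in another.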
Upvotes: 0
Reputation: 459
To extract haplotypes from a phased VCF file, you can use samplereplay from RTG Tools to generate the haplotype SDF, then sdf2sam, sdf2fasta, or sdf2fastq to obtain the corresponding files for the phased haplotypes.
Edit: I hadn't noticed that you needed haploid VCF files. The method above should still work if you first convert the output to SAM and then back to VCF.
Upvotes: 0