Reputation: 101
I would like to get the number of individuals in each population from a VCF file, with the populations listed in the order in which they appear in the file. The fields of my file look like this:
##fileformat=VCFv4.2
##fileDate=20180425
##source="Stacks v1.45"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
##INFO=<ID=locori,Number=1,Type=Character,Description="Orientation the corresponding Stacks locus aligns in">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHALIFOUR_2003_ChHis-1 CHALIFOUR_2003_ChHis-13 CHALIFOUR_2003_ChHis-14 CHALIFOUR_2003_ChHis-15
un 1027 13_65 C T . PASS NS=69;AF=0.188;locori=p GT:DP:AD 0/1:16:9,7 0/0:39:39,0 0/0:17:17,0 0/0:39:39,0
See the example vcf file here.
For example, in the file that I have linked to, I have two populations, Chalifour 2003 and Chalifour 2015. Individuals have a prefix "CHALIFOUR_2003..." that identifies this.
I would like to be able to extract something like: Chalifour_2003* 35, Chalifour_2015* 45
Here the "35" and "45" are the number of individuals in each population (these numbers are made up). I don't care at all about the format of the output; I just need the numbers, and it is important that the populations are listed in the order in which they appear in the file.
Any suggestions for avenues to try to get this information would be much appreciated.
Upvotes: 0
Views: 1369
Reputation: 2056
Using the data.table package to read in the VCF file, you can do the following:
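library(data.table)
# Start reading at the "#CHROM" header row, skipping the "##" meta lines
df <- fread("~/Downloads/ChaliNoOddsWithOuts.vcf", skip = "#CHROM")
# Sample names are every column after the nine fixed VCF columns
samples <- colnames(df)[-c(1:9)]
# Population label = first two "_"-separated fields; levels kept in file order
pops <- sub("^([^_]+_[^_]+)_.*$", "\\1", samples)
table(factor(pops, levels = unique(pops)))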
If you don't insist on using R, then this bash one-liner does the job:
grep -m1 "^#CHROM" file.vcf | tr "\t" "\n" | tail -n +10 | cut -f1,2 -d'_' | uniq -c
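Note that uniq -c counts runs of adjacent identical lines, so the one-liner relies on each population's samples sitting next to each other in the header. If that is not guaranteed, something like the awk sketch below might be more robust. It assumes a tab-delimited #CHROM line and that the population label is the first two "_"-separated fields of the sample name; count_pops is a made-up helper name, not part of any VCF tool.

```shell
# count_pops: hypothetical helper (not from any VCF toolkit).
# Prints "<population> <count>" in order of first appearance in the
# #CHROM header, even if a population's samples are not contiguous.
count_pops() {
  awk -F'\t' '
    /^#CHROM/ {
      # Columns 1-9 are the fixed VCF fields; samples start at column 10.
      for (i = 10; i <= NF; i++) {
        n = split($i, parts, "_")
        # Assumed naming scheme: POPNAME_YEAR_individual
        pop = (n >= 2) ? (parts[1] "_" parts[2]) : $i
        # Remember each population the first time it is seen.
        if (!(pop in count)) order[++k] = pop
        count[pop]++
      }
      for (j = 1; j <= k; j++) print order[j], count[order[j]]
      exit  # the header is all we need; skip the genotype rows
    }
  ' "$1"
}
```

Call it as count_pops file.vcf. Because it exits right after the header line, it does not read the genotype rows at all.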
Upvotes: 0