Ella Bowles

Reputation: 101

How can I count the number of individuals in each population, listed in file order, from a vcf file?

I would like to get the number of individuals in each population, in the order in which the populations appear in a vcf file. The fields of my file look like this:

##fileformat=VCFv4.2
##fileDate=20180425
##source="Stacks v1.45"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
##INFO=<ID=locori,Number=1,Type=Character,Description="Orientation the corresponding Stacks locus aligns in">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CHALIFOUR_2003_ChHis-1  CHALIFOUR_2003_ChHis-13 CHALIFOUR_2003_ChHis-14 CHALIFOUR_2003_ChHis-15
un  1027    13_65   C   T   .   PASS    NS=69;AF=0.188;locori=p GT:DP:AD    0/1:16:9,7  0/0:39:39,0 0/0:17:17,0 0/0:39:39,0

See the linked example vcf file.

For example, in the file that I have linked to, I have two populations, Chalifour 2003 and Chalifour 2015. Individuals have a prefix "CHALIFOUR_2003..." that identifies this.

I would like to be able to extract something like: CHALIFOUR_2003* 35, CHALIFOUR_2015* 45

Here "35" and "45" are the number of individuals in each population (these numbers are made up). I don't care at all about the format of the output; I just need the numbers, and it is important that the populations are listed in the order in which they appear in the file.

Any suggestions for avenues to try to get this information would be much appreciated.

Upvotes: 0

Views: 1369

Answers (1)

GordonShumway

Reputation: 2056

Using the data.table package to read in the vcf file you can do the following:

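library(data.table)
df <- fread("~/Downloads/ChaliNoOddsWithOuts.vcf")
# sample names are the columns after the 9 fixed VCF columns
samples <- colnames(df)[-c(1:9)]
# keep everything before the last underscore, e.g. "CHALIFOUR_2003"
pops <- gsub("(.*_.*)_.*", "\\1", samples)
# table() sorts alphabetically; factor levels by first appearance keep file order
table(factor(pops, levels = unique(pops)))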

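If you don't insist on using R, this bash one-liner does the job, as long as the samples of each population sit next to each other in the header (uniq -c only counts consecutive duplicates):

grep "^#CHROM" file.vcf | tr "\t" "\n" | tail -n +10 | cut -f1,2 -d'_' | uniq -c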
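
Because uniq -c only merges adjacent duplicates, a population whose samples are scattered through the header would be reported more than once. A minimal sketch of a variant that uses awk to count per prefix while keeping the order of first appearance is below. The function name `count_pops` is made up for illustration, and the sketch assumes sample names follow the `POPULATION_YEAR_sample` pattern shown in the question.

```shell
# count_pops is a hypothetical helper name, not part of any tool.
# Assumes sample names look like POPULATION_YEAR_sample (two underscores).
count_pops() {
  grep -m1 '^#CHROM' "$1" |   # the single header row that lists the samples
    tr '\t' '\n' |            # one column name per line
    tail -n +10 |             # drop the 9 fixed VCF columns
    cut -d'_' -f1,2 |         # keep the population prefix, e.g. CHALIFOUR_2003
    awk '
      !($0 in n) { order[++k] = $0 }   # remember the first appearance
      { n[$0]++ }                      # count every sample
      END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }
    '
}
```

Calling `count_pops file.vcf` prints one line per population, prefix first and count second, in the order the populations first appear in the header.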

Upvotes: 0
