Parse VCF file's INFO to an R dataframe

Question

I am trying to create a data frame from a VCF file that includes just some elements of the INFO field. The problem is that the values of those elements are not always in the same position, so when I load the VCF and split the INFO field, the same element ends up in different columns.

For example:

Pos         Score       Strand     Length     
CIPOS=0     SCORE=1     STRAND=+   LEN=634
SCORE=89    STRAND=-  LEN=567      UTR=+
CIPOS=9     SCORE=1     STRAND=+   LEN=0
CIPOS=8     SCORE=1     STRAND=+   LEN=1
STRAND=+    LEN=555     UTR=+      B

As you can see, some rows are shifted: the VCF has no placeholder for an absent INFO element, and the INFO field is read as a single string, so when splitting it I don't know how to tell R to write an NA in the corresponding row of each column.

Is there any way to write each "SCORE=" value in the Score column, each "STRAND=" value in the Strand column, and so on?

Thanks in advance!

StupidWolf · Accepted Answer

There are packages meant for this, for example VariantAnnotation from Bioconductor. Once you read in the VCF file, the INFO field is parsed into a DataFrame with one column per INFO key, and you can access it like below:

library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
info(vcf)

DataFrame with 10376 rows and 22 columns
                 LDAF   AVGPOST       RSQ     ERATE     THETA
                
rs7410291      0.3431     0.989    0.9856     0.002     5e-04
rs147922003    0.0091    0.9963    0.8398     5e-04    0.0011
rs114143073    0.0098    0.9891    0.5919     7e-04     8e-04
rs141778433    0.0062     0.995    0.6756     9e-04     3e-04
rs182170314    0.0041    0.9981    0.7909     7e-04     4e-04
...               ...       ...       ...       ...       ...
rs187302552     9e-04    0.9992    0.5571     3e-04    0.0026
rs9628178      0.0727    0.9997    0.9983     3e-04    0.0011
rs5770892      0.3678    0.9868    0.9784    0.0021     7e-04
rs144055359    0.0011    0.9987    0.5323     5e-04     4e-04
rs114526001    0.0543    0.9622    0.7595    0.0021     5e-04
                    CIEND         CIPOS       END        HOMLEN
               
rs7410291           NA,NA         NA,NA        NA              
rs147922003         NA,NA         NA,NA        NA              
rs114143073         NA,NA         NA,NA        NA              
rs141778433         NA,NA         NA,NA        NA              
rs182170314         NA,NA         NA,NA        NA              
...                   ...           ...       ...           ...
rs187302552         NA,NA         NA,NA        NA              
rs9628178           NA,NA         NA,NA        NA              
rs5770892           NA,NA         NA,NA        NA              
rs144055359         NA,NA         NA,NA        NA              
rs114526001         NA,NA         NA,NA        NA              
                     HOMSEQ     SVLEN      SVTYPE            AC
               
rs7410291                          NA          NA           751
rs147922003                        NA          NA            20
rs114143073                        NA          NA            20
rs141778433                        NA          NA            12
rs182170314                        NA          NA             8
...                     ...       ...         ...           ...
rs187302552                        NA          NA             1
rs9628178                          NA          NA           158
rs5770892                          NA          NA           801
rs144055359                        NA          NA             1
rs114526001                        NA          NA           113
                   AN          AA        AF    AMR_AF    ASN_AF
                
rs7410291        2184           N      0.34       0.2      0.19
rs147922003      2184           c      0.01      0.01        NA
rs114143073      2184           G      0.01    0.0028      0.02
rs141778433      2184           C      0.01      0.01        NA
rs182170314      2184           C    0.0037      0.01        NA
...               ...         ...       ...       ...       ...
rs187302552      2184           a     5e-04        NA    0.0017
rs9628178        2184           a      0.07      0.03      0.01
rs5770892        2184           a      0.37      0.32      0.38
rs144055359      2184           g     5e-04        NA        NA
rs114526001      2184           g      0.05      0.01      0.01
               AFR_AF    EUR_AF          VT       SNPSOURCE
               
rs7410291        0.83      0.22         SNP          LOWCOV
rs147922003      0.02      0.01         SNP          LOWCOV
rs114143073      0.01      0.01         SNP          LOWCOV
rs141778433      0.02        NA         SNP          LOWCOV
rs182170314      0.01        NA         SNP          LOWCOV
...               ...       ...         ...             ...
rs187302552        NA        NA         SNP          LOWCOV
rs9628178        0.17      0.08         SNP          LOWCOV
rs5770892        0.59      0.23         SNP          LOWCOV
rs144055359        NA    0.0013         SNP          LOWCOV
rs114526001      0.16      0.04         SNP          LOWCOV
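
Since the question asks for only some INFO elements, the import itself can be restricted to them. The following is a minimal sketch using `ScanVcfParam()` on the same example file; the field names `AF` and `CIPOS` are chosen purely for illustration, and you would substitute the IDs declared in your own VCF header (for instance `SCORE` or `STRAND`, assuming they are defined there).

```r
library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# Ask readVcf() to import only the listed INFO fields.
# "AF" and "CIPOS" are illustrative; use the IDs from your own header.
param <- ScanVcfParam(info = c("AF", "CIPOS"))
vcf_subset <- readVcf(fl, "hg19", param = param)

# info() now holds just the requested columns, aligned by key;
# records that lack a key show NA in that column.
info(vcf_subset)
```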

You can convert it to a data.frame and access the columns. This example has no strand information, but the same approach should work if your file has it:

df = as.data.frame(info(vcf))
df$CIPOS
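
If installing Bioconductor is not an option, the shift described in the question can also be avoided in base R by matching each value to its key name rather than to its position. This is only a rough sketch: `parse_info()` is a hypothetical helper, not part of any package, and it assumes the raw INFO strings use the standard `;` separator with `KEY=value` entries (bare flags such as `B` are recorded as "TRUE").

```r
# Hypothetical helper: build a data.frame from raw INFO strings,
# looking values up by key name instead of by position.
parse_info <- function(info, keys) {
  # Split each INFO string into its "KEY=value" (or bare flag) entries.
  entries <- strsplit(info, ";", fixed = TRUE)
  rows <- lapply(entries, function(e) {
    # The key is everything before the first "=", the value everything after.
    k <- sub("=.*$", "", e)
    v <- ifelse(grepl("=", e, fixed = TRUE), sub("^[^=]*=", "", e), "TRUE")
    # Pick the requested keys by name; keys that are absent become NA.
    unname(v[match(keys, k)])
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- keys
  out
}

# Made-up INFO strings modelled on the rows in the question.
info <- c(
  "CIPOS=0;SCORE=1;STRAND=+;LEN=634",
  "SCORE=89;STRAND=-;LEN=567;UTR=+",
  "STRAND=+;LEN=555;UTR=+;B"
)
res <- parse_info(info, c("CIPOS", "SCORE", "STRAND", "LEN"))
print(res)
```

All columns come back as character, so numeric fields still need converting, for example with `as.integer(res$SCORE)`. Multi-valued entries such as `CIPOS=-5,5` are kept as a single string here, which is one reason the VariantAnnotation route is usually more robust.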
