Jeni
Jeni

Reputation: 968

Parse VCF file's INFO to an R dataframe

I am trying to create a dataframe from a vcf file including just some elements from INFO field. The problem is that values of those elemnts are not always in the same position, so when I load the VCF and split INFO field, I get those specific elements in different columns.

For example:

Pos         Score       Strand     Length     
CIPOS=0     SCORE=1     STRAND=+   LEN=634
SCORE=89    STRAND=-  LEN=567      UTR=+
CIPOS=9     SCORE=1     STRAND=+   LEN=0
CIPOS=8     SCORE=1     STRAND=+   LEN=1
STRAND=+    LEN=555     UTR=+      B

As you can see, some rows are shifted, because there is no symbol in the vcf for the absence of some INFO element, and the field info is readed as a string, so when splitting I don't know how to tell R to write an NA in the corresponding row of each column.

Is there any way to write each "SCORE=" value in Score column, each "STRAND=" value in Strand column, etc?

Thanks in advance!

Upvotes: 2

Views: 4931

Answers (2)

StupidWolf
StupidWolf

Reputation: 46978

There are packages meant for this, for example VariantAnnotation from Bioconductor. Once you read in the vcf file, the info is packed into a data.frame and you can assess it like below:

library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
info(vcf)

DataFrame with 10376 rows and 22 columns
                 LDAF   AVGPOST       RSQ     ERATE     THETA
            <numeric> <numeric> <numeric> <numeric> <numeric>
rs7410291      0.3431     0.989    0.9856     0.002     5e-04
rs147922003    0.0091    0.9963    0.8398     5e-04    0.0011
rs114143073    0.0098    0.9891    0.5919     7e-04     8e-04
rs141778433    0.0062     0.995    0.6756     9e-04     3e-04
rs182170314    0.0041    0.9981    0.7909     7e-04     4e-04
...               ...       ...       ...       ...       ...
rs187302552     9e-04    0.9992    0.5571     3e-04    0.0026
rs9628178      0.0727    0.9997    0.9983     3e-04    0.0011
rs5770892      0.3678    0.9868    0.9784    0.0021     7e-04
rs144055359    0.0011    0.9987    0.5323     5e-04     4e-04
rs114526001    0.0543    0.9622    0.7595    0.0021     5e-04
                    CIEND         CIPOS       END        HOMLEN
            <IntegerList> <IntegerList> <integer> <IntegerList>
rs7410291           NA,NA         NA,NA        NA              
rs147922003         NA,NA         NA,NA        NA              
rs114143073         NA,NA         NA,NA        NA              
rs141778433         NA,NA         NA,NA        NA              
rs182170314         NA,NA         NA,NA        NA              
...                   ...           ...       ...           ...
rs187302552         NA,NA         NA,NA        NA              
rs9628178           NA,NA         NA,NA        NA              
rs5770892           NA,NA         NA,NA        NA              
rs144055359         NA,NA         NA,NA        NA              
rs114526001         NA,NA         NA,NA        NA              
                     HOMSEQ     SVLEN      SVTYPE            AC
            <CharacterList> <integer> <character> <IntegerList>
rs7410291                          NA          NA           751
rs147922003                        NA          NA            20
rs114143073                        NA          NA            20
rs141778433                        NA          NA            12
rs182170314                        NA          NA             8
...                     ...       ...         ...           ...
rs187302552                        NA          NA             1
rs9628178                          NA          NA           158
rs5770892                          NA          NA           801
rs144055359                        NA          NA             1
rs114526001                        NA          NA           113
                   AN          AA        AF    AMR_AF    ASN_AF
            <integer> <character> <numeric> <numeric> <numeric>
rs7410291        2184           N      0.34       0.2      0.19
rs147922003      2184           c      0.01      0.01        NA
rs114143073      2184           G      0.01    0.0028      0.02
rs141778433      2184           C      0.01      0.01        NA
rs182170314      2184           C    0.0037      0.01        NA
...               ...         ...       ...       ...       ...
rs187302552      2184           a     5e-04        NA    0.0017
rs9628178        2184           a      0.07      0.03      0.01
rs5770892        2184           a      0.37      0.32      0.38
rs144055359      2184           g     5e-04        NA        NA
rs114526001      2184           g      0.05      0.01      0.01
               AFR_AF    EUR_AF          VT       SNPSOURCE
            <numeric> <numeric> <character> <CharacterList>
rs7410291        0.83      0.22         SNP          LOWCOV
rs147922003      0.02      0.01         SNP          LOWCOV
rs114143073      0.01      0.01         SNP          LOWCOV
rs141778433      0.02        NA         SNP          LOWCOV
rs182170314      0.01        NA         SNP          LOWCOV
...               ...       ...         ...             ...
rs187302552        NA        NA         SNP          LOWCOV
rs9628178        0.17      0.08         SNP          LOWCOV
rs5770892        0.59      0.23         SNP          LOWCOV
rs144055359        NA    0.0013         SNP          LOWCOV
rs114526001      0.16      0.04         SNP          LOWCOV

You can convert to a data.frame and assess the columns, in this example there's no strand information, but it should work if you have strand:

df = as.data.frame(info(vcf))
df$CIPOS

Upvotes: 1

MKR
MKR

Reputation: 1690

I am not sure if this is the most straight-forward way of doing it, but it should work.

First of all, try to post your data using dput(yourdata). That's easier for people to then work with your specific example.

I assume that each row consists of one "sample", so the trick is to first add the sample identifier, then put it into the long format and then back to a wide format.

The data:

a <- data.frame(Pos = c("CIPOS=0", 
                            "SCORE=89",
                            "CIPOS=9",
                            "CIPOS=8", 
                            "STRAND=+"),
                    Score = c("SCORE=1",
                              "STRAND=-", 
                              "SCORE=1", 
                              "SCORE=1", 
                              "LEN=555"),
                    Strand = c("STRAND=+",
                               "LEN=567",  
                               "STRAND=+",
                               "STRAND=+",
                               "UTR=+")
    )

dput(a)
    #> structure(list(Pos = structure(c(1L, 4L, 3L, 2L, 5L), .Label = c("CIPOS=0", 
    #> "CIPOS=8", "CIPOS=9", "SCORE=89", "STRAND=+"), class = "factor"), 
    #>     Score = structure(c(2L, 3L, 2L, 2L, 1L), .Label = c("LEN=555", 
    #>     "SCORE=1", "STRAND=-"), class = "factor"), Strand = structure(c(2L, 
    #>     1L, 2L, 2L, 3L), .Label = c("LEN=567", "STRAND=+", "UTR=+"
    #>     ), class = "factor")), class = "data.frame", row.names = c(NA, 
    #> -5L))

And then if you run this code:

library(tidyverse)

    a %>% mutate(sample = row_number()) %>% 
      pivot_longer(-sample) %>%
      # here you have to unselect the name column, since this is actually wrong
      select(-name) %>% 
      # separate the strings that contain the data
      separate(value, into = c("group", "value"), sep = "=") %>% 
      # put it back to wide format
      pivot_wider( names_from = group, 
                   values_from = value)
    #> # A tibble: 5 x 6
    #>   sample CIPOS SCORE STRAND LEN   UTR  
    #>    <int> <chr> <chr> <chr>  <chr> <chr>
    #> 1      1 0     1     +      <NA>  <NA> 
    #> 2      2 <NA>  89    -      567   <NA> 
    #> 3      3 9     1     +      <NA>  <NA> 
    #> 4      4 8     1     +      <NA>  <NA> 
    #> 5      5 <NA>  <NA>  +      555   +

Created on 2020-03-04 by the reprex package (v0.3.0)

Upvotes: 0

Related Questions