Reputation: 968
I am trying to create a dataframe from a vcf file including just some elements from INFO field. The problem is that values of those elemnts are not always in the same position, so when I load the VCF and split INFO field, I get those specific elements in different columns.
For example:
Pos Score Strand Length
CIPOS=0 SCORE=1 STRAND=+ LEN=634
SCORE=89 STRAND=- LEN=567 UTR=+
CIPOS=9 SCORE=1 STRAND=+ LEN=0
CIPOS=8 SCORE=1 STRAND=+ LEN=1
STRAND=+ LEN=555 UTR=+ B
As you can see, some rows are shifted, because there is no symbol in the vcf for the absence of some INFO element, and the field info is readed as a string, so when splitting I don't know how to tell R to write an NA in the corresponding row of each column.
Is there any way to write each "SCORE=" value in Score column, each "STRAND=" value in Strand column, etc?
Thanks in advance!
Upvotes: 2
Views: 4931
Reputation: 46978
There are packages meant for this, for example VariantAnnotation from Bioconductor. Once you read in the vcf file, the info is packed into a data.frame and you can assess it like below:
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
info(vcf)
DataFrame with 10376 rows and 22 columns
LDAF AVGPOST RSQ ERATE THETA
<numeric> <numeric> <numeric> <numeric> <numeric>
rs7410291 0.3431 0.989 0.9856 0.002 5e-04
rs147922003 0.0091 0.9963 0.8398 5e-04 0.0011
rs114143073 0.0098 0.9891 0.5919 7e-04 8e-04
rs141778433 0.0062 0.995 0.6756 9e-04 3e-04
rs182170314 0.0041 0.9981 0.7909 7e-04 4e-04
... ... ... ... ... ...
rs187302552 9e-04 0.9992 0.5571 3e-04 0.0026
rs9628178 0.0727 0.9997 0.9983 3e-04 0.0011
rs5770892 0.3678 0.9868 0.9784 0.0021 7e-04
rs144055359 0.0011 0.9987 0.5323 5e-04 4e-04
rs114526001 0.0543 0.9622 0.7595 0.0021 5e-04
CIEND CIPOS END HOMLEN
<IntegerList> <IntegerList> <integer> <IntegerList>
rs7410291 NA,NA NA,NA NA
rs147922003 NA,NA NA,NA NA
rs114143073 NA,NA NA,NA NA
rs141778433 NA,NA NA,NA NA
rs182170314 NA,NA NA,NA NA
... ... ... ... ...
rs187302552 NA,NA NA,NA NA
rs9628178 NA,NA NA,NA NA
rs5770892 NA,NA NA,NA NA
rs144055359 NA,NA NA,NA NA
rs114526001 NA,NA NA,NA NA
HOMSEQ SVLEN SVTYPE AC
<CharacterList> <integer> <character> <IntegerList>
rs7410291 NA NA 751
rs147922003 NA NA 20
rs114143073 NA NA 20
rs141778433 NA NA 12
rs182170314 NA NA 8
... ... ... ... ...
rs187302552 NA NA 1
rs9628178 NA NA 158
rs5770892 NA NA 801
rs144055359 NA NA 1
rs114526001 NA NA 113
AN AA AF AMR_AF ASN_AF
<integer> <character> <numeric> <numeric> <numeric>
rs7410291 2184 N 0.34 0.2 0.19
rs147922003 2184 c 0.01 0.01 NA
rs114143073 2184 G 0.01 0.0028 0.02
rs141778433 2184 C 0.01 0.01 NA
rs182170314 2184 C 0.0037 0.01 NA
... ... ... ... ... ...
rs187302552 2184 a 5e-04 NA 0.0017
rs9628178 2184 a 0.07 0.03 0.01
rs5770892 2184 a 0.37 0.32 0.38
rs144055359 2184 g 5e-04 NA NA
rs114526001 2184 g 0.05 0.01 0.01
AFR_AF EUR_AF VT SNPSOURCE
<numeric> <numeric> <character> <CharacterList>
rs7410291 0.83 0.22 SNP LOWCOV
rs147922003 0.02 0.01 SNP LOWCOV
rs114143073 0.01 0.01 SNP LOWCOV
rs141778433 0.02 NA SNP LOWCOV
rs182170314 0.01 NA SNP LOWCOV
... ... ... ... ...
rs187302552 NA NA SNP LOWCOV
rs9628178 0.17 0.08 SNP LOWCOV
rs5770892 0.59 0.23 SNP LOWCOV
rs144055359 NA 0.0013 SNP LOWCOV
rs114526001 0.16 0.04 SNP LOWCOV
You can convert to a data.frame and assess the columns, in this example there's no strand information, but it should work if you have strand:
df = as.data.frame(info(vcf))
df$CIPOS
Upvotes: 1
Reputation: 1690
I am not sure if this is the most straight-forward way of doing it, but it should work.
First of all, try to post your data using dput(yourdata)
. That's easier for people to then work with your specific example.
I assume that each row consists of one "sample", so the trick is to first add the sample identifier, then put it into the long format and then back to a wide format.
The data:
a <- data.frame(Pos = c("CIPOS=0",
"SCORE=89",
"CIPOS=9",
"CIPOS=8",
"STRAND=+"),
Score = c("SCORE=1",
"STRAND=-",
"SCORE=1",
"SCORE=1",
"LEN=555"),
Strand = c("STRAND=+",
"LEN=567",
"STRAND=+",
"STRAND=+",
"UTR=+")
)
dput(a)
#> structure(list(Pos = structure(c(1L, 4L, 3L, 2L, 5L), .Label = c("CIPOS=0",
#> "CIPOS=8", "CIPOS=9", "SCORE=89", "STRAND=+"), class = "factor"),
#> Score = structure(c(2L, 3L, 2L, 2L, 1L), .Label = c("LEN=555",
#> "SCORE=1", "STRAND=-"), class = "factor"), Strand = structure(c(2L,
#> 1L, 2L, 2L, 3L), .Label = c("LEN=567", "STRAND=+", "UTR=+"
#> ), class = "factor")), class = "data.frame", row.names = c(NA,
#> -5L))
And then if you run this code:
library(tidyverse)
a %>% mutate(sample = row_number()) %>%
pivot_longer(-sample) %>%
# here you have to unselect the name column, since this is actually wrong
select(-name) %>%
# separate the strings that contain the data
separate(value, into = c("group", "value"), sep = "=") %>%
# put it back to wide format
pivot_wider( names_from = group,
values_from = value)
#> # A tibble: 5 x 6
#> sample CIPOS SCORE STRAND LEN UTR
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 1 + <NA> <NA>
#> 2 2 <NA> 89 - 567 <NA>
#> 3 3 9 1 + <NA> <NA>
#> 4 4 8 1 + <NA> <NA>
#> 5 5 <NA> <NA> + 555 +
Created on 2020-03-04 by the reprex package (v0.3.0)
Upvotes: 0