Reputation: 361
I am working with a tsv file that contains information on genetic variants. Given that this information comes from multiple databases, my file contains duplicate data. In my case the databases are "dbSNP", which contains all kinds of variants (both benign and pathogenic ones, whose names begin with "rs") and "HGMD-PUBLIC" (which in my case only contains pathogenic variants, whose names begin with "CM").
Here is some sample data:
variant_name <- c("rs1320425680", "rs1217211965", "rs1257939387", "rs74315401", "CM890102")
variant_source <- c("dbSNP", "dbSNP", "dbSNP", "dbSNP", "HGMD_PUBLIC")
chromosome <- c(20, 20, 20, 20, 20)
position_start <- c(4699513, 4699516, 4699520, 4699525, 4699525)
position_end <- c(4699513, 4699516, 4699520, 4699525, 4699525)
ancestral_allele <- c("A", "G", "C", "C", "C")
clinical_significance <- c(NA, NA, NA, "pathogenic", NA)
variant_alleles <- c("A/C", "G/A/T", "C/G", "C/T", "HGMD_MUTATION")
example <- tibble(variant_name, variant_source, chromosome, position_start, position_end, ancestral_allele, clinical_significance, variant_alleles)
Here, the 4th and 5th rows are actually duplicates of the same variant from two different databases. It is important to me to know that the variant appears in both.
What I would like to do is to spread the data into "wide" format, with two new columns: dbSNP
and HGMD_PUBLIC
(which are currently found in variant_source
). The columns would contain the entry originally found in the column variant_name
, like this:
dbSNP <- c("rs1320425680", "rs1217211965", "rs1257939387", "rs74315401")
HGMD_PUBLIC <- c(NA, NA, NA, "CM890102")
chromosome <- c(20, 20, 20, 20)
position_start <- c(4699513, 4699516, 4699520, 4699525)
position_end <- c(4699513, 4699516, 4699520, 4699525)
ancestral_allele <- c("A", "G", "C", "C")
clinical_significance <- c(NA, NA, NA, "pathogenic")
variant_alleles <- c("A/C", "G/A/T", "C/G", "C/T")
result <- tibble(dbSNP, HGMD_PUBLIC, chromosome, position_start, position_end, ancestral_allele, clinical_significance, variant_alleles)
As you can see, the 5th row is no longer present, but the variant_name
has been moved to the dbSNP
or HGMD_PUBLIC
columns.
Perhaps the above could be achieved with the spread
function, making the rearrangement conditional on the name, but I don't know how to do that.
Upvotes: 1
Views: 43
Reputation: 101034
Here is a data.table
option using dcase
+ fcoalesce
dcast(
setDT(example),
... ~ variant_source,
value.var = "variant_name"
)[,
lapply(.SD, function(x) fcoalesce(as.list(x))),
by = chromosome:ancestral_allele
]
which gives
chromosome position_start position_end ancestral_allele
1: 20 4699513 4699513 A
2: 20 4699516 4699516 G
3: 20 4699520 4699520 C
4: 20 4699525 4699525 C
clinical_significance variant_alleles HGMD_PUBLIC dbSNP
1: <NA> A/C <NA> rs1320425680
2: <NA> G/A/T <NA> rs1217211965
3: <NA> C/G <NA> rs1257939387
4: pathogenic HGMD_MUTATION CM890102 rs74315401
Upvotes: 1