Combine rows with same content

Question

I am working with a tsv file that contains information on genetic variants. Given that this information comes from multiple databases, my file contains duplicate data. In my case the databases are "dbSNP", which contains all kinds of variants (both benign and pathogenic ones, whose names begin with "rs") and "HGMD-PUBLIC" (which in my case only contains pathogenic variants, whose names begin with "CM").

Here is some sample data:

variant_name <- c("rs1320425680", "rs1217211965", "rs1257939387", "rs74315401", "CM890102")
variant_source <- c("dbSNP", "dbSNP", "dbSNP", "dbSNP", "HGMD_PUBLIC")
chromosome <- c(20, 20, 20, 20, 20)
position_start <- c(4699513, 4699516, 4699520, 4699525, 4699525)
position_end <- c(4699513, 4699516, 4699520, 4699525, 4699525)
ancestral_allele <- c("A", "G", "C", "C", "C")
clinical_significance <- c(NA, NA, NA, "pathogenic", NA)
variant_alleles <- c("A/C", "G/A/T", "C/G", "C/T", "HGMD_MUTATION")

example <- tibble(variant_name, variant_source, chromosome, position_start, position_end, ancestral_allele, clinical_significance, variant_alleles)

Here, the 4th and 5th rows are actually duplicates of the same variant from two different databases. It is important to me to know that the variant appears in both.

What I would like to do is to spread the data into "wide" format, with two new columns: dbSNP and HGMD_PUBLIC (which are currently found in variant_source). The columns would contain the entry originally found in the column variant_name , like this:

dbSNP <- c("rs1320425680", "rs1217211965", "rs1257939387", "rs74315401")
HGMD_PUBLIC <- c(NA, NA, NA, "CM890102")
chromosome <- c(20, 20, 20, 20)
position_start <- c(4699513, 4699516, 4699520, 4699525)
position_end <- c(4699513, 4699516, 4699520, 4699525)
ancestral_allele <- c("A", "G", "C", "C")
clinical_significance <- c(NA, NA, NA, "pathogenic")
variant_alleles <- c("A/C", "G/A/T", "C/G", "C/T")

result <- tibble(dbSNP, HGMD_PUBLIC, chromosome, position_start, position_end, ancestral_allele, clinical_significance, variant_alleles)

As you can see, the 5th row is no longer present, but the variant_name has been moved to the dbSNP or HGMD_PUBLIC columns. Perhaps the above could be achieved with the spread function, making the rearrangement conditional on the name, but I don't know how to do that.

ThomasIsCoding · Accepted Answer

Here is a data.table option using dcase + fcoalesce

dcast(
  setDT(example),
  ... ~ variant_source,
  value.var = "variant_name"
)[,
  lapply(.SD, function(x) fcoalesce(as.list(x))),
  by = chromosome:ancestral_allele
]

which gives

   chromosome position_start position_end ancestral_allele
1:         20        4699513      4699513                A
2:         20        4699516      4699516                G
3:         20        4699520      4699520                C
4:         20        4699525      4699525                C
   clinical_significance variant_alleles HGMD_PUBLIC        dbSNP
1:                               A/C         rs1320425680
2:                             G/A/T         rs1217211965
3:                               C/G         rs1257939387
4:            pathogenic   HGMD_MUTATION    CM890102   rs74315401

Combine rows with same content

Answers (1)

Related Questions