Reputation: 663
I have a dataframe called mydf. For each row, I want to check whether the c.change value appears in Clinvar_ Name. If it does not, I want to blank out every column selected by grepl("Clinvar", colnames(mydf)) for that row.
This is my data:
mydf <- structure(c("chr1:8045045:A:G", "chr1:8045045:A:G", "chr1:8045045:A:G",
"chr1:17314702:C:T", "chr1:17314702:C:T", "chr1:17314702:C:T",
"c.501A>G", "c.441A>G", "c.414A>G", "c.2775G>A", "c.2658G>A",
"c.2790G>A", "NM_007262.5(PARK7):c.501A>G (p.Ala167=)", "NM_007262.5(PARK7):c.501A>G (p.Ala167=)",
"NM_007262.5(PARK7):c.501A>G (p.Ala167=)", "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)",
"NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)", "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)",
"single nucleotide variant", "single nucleotide variant", "single nucleotide variant",
"single nucleotide variant", "single nucleotide variant", "single nucleotide variant",
"HGNC:16369", "HGNC:16369", "HGNC:16369", "HGNC:30213", "HGNC:30213",
"HGNC:30213"), .Dim = 6:5, .Dimnames = list(NULL, c("VarID_build37",
"c.change", "Clinvar_ Name", "Clinvar_ Type", "Clinvar_ HGNC_ID"
)))
Result I want:
VarID_build37 c.change Clinvar_ Name Clinvar_ Type Clinvar_ HGNC_ID
"chr1:8045045:A:G" "c.501A>G" "NM_007262.5(PARK7):c.501A>G (p.Ala167=)" "single nucleotide variant" "HGNC:16369"
"chr1:8045045:A:G" "c.441A>G"
"chr1:8045045:A:G" "c.414A>G"
"chr1:17314702:C:T" "c.2775G>A"
"chr1:17314702:C:T" "c.2658G>A"
"chr1:17314702:C:T" "c.2790G>A" "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)" "single nucleotide variant" "HGNC:30213"
Upvotes: 1
Views: 93
Reputation: 5138
Here is another base R solution using mapply() with grepl():
idx <- mapply(function(x, y) !grepl(x, y, fixed = TRUE), mydf[, "c.change"], mydf[, "Clinvar_ Name"])
Or you can use stringi::stri_detect_fixed(), because it is vectorized over both string and pattern:
idx2 <- stringi::stri_detect_fixed(mydf[, "Clinvar_ Name"], mydf[, "c.change"], negate = TRUE)
identical(unname(idx), idx2)
[1] TRUE
With either option, use the index to select the rows, pick the Clinvar columns, and assign NA to them. Also, keep in mind that a blank character string "" is different from a missing value NA in R. Hope this helps!
mydf[idx, 3:5] <- NA
mydf
VarID_build37 c.change Clinvar_ Name Clinvar_ Type Clinvar_ HGNC_ID
[1,] "chr1:8045045:A:G" "c.501A>G" "NM_007262.5(PARK7):c.501A>G (p.Ala167=)" "single nucleotide variant" "HGNC:16369"
[2,] "chr1:8045045:A:G" "c.441A>G" NA NA NA
[3,] "chr1:8045045:A:G" "c.414A>G" NA NA NA
[4,] "chr1:17314702:C:T" "c.2775G>A" NA NA NA
[5,] "chr1:17314702:C:T" "c.2658G>A" NA NA NA
[6,] "chr1:17314702:C:T" "c.2790G>A" "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)" "single nucleotide variant" "HGNC:30213"
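As a minimal sketch of the two points above, the Clinvar columns can also be picked by name with grepl("Clinvar", colnames(...)) instead of the hard-coded positions 3:5, and the difference between NA and "" can be checked with is.na(). The toy matrix m and its values are made up for illustration; only the column-name pattern comes from the question.

```r
# Toy character matrix laid out like mydf (all values are made up)
m <- matrix(
  c("v1", "v2",
    "c.1A>G", "c.2C>T",
    "NM_0(X):c.1A>G", "NM_0(Y):c.9C>T",
    "snv", "snv"),
  nrow = 2,
  dimnames = list(NULL, c("VarID", "c.change", "Clinvar_ Name", "Clinvar_ Type"))
)

# TRUE for rows whose c.change does NOT occur literally in Clinvar_ Name
idx <- unname(mapply(function(x, y) !grepl(x, y, fixed = TRUE),
                     m[, "c.change"], m[, "Clinvar_ Name"]))

# Select the Clinvar columns by name rather than by position
clinvar_cols <- grepl("Clinvar", colnames(m), fixed = TRUE)

# Same rows and columns, two different "empty" values
m_na <- m
m_na[idx, clinvar_cols] <- NA

m_blank <- m
m_blank[idx, clinvar_cols] <- ""

# NA is a missing value; "" is a present but empty string
is.na(m_na[2, "Clinvar_ Name"])
is.na(m_blank[2, "Clinvar_ Name"])
```

Selecting by name keeps the code working if the column order changes, which positional indexing would not.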
Upvotes: 0
Reputation: 4358
Here is a base R solution. (You can replace "" with NA if you prefer.)
mydf[,-(1:2)][!apply(mydf,1,function(x) grepl(x["c.change"], x["Clinvar_ Name"], fixed = TRUE)),] <- ""
VarID_build37 c.change Clinvar_ Name Clinvar_ Type Clinvar_ HGNC_ID
[1,] "chr1:8045045:A:G" "c.501A>G" "NM_007262.5(PARK7):c.501A>G (p.Ala167=)" "single nucleotide variant" "HGNC:16369"
[2,] "chr1:8045045:A:G" "c.441A>G" "" "" ""
[3,] "chr1:8045045:A:G" "c.414A>G" "" "" ""
[4,] "chr1:17314702:C:T" "c.2775G>A" "" "" ""
[5,] "chr1:17314702:C:T" "c.2658G>A" "" "" ""
[6,] "chr1:17314702:C:T" "c.2790G>A" "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)" "single nucleotide variant" "HGNC:30213"
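One thing to watch when matching c.change values with grepl(): by default the pattern is treated as a regular expression, so characters such as ".", "*" and "+" are not matched literally unless fixed = TRUE is set. The sketch below uses made-up strings (not from the question's data) to show how this can produce both false positives and false negatives. I believe "*" and "+" can appear in c. notation for UTR and intronic positions, but treat that as an assumption.

```r
# Made-up annotation strings, for illustration only
name_a <- "NM_0(X):c.501A>G"
name_b <- "NM_0(X):c.12+3A>G"

# False positive: as a regex, ".*" matches any run of characters,
# so "c.*1A>G" matches "c.501A>G" even though the text differs
regex_hit <- grepl("c.*1A>G", name_a)
fixed_hit <- grepl("c.*1A>G", name_a, fixed = TRUE)

# False negative: as a regex, "2+" means "one or more 2",
# so the literal "+" in the name is never matched
regex_miss <- grepl("c.12+3A>G", name_b)
fixed_ok   <- grepl("c.12+3A>G", name_b, fixed = TRUE)

c(regex_hit = regex_hit, fixed_hit = fixed_hit,
  regex_miss = regex_miss, fixed_ok = fixed_ok)
```

With the question's sample data both forms give the same result, so this only matters for values containing such characters.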
Upvotes: 3