Yamuna_dhungana

Reputation: 663

How to delete grepl-unmatched column values in R

I have a character matrix called mydf. For each row, I want to check whether the c.change value is present in Clinvar_ Name. If it is not present, I want to delete the values in every column selected by grepl("Clinvar", colnames(mydf)).

This is my data:

mydf <- structure(c("chr1:8045045:A:G", "chr1:8045045:A:G", "chr1:8045045:A:G", 
"chr1:17314702:C:T", "chr1:17314702:C:T", "chr1:17314702:C:T", 
"c.501A>G", "c.441A>G", "c.414A>G", "c.2775G>A", "c.2658G>A", 
"c.2790G>A", "NM_007262.5(PARK7):c.501A>G (p.Ala167=)", "NM_007262.5(PARK7):c.501A>G (p.Ala167=)", 
"NM_007262.5(PARK7):c.501A>G (p.Ala167=)", "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)", 
"NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)", "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)", 
"single nucleotide variant", "single nucleotide variant", "single nucleotide variant", 
"single nucleotide variant", "single nucleotide variant", "single nucleotide variant", 
"HGNC:16369", "HGNC:16369", "HGNC:16369", "HGNC:30213", "HGNC:30213", 
"HGNC:30213"), .Dim = 6:5, .Dimnames = list(NULL, c("VarID_build37", 
"c.change", "Clinvar_ Name", "Clinvar_ Type", "Clinvar_ HGNC_ID"
)))

Result I want:

    VarID_build37       c.change    Clinvar_ Name                                Clinvar_ Type               Clinvar_ HGNC_ID
 "chr1:8045045:A:G"  "c.501A>G"  "NM_007262.5(PARK7):c.501A>G (p.Ala167=)"    "single nucleotide variant" "HGNC:16369"    
"chr1:8045045:A:G"  "c.441A>G"     
"chr1:8045045:A:G"  "c.414A>G"     
"chr1:17314702:C:T" "c.2775G>A" 
"chr1:17314702:C:T" "c.2658G>A" 
"chr1:17314702:C:T" "c.2790G>A" "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)" "single nucleotide variant" "HGNC:30213"  

Upvotes: 1

Views: 93

Answers (2)

Andrew

Reputation: 5138

Here is another Base R solution using mapply() with grepl():

idx <- mapply(function(x, y) !grepl(x, y, fixed = TRUE), mydf[, "c.change"], mydf[, "Clinvar_ Name"])

Or you can use stringi::stri_detect_fixed() because it is vectorized over string & pattern:

idx2 <- stringi::stri_detect_fixed(mydf[, "Clinvar_ Name"], mydf[, "c.change"], negate = TRUE)

identical(unname(idx), idx2)
[1] TRUE
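Both calls match the c.change text literally rather than as a regular expression. That matters because the values contain a ".", which a regex treats as "any character". A minimal sketch of the difference, using made-up strings (the lookalike name below is hypothetical and not from the question's data):

```r
# Hypothetical strings for illustration only
pattern   <- "c.501A>G"
lookalike <- "NM_000000.0(GENE):c1501A>G"

# Default regex matching: "." matches the "1" in "c1501A>G"
regex_hit <- grepl(pattern, lookalike)

# Literal matching: the text "c.501A>G" must appear exactly
fixed_hit <- grepl(pattern, lookalike, fixed = TRUE)

regex_hit  # TRUE
fixed_hit  # FALSE
```

On the data in the question both modes agree, so this is only a safeguard against false matches in larger inputs.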

With either option, use the index, select the columns, and assign NA to them. Also, keep in mind that a blank character string "" is different from a missing value NA in R. Hope this helps!

mydf[idx, 3:5] <- NA

mydf
     VarID_build37       c.change    Clinvar_ Name                                Clinvar_ Type               Clinvar_ HGNC_ID
[1,] "chr1:8045045:A:G"  "c.501A>G"  "NM_007262.5(PARK7):c.501A>G (p.Ala167=)"    "single nucleotide variant" "HGNC:16369"    
[2,] "chr1:8045045:A:G"  "c.441A>G"  NA                                           NA                          NA              
[3,] "chr1:8045045:A:G"  "c.414A>G"  NA                                           NA                          NA              
[4,] "chr1:17314702:C:T" "c.2775G>A" NA                                           NA                          NA              
[5,] "chr1:17314702:C:T" "c.2658G>A" NA                                           NA                          NA              
[6,] "chr1:17314702:C:T" "c.2790G>A" "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)" "single nucleotide variant" "HGNC:30213"    
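To make the "" versus NA distinction concrete, here is a small sketch on a toy vector (the values are illustrative only). It shows that is.na() only flags NA, and one possible way to convert between the two representations:

```r
# Toy vector: a real value, a blank string, and a missing value
x <- c("HGNC:16369", "", NA)

is.na(x)   # FALSE FALSE TRUE  (the blank string is not missing)
x == ""    # FALSE TRUE NA     (comparing NA yields NA)

# Blank strings to NA; %in% avoids NA in the logical index
as_na <- x
as_na[as_na %in% ""] <- NA

# NA to blank strings
as_blank <- x
as_blank[is.na(as_blank)] <- ""
```

Which one to use depends on what happens next: functions such as is.na() or complete.cases() only recognise NA, while "" tends to look cleaner when the table is written out to a file.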

Upvotes: 0

Daniel O

Reputation: 4358

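Here is a base R solution (you can replace "" with NA if you prefer). fixed = TRUE makes grepl() match the c.change text literally instead of treating it as a regular expression.

mydf[,-(1:2)][!apply(mydf,1,function(x) grepl(x["c.change"], x["Clinvar_ Name"], fixed = TRUE)),] <- ""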

    VarID_build37       c.change    Clinvar_ Name                                Clinvar_ Type               Clinvar_ HGNC_ID
[1,] "chr1:8045045:A:G"  "c.501A>G"  "NM_007262.5(PARK7):c.501A>G (p.Ala167=)"    "single nucleotide variant" "HGNC:16369"    
[2,] "chr1:8045045:A:G"  "c.441A>G"  ""                                           ""                          ""              
[3,] "chr1:8045045:A:G"  "c.414A>G"  ""                                           ""                          ""              
[4,] "chr1:17314702:C:T" "c.2775G>A" ""                                           ""                          ""              
[5,] "chr1:17314702:C:T" "c.2658G>A" ""                                           ""                          ""              
[6,] "chr1:17314702:C:T" "c.2790G>A" "NM_022089.4(ATP13A2):c.2790G>A (p.Ser930=)" "single nucleotide variant" "HGNC:30213" 
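If the Clinvar columns might move or grow in number, selecting them by name, as the question suggests with grepl("Clinvar", colnames(mydf)), avoids hard-coded positions. A minimal sketch on a toy matrix (the matrix m and its values are made up to mirror the layout of mydf):

```r
# Toy character matrix; matrix() fills column by column
m <- matrix(
  c("v1", "v2",
    "c.1A>G", "c.2C>T",
    "NM_1:c.1A>G (p.X=)", "NM_2:c.9C>T (p.Y=)",
    "snv", "snv"),
  nrow = 2,
  dimnames = list(NULL, c("VarID", "c.change", "Clinvar_ Name", "Clinvar_ Type"))
)

# Logical mask of the columns whose name contains "Clinvar"
clinvar_cols <- grepl("Clinvar", colnames(m), fixed = TRUE)

# Rows where c.change does NOT occur literally in Clinvar_ Name
unmatched <- !mapply(grepl, m[, "c.change"], m[, "Clinvar_ Name"],
                     MoreArgs = list(fixed = TRUE))

# Blank out only the Clinvar columns of the unmatched rows
m[unmatched, clinvar_cols] <- NA
m
```

Here row 1 keeps its Clinvar values and row 2 has them set to NA; swap NA for "" if empty strings are preferred.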

Upvotes: 3
