Reputation: 488

Remove Rows of Dataframe (Present in Both Columns)

I want to remove the rows of my dataframe in which two particular columns both contain the same string. I know that a row can be removed when a single column contains a particular string:

abs_pres_matrix[!grepl("BGC", abs_pres_matrix$Genome),]

In this case I have a dataframe such as:

  GC1      |   GC2    | Distance
BGC123       BGC23      0.5
BGC123       MBT_13     0.6
BGC134       MBT_13     0.5
BGC123       BGC134     0.6

Desired Output:

  GC1      |   GC2    | Distance
BGC123       MBT_13     0.6
BGC134       MBT_13     0.5

Hence, I want to remove the rows in which both columns contain the string "BGC".

Upvotes: 0

Answers (4)

Chris Ruehlemann

Reputation: 21442

This solution works in three steps: (i) each row is pasted into a single string using apply and paste0; (ii) the strings are searched for a repeated occurrence of the pattern BGC using a regex with a backreference (\\1); (iii) the rows that satisfy this condition are removed from the dataframe using -which (or, alternatively, just !):

df[-which(grepl("(BGC).*\\1", apply(df, 1, paste0, collapse = " "))),]
     GC1    GC2 Distance
2 BGC123 MBT_13      0.6
3 BGC134 MBT_13      0.5
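
The ! variant mentioned above can be sketched as follows. This is a minimal example, assuming the sample data from the question (typed in by hand here, with "BGC 134" read as "BGC134"); the names both_bgc, kept and none_flagged are only illustrative. It also shows why ! may be the safer choice: if no row is flagged, which() returns integer(0), and a negative index built from it selects zero rows instead of all of them.

```r
# Sample data assumed from the question (values typed in by hand).
df <- data.frame(
  GC1 = c("BGC123", "BGC123", "BGC134", "BGC123"),
  GC2 = c("BGC23", "MBT_13", "MBT_13", "BGC134"),
  Distance = c(0.5, 0.6, 0.5, 0.6),
  stringsAsFactors = FALSE
)

# TRUE for rows whose pasted text contains "BGC" at least twice.
both_bgc <- grepl("(BGC).*\\1", apply(df, 1, paste0, collapse = " "))

# Logical negation keeps every row that is not flagged.
kept <- df[!both_bgc, ]

# Edge case: when nothing is flagged, -which() drops every row,
# whereas ! keeps every row.
none_flagged <- rep(FALSE, nrow(df))
rows_with_which <- nrow(df[-which(none_flagged), ])
rows_with_not <- nrow(df[!none_flagged, ])
```

With this data, kept should hold the two rows whose GC2 value is MBT_13.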

Upvotes: 1

NicolasH2

Reputation: 804

your data.frame:

df <- data.frame(  
     GC1 = c("BGC123","BGC123","BGC134","BGC123"), 
     GC2 = c("BGC123","MBT_13","MBT_13","BGC123"),  
     Distance = c(0.5, 0.6,  0.5, 0.6),  
     stringsAsFactors = FALSE
    )

if you just want to delete the rows with "BGC", just go for grepl:

df[!grepl("BGC", df$GC2) , ]
#or
subset(df, !grepl("BGC", df$GC2))

if you want to eliminate the rows where GC1 is exactly equal to GC2, you can use subset with a negated apply (without the !, only the matching rows would be kept):

subset(df, !apply(df, 1, function(x) x[1] %in% x[2]) )

Upvotes: 2

Edward

Reputation: 19394

library(dplyr)

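# keep rows where at least one GC column does not contain "BGC",
# i.e. drop rows where both GC columns contain it
df %>%
  filter_at(vars(starts_with("GC")), any_vars(!grepl("BGC", .)))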

Upvotes: 0

jay.sf

Reputation: 73602

Using grep.

abs_pres_matrix[!lengths(apply(abs_pres_matrix[, 1:2], 1, grep, pattern="BGC")) > 1,]
#      GC1    GC2 Distance
# 2 BGC123 MBT_13      0.6
# 3 BGC134 MBT_13      0.5
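
A possible variant without apply, shown here only as a sketch: it assumes the two columns are named GC1 and GC2 as in the question, rebuilds the sample data by hand, and uses the illustrative names in_gc1, in_gc2 and result. Each column is tested separately with grepl, and a row is dropped only when both tests are TRUE.

```r
# Sample data assumed from the question (values typed in by hand).
abs_pres_matrix <- data.frame(
  GC1 = c("BGC123", "BGC123", "BGC134", "BGC123"),
  GC2 = c("BGC23", "MBT_13", "MBT_13", "BGC134"),
  Distance = c(0.5, 0.6, 0.5, 0.6),
  stringsAsFactors = FALSE
)

# One logical vector per column: does the value contain "BGC"?
in_gc1 <- grepl("BGC", abs_pres_matrix$GC1)
in_gc2 <- grepl("BGC", abs_pres_matrix$GC2)

# Drop only the rows where both columns contain "BGC".
result <- abs_pres_matrix[!(in_gc1 & in_gc2), ]
```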

Upvotes: 1
