Workhorse
Workhorse

Reputation: 1560

Subset table based on strings present in multiple columns

I have a dataframe that looks like this

> df
               Col1                 Col2             P_value  
              Cell1                Cell2               0.001      
              Cell2                Cell1                0.05      
              Cell4                Cell1                0.01     
              Cell5                Cell2                0.03        
              Cell2                Cell3               0.008      
              Cell1                Cell4               0.008      

I want to subset to a new dataframe, in which only strings present in BOTH Col1 and Col2 in both orders. So here, Cell1 and Cell2, when matched, appear in both Col1 and Col2.

> df
               Col1                 Col2             P_value  
              Cell1                Cell2               0.001      
              Cell2                Cell1                0.05 
              Cell1                Cell4               0.008 
              Cell4                Cell1                0.01        

So here, Cell1 and Cell2, when matched, appear in both Col1 and Col2. Same for Cell1 and Cell4. This never happens for other strings.

Upvotes: 1

Views: 58

Answers (3)

PaulS
PaulS

Reputation: 25528

A possible solution, based on inner_join:

library(dplyr)

inner_join(df, df[-3], by = c("Col2" = "Col1", "Col1" = "Col2")) 

#>    Col1  Col2 P_value
#> 1 Cell1 Cell2   0.001
#> 2 Cell2 Cell1   0.050
#> 3 Cell4 Cell1   0.010
#> 4 Cell1 Cell4   0.008

Upvotes: 1

r2evans
r2evans

Reputation: 160952

Perhaps this is over-simplifying, but ...

df %>%
  filter(Col1 %in% Col2 & Col2 %in% Col1)
#    Col1  Col2 P_value
# 1 Cell1 Cell2   <0.05
# 2 Cell2 Cell1   <0.05
# 3 Cell4 Cell1   <0.05
# 4 Cell1 Cell4   <0.05

Upvotes: 1

akrun
akrun

Reputation: 887901

We may use

library(dplyr)
library(stringr)
df %>% 
  mutate(Col = str_c(pmin(Col1, Col2), pmax(Col1, Col2))) %>%
  filter(duplicated(Col)|duplicated(Col, fromLast = TRUE)) %>% 
  select(-Col)

Or may do

 df %>%
   add_count(pmin(Col1, Col2), pmax(Col1, Col2)) %>% 
   filter(n > 1) %>% 
   select(names(df))

-output

   Col1  Col2 P_value
1 Cell1 Cell2   <0.05
2 Cell2 Cell1   <0.05
3 Cell4 Cell1   <0.05
4 Cell1 Cell4   <0.05

data

df <- structure(list(Col1 = c("Cell1", "Cell2", "Cell4", "Cell5", "Cell2", 
"Cell1"), Col2 = c("Cell2", "Cell1", "Cell1", "Cell2", "Cell3", 
"Cell4"), P_value = c("<0.05", "<0.05", "<0.05", "<0.05", "<0.05", 
"<0.05")), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 1

Related Questions