Reputation: 5935

Filtering a Data Frame based on Row Conditions

I thought of the following example to illustrate my question.

Suppose there are 5 balls:

Red
Blue
Green
Yellow
Orange

Normally, there are 5! = 120 ways these balls can be organized (n!). I can enumerate all these combinations below:

library(combinat)
library(dplyr)

my_list = c("Red", "Blue", "Green", "Yellow", "Orange")

d = permn(my_list)

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 120)) %>%
  setNames(paste0("col", 1:120))


all_combinations[,1:5]

    col1   col2   col3   col4   col5
1    Red    Red    Red    Red Orange
2   Blue   Blue   Blue Orange    Red
3  Green  Green Orange   Blue   Blue
4 Yellow Orange  Green  Green  Green
5 Orange Yellow Yellow Yellow Yellow

My Question:

Suppose I wanted to filter this list by the following conditions:

The "Red" Ball can either be in the first or second position (From Left to Right)
There must be at least 2 positions between the "Blue" Ball and the "Green" Ball
The "Yellow" Ball can not be in the last position

I then tried to filter the above data based on these 3 conditons:

# attempt to write first condition
    cond_1 <- all_combinations[which(all_combinations[1,]== "Red" || all_combinations[2,] == "Red"), ]

#not sure how to write the second condition
    
 # attempt to write the third condition   
    cond_3 <- data_frame_version[which(data_frame_version[5,] !== "Yellow" ), ]

# if everything worked, an "anti join" style statement could be written to remove "cond_1, cond_2, cond_3" from the original data?

But this is not working at all - the first and third condition return a data frame only containing 4 rows for all the columns.

Can someone please show me how to correctly filter "all_combinations" using the above 3 filters?

Note:

The following code can transpose the original data:

 library(data.table)

    tpose = transpose(all_combinations)

    df = tpose
    
#group every 5 rows by the same id to identify unique combinations

    bloc_len <- 5
    
    df$bloc <- 
        rep(seq(1, 1 + nrow(df) %/% bloc_len), each = bloc_len, length.out = nrow(df))
    
   
 head(df)

      V1     V2     V3     V4     V5 bloc
1    Red   Blue  Green Yellow Orange    1
2    Red   Blue  Green Orange Yellow    1
3    Red   Blue Orange  Green Yellow    1
4    Red Orange   Blue  Green Yellow    1
5 Orange    Red   Blue  Green Yellow    1
6 Orange    Red   Blue Yellow  Green    2

Upvotes: 2

Answers (6)

Henry

Reputation: 6784

I think you may find it easier to apply the three conditions if you consider the position of each ball rather than the ball in each position, as in

library(combinat)
my_list = c("Red", "Blue", "Green", "Yellow", "Orange")
d = permn(1:5)
md = matrix(unlist(d), ncol=5, byrow=TRUE)
colnames(md) = my_list

ok = md[, "Red"] <= 2 & 
     abs(md[, "Blue"] - md[, "Green"]) > 2 & 
     md[, "Yellow"] != 5
sum(ok)
# 10
md[ok, ] 
#      Red Blue Green Yellow Orange
# [1,]   1    2     5      3      4
# [2,]   1    5     2      3      4
# [3,]   1    5     2      4      3
# [4,]   1    2     5      4      3
# [5,]   2    4     1      3      5
# [6,]   2    1     4      3      5
# [7,]   2    1     5      4      3
# [8,]   2    5     1      4      3
# [9,]   2    5     1      3      4
#[10,]   2    1     5      3      4

Upvotes: 1

D.J

Reputation: 1374

Maybe I misread the question but as I see it, none of the answers seem to show a solution where there are 2 columns between the colors in step 2 of the question.

I took the liberty of testing the data and found that only if you use "Yellow" and "Orange" you can find a filter condition doing what you want (as far as I understand it).

This is not a general answer and it is not actually correct because "Yellow" is on the last row, violating the rule but:

To have distance of 2 between your colors while the last row is already accounted for reduces the problem to a 4 column problem. So a distance of 2 can only be achieved between column 1 and column 4. This leads to 4 assumptions:

Column 1 would need to be either "Green" or "Blue"
Column 2 would need to be "Red"
Column 3 should be neither "Green" nor "Blue"
Column 4 should be again either "Green" or "Blue" but not Column 1

This is the code I came up with, not beautiful and as explained "Green" and "Blue" switched to "Yellow" and "Orange" but else I think this works.

library(combinat)
library(tidyverse)

my_list = c("Red", "Blue", "Green", "Yellow", "Orange")

d = permn(my_list)

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 5)) %>%
  setNames(paste0("col", 1:5))

`%!in%` <- Negate(`%in%`)

combis <- all_combinations %>% 
  filter(col1 %in% c("Yellow", "Orange"), 
         col2 == "Red", 
         !col3 %in% c("Yellow", "Orange"), 
         col5 == "Yellow") 

results <- vector()
for(i in seq_along(combis[,1])){
  
  if(combis[i,][1] %!in% c(combis[i,][4], "Red", "Green", "Blue")){
    results <- combis[i,] 
  }
}

results

    col1 col2  col3   col4   col5
3 Yellow  Red Green Orange Yellow

Upvotes: 1

kybazzi

Reputation: 1030

Here's a scalable tidyverse solution.

First, let's make the data a tibble of 120 rows, one for each combination of balls.

library(tidyverse)
library(combinat)
data = my_list %>% 
  permn() %>%
  map(~ set_names(.x, paste0("ball", 1:5))) %>%
  do.call(bind_rows, args = .) %>%
  mutate(id = row_number())

Our data:

# A tibble: 120 x 6
   ball1  ball2  ball3  ball4  ball5     id
   <chr>  <chr>  <chr>  <chr>  <chr>  <int>
 1 Red    Blue   Green  Yellow Orange     1
 2 Red    Blue   Green  Orange Yellow     2
 3 Red    Blue   Orange Green  Yellow     3
 4 Red    Orange Blue   Green  Yellow     4
 5 Orange Red    Blue   Green  Yellow     5
 6 Orange Red    Blue   Yellow Green      6
 7 Red    Orange Blue   Yellow Green      7
 8 Red    Blue   Orange Yellow Green      8
 9 Red    Blue   Yellow Orange Green      9
10 Red    Blue   Yellow Green  Orange    10
# ... with 110 more rows

The key idea of this solution is to transform the data into long format. This will make it trivial to check for each condition. Afterwards we can bring it back to wide format.

data %>%
  pivot_longer(-id) %>%
  mutate(ball_number = as.numeric(str_extract(name, "[1-5]"))) %>%
  group_by(id) %>%
  filter(
    # Condition 1
    ball_number[value == "Red"] %in% c(1, 2),
    # Condition 2
    abs(ball_number[value == "Blue"] - ball_number[value == "Green"]) >= 3,
    # Condition 3
    ball_number[value == "Yellow"] != 5
  ) %>%
  select(-ball_number) %>% 
  pivot_wider(values_from = "value", names_from = "name")

The output shows that there are 10 permutations:

# A tibble: 10 x 6
# Groups:   id [10]
      id ball1 ball2 ball3  ball4  ball5 
   <int> <chr> <chr> <chr>  <chr>  <chr> 
 1     8 Red   Blue  Orange Yellow Green 
 2     9 Red   Blue  Yellow Orange Green 
 3    32 Red   Green Yellow Orange Blue  
 4    33 Red   Green Orange Yellow Blue  
 5    48 Green Red   Orange Yellow Blue  
 6    49 Green Red   Yellow Orange Blue  
 7    50 Green Red   Yellow Blue   Orange
 8   111 Blue  Red   Yellow Green  Orange
 9   112 Blue  Red   Yellow Orange Green 
10   113 Blue  Red   Orange Yellow Green

An improvement this solution offers is that all the conditions you want to check for are very straightforward because of our variable ball_number. You could easily scale this solution to more complex conditions if there were more balls, such as the first 5 balls being red, or the blue ball plus green ball being equal to 7.

Upvotes: 1

Necroticka

Reputation: 285

If you are not so concerned about the data.frame structure, my preferred approach would be to retain each outcome as the member of a list (i.e. your d variable) and sapply() that with a function which checks that all of the conditions are satisfied for that outcome.

Observe:

library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")
my_list_perm <- combinat::permn(my_list) 

# This function examines one particular outcome of the trial, e.g. outcome = ["Blue", "Orange", "Red", "Green", "Yellow"]
test_conditions <- function(outcome) {
  
  # Condition 1
  condition_1 <- "Red" %in% outcome[c(1,2)]
  
  # Condition 2
  condition_2 <- base::abs(base::which(outcome == "Blue") - base::which(outcome == "Green")) >= 2
  
  # Condition 3
  condition_3 <- base::which(outcome == "Yellow") != base::length(outcome)
  
  all <- condition_1 && condition_2 && condition_3
  
  return(all)
}

my_list_matches <- base::which(base::sapply(my_list_perm, test_conditions)) # applies the function to each list element (which itself is an outcome)

print(my_list_matches) # displays which trials / outcomes satisfied all conditions

#>  [1]   6   7   8   9  10  12  19  22  29  31  32  33  34  35  41  48  49  50 111 112 113 120

^{Created on 2022-01-04 by the reprex package (v1.0.0)}

You can then use the matching indices to filter on the original list.

Upvotes: 1

Vishal A.

Reputation: 1391

Here's what you can do. I know it's not the prettiest and most optimized solution you'll find out there. But it works!

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 5)) %>%
  setNames(paste0("col", 1:5))

cond_1 <- all_combinations %>%
  filter(col1 == "Red" | col2 == "Red")


cond_2 <- cond_1 %>%
    filter(col1 == "Blue" | col1 == "Green" |
             col2 == "Blue" | col2 == "Green" |
             col3 == "Blue" | col3 == "Green" |
             col4 == "Blue" | col4 == "Green" |
             col5 == "Blue" | col5 == "Green")

cond_2 <- cond_2 %>%
  mutate(cond = ifelse(col1 == 'Blue' & col4 == 'Green', 2, NA) |
           ifelse(col1 == 'Blue' & col5 == 'Green', 3, NA) |
           ifelse(col2 == 'Blue' & col5 == 'Green', 2, NA) |
           ifelse(col1 == 'Green' & col4 == 'Blue', 2, NA) |
           ifelse(col2 == 'Green' & col5 == 'Blue', 3, NA)) %>%
  filter(cond == T)


cond_3 <- cond_2%>%
  filter(col5 != "Yellow")

Output:

  col1 col2   col3  col4 col5 cond
1 Blue  Red Orange Green  Red TRUE

Upvotes: 1

deschen

Reputation: 11016

You can do:

library(tidyverse)
tpose %>%
  mutate(blue_delete = case_when(V1 == "Blue" & V2 == "Green" ~ TRUE,
                                 V1 == "Blue" & V3 == "Green" ~ TRUE,
                                 V2 == "Blue" & V3 == "Green" ~ TRUE,
                                 V3 == "Blue" & V4 == "Green" ~ TRUE,
                                 V4 == "Blue" & V5 == "Green" ~ TRUE,
                                 TRUE ~ FALSE)) %>%
  filter(V3 != "Red" & V4 != "Red" & V5 != "Red",
         V5 != "Yellow",
         blue_delete == FALSE) %>%
  select(-blue_delete)

Upvotes: 1

Filtering a Data Frame based on Row Conditions

Answers (6)

Related Questions