Reputation: 351
I have this problem I was hoping someone could help.
I have a very large data frame (close to 20000000 observations)in R with about 43 columns, in four of those columns I need to find whether there is more than one equal minimum value below 200, then if we have rows where more than one column has the same value that meets this criteria I need to flag that row to TRUE (in a a new flag column). Please note that those columns include NA values, and NA
s should not be used (when NA is present in the columns being compared, returns NA)
the goal is to look up the values in each row for columns a1 to a4 and find whether the minimum value that does not exceed 200, occurs in more than one column per row
for simplicity let's say that this is how my data data look like
head(mydata)
t1 a1 a2 a3 a4
34 NA NA NA NA
26 10 15 250 150
34 20 20 100 30
35 5 5 10 5
25 45 100 3 45
31 400 310 500 310
")
the goal is to look up the values in each row for columns a1 to a4 and find whether the minimum value that does not exceed 200, occurs in more than one column per row, if it does return true if not, false
the expected result will look like this
head(mydata)
t1 a1 a2 a3 a4 flag
34 NA NA NA NA NA
26 10 15 250 150 FALSE
34 20 20 100 30 TRUE
35 5 5 10 5 TRUE
25 45 100 3 45 FALSE
31 400 310 500 310 FALSE
")
Thank you in advance.
Upvotes: 2
Views: 575
Reputation: 76402
If you have a large dataset, the following might be fast. It uses package matrixStats
, function rowMins
. See this answer.
icol <- grepl("^a", names(mydata))
min_row <- matrixStats::rowMins(as.matrix(mydata[icol]))
mydata$flag <- rowSums(mydata[icol] == min_row) > 1 & min_row < 200
Upvotes: 1
Reputation: 39858
One possibility involving dplyr
and purrr
:
df %>%
mutate(flag = exec(pmin, !!!.[-1]),
flag = rowSums(.[-1] == flag) > 1 & flag < 200)
t1 a1 a2 a3 a4 flag
1 34 NA NA NA NA NA
2 26 10 15 250 150 FALSE
3 34 20 20 100 30 TRUE
4 35 5 5 10 5 TRUE
5 25 45 100 3 45 FALSE
6 31 400 310 500 310 FALSE
Here it checks whether the occurrence of row-wise minimum is greater than 1 and whether the row-wise minimum is below 200.
Upvotes: 0
Reputation: 12074
Here's a purrr
solution. I create the data frame.
# Define data frame
df <- read.table(text = " t1 a1 a2 a3 a4
34 NA NA NA NA
26 10 15 250 150
34 20 20 100 30
35 5 5 10 5
25 45 100 3 45
31 400 310 500 310 ", header = TRUE)
Next, I load the library.
# Load library
library(purrr)
Then, I create the flag, running through each row using pmap_lgl
, which returns a logical. This line checks if there is more than one minimum value and that the minimum is below 200. The first column is omitted from the calculation. If there are NA
values in each row, an NA
will be created.
# Create flag
df$flag <- pmap_lgl(df, function(...)(sum(c(...)[-1] == min(c(...)[-1])) > 1) & min(c(...)[-1]) < 200)
This gives the following:
# Examine result
df
#> t1 a1 a2 a3 a4 flag
#> 1 34 NA NA NA NA NA
#> 2 26 10 15 250 150 FALSE
#> 3 34 20 20 100 30 TRUE
#> 4 35 5 5 10 5 TRUE
#> 5 25 45 100 3 45 FALSE
#> 6 31 400 310 500 310 FALSE
Created on 2019-05-31 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 511
Does this help you?:
mydata$flag=apply(mydata,1,function(x){ # iterate through rows
x=na.omit(x); # omit NAs in a row (optional)
tab=table(x[x<200]); # count numbers of all row values below 200
if(any(tab>1)){ # check if any values are not unique
return(TRUE)
}else{
return(FALSE)
}})
You can choose to include or omit NA
values or not.
Upvotes: 2
Reputation: 388982
Here is a base R way to do it
#Get the column indices where a1, a2, a3 and a4 are there
inds <- match(paste0("a", 1:4), names(df))
#Get row-wise minimum
min_val <- do.call(pmin, df[inds])
#Check if there are more than one occurrence of minimum value
# and if minimum value is less than 200.
df$flag <- rowSums(df[inds] == min_val) > 1 & min_val < 200
df
# t1 a1 a2 a3 a4 flag
#1 34 NA NA NA NA NA
#2 26 10 15 250 150 FALSE
#3 34 20 20 100 30 TRUE
#4 35 5 5 10 5 TRUE
#5 25 45 100 3 45 FALSE
#6 31 400 310 500 310 FALSE
Upvotes: 3