Adamm
Adamm

Reputation: 2306

Select values when column names are stored as concatenated strings

It's hard to explain, so I'll start with an example. I have some numeric columns (A, B, C). The column 'tmp' contains variable names of the numeric columns as concatenated strings:

set.seed(100)
A <- floor(runif(5, min=0, max=10))
B <- floor(runif(5, min=0, max=10))
C <- floor(runif(5, min=0, max=10))
tmp <- c('A','B,C','C','A,B','A,B,C')

df <- data.frame(A,B,C,tmp)

  A B C   tmp
1 3 4 6     A
2 2 8 8   B,C
3 5 3 2     C
4 0 5 3   A,B
5 4 1 7 A,B,C

Now, for each row, I want to use the variable names in tmp to select the values from the corresponding numeric columns with the same name(s). Then I want to keep only the rows where all the selected values are less than or equal 3.

E.g. in the first row, tmp is A, and the corresponding value in column A is 3, i.e. keep this row.

Another example, in row 4, tmp is A,B. The corresponding values are A = 0 and B = 5. Thus, all selected values are not less than or equal 3, and this row is discarded.

Desired result:

  A B C   tmp
1 3 4 6     A
2 5 3 2     C

How can I perform such filtering?

Upvotes: 2

Views: 94

Answers (3)

Henrik
Henrik

Reputation: 67778

This answer has some similarities with @Roland's, but here we work with the data in a 'longer' format:

# create row index
df$ri = seq_len(nrow(df))

# split the concatenated column
l <- strsplit(df$tmp, ',')

# repeat each row of the data with the lengths of the split string,
# bind with individual strings
d = cbind(df[rep(1:nrow(df), lengths(l)), ], x = unlist(l))

# use match to grab values from corresponding columns
d$val <- d[cbind(seq(nrow(d)), match(d$x, names(d)))]

# for each original row 'ri', check if all values are <= 3. use result to index data frame
d[as.logical(ave(d$val, d$ri, FUN = function(x) all(x <= 3))), ]
#   A B C tmp ri x val
# 1 3 4 6   A  1 A   3
# 3 5 3 2   C  3 C   2

Upvotes: 1

Not sure if it works always (and probably isn't the best solution)... but it worked here:

library(dplyr)
library(tidyr)
library(stringr)

        List= vector("list")
    
    for (i in 1:length(df)){
      
      tmpT= as.vector(str_split(df$tmp[i], ",", simplify=TRUE))
      
      selec= df %>% 
        select(tmpT) %>%
        slice(which(row_number() == i)) %>% 
        filter_all(., all_vars(. <= 3)) %>% 
        unite(val, sep= ", ")
      
    if (nrow(selec) == 0) {
      tab= NA
    } else{ 
        tab= df[i,]
    }
      
    List[[i]] = tab
      
    }
    
    df2= do.call("rbind", List)

Upvotes: 1

Roland
Roland

Reputation: 132576

This is a bit more complicated than I like and there might be a more elegant solution, but here we go:

#split tmp
col <- strsplit(df[["tmp"]], ",")

#create an index matrix
inds <- do.call(rbind, Map(data.frame, row = seq_along(col), col = col))
inds$col <- match(inds$col, names(df))
inds <- as.matrix(inds)

#check
chk <- m <- as.matrix(df[, names(df) != "tmp"])
mode(chk) <- "logical"
chk[] <- NA
chk[inds] <- m[inds] <= 3
sel <- apply(chk, 1, prod, na.rm = TRUE)

df[as.logical(sel),]
#  A B C tmp
#1 3 4 6   A
#3 5 3 2   C  

Upvotes: 1

Related Questions