Reputation: 2341

Removing many entries from a vector of strings by name

I would like to remove about 100 entries from a vector of 500 column names and then use that vector to set rows of a (prediction) matrix m to zero.

As a very simple example of my dataframe:

First I put the column names into a vector:

x <- colnames(df) # x <- c("A","B","C","D","E","F","G","H","I","J")

Let's say I would like to remove B through D, F, and G through I (in reality about 100 variables scattered over the vector, whose indices I do not know). I would like to do something like:

remove <- c(B:D, F, G:I) # This does not work, obviously
x[!x %in% remove]

Which would leave me with a vector x as follows:

A
E
J

This vector holds the row names (which are also the column names, because it is a prediction matrix) that need to be set to zero:

m[x,] <- 0

Creating the following output:

  A B C D E F G H
A 1 0 1 0 1 0 1 0
B 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0
E 1 0 1 0 1 0 1 0
F 1 0 1 0 1 0 1 0
G 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0
J 1 0 1 0 1 0 1 0

How can I remove these 100 variable names from the vector of all variable names and use that vector to refer to the row or column names of a matrix?

Upvotes: 0

Answers (3)

Tom

Reputation: 2341

I got it to work using hrbrmstr's answer and a long workaround. If anyone can tell me how to do this in a less messy way, please let me know.

# Copy prediction matrix and turn it into a dataframe for the "remove rows" function
varlist <- m
varlist <- as.data.frame(varlist)

# Create a column called "cat" with the rownames for the "remove rows" function
varlist$cat = rownames(varlist)
# Use the function to remove the rows from the copied df
varlist <- remove_rows(varlist, cat, ~B:C+F+G:I)
# Only keep the "cat" column, which is already a character vector
# (a second varlist[['cat']] on that vector would fail: subscript out of bounds)
varlist <- varlist$cat
# Copy prediction matrix and use "varlist" to set the matching columns to zero.
m_reduced <- m
m_reduced[, varlist] <- 0

I would be REALLY happy if someone could tell me how to clean up this monstrosity.
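
One possibly less messy route is to skip the data frame round trip and work on the names directly. This is only a sketch in base R: `expand_names` is a hypothetical helper (not from any package), the spec strings like "B:D" are my own convention, and it assumes that no real name contains a ":" and that every name occurs once. The toy matrix stands in for the real prediction matrix.

```r
# Hypothetical helper: expand specs such as "B:D" into every name between
# the two endpoints, following the order of `all_names`.
expand_names <- function(all_names, spec) {
  pieces <- lapply(strsplit(spec, ":", fixed = TRUE), function(p) {
    idx <- match(p, all_names)            # position of each endpoint
    if (anyNA(idx)) stop("unknown name in spec: ", paste(p, collapse = ":"))
    all_names[idx[1]:idx[length(idx)]]    # a single name or the full range
  })
  unique(unlist(pieces))
}

x <- LETTERS[1:10]
to_drop <- expand_names(x, c("B:D", "F", "G:I"))
keep <- setdiff(x, to_drop)               # the names that survive

# Toy 10 x 10 matrix standing in for the prediction matrix
m <- matrix(1, nrow = 10, ncol = 10, dimnames = list(x, x))
m_reduced <- m
m_reduced[to_drop, ] <- 0                 # zero rows by name
```

If it is the remaining names whose rows should be cleared, index with `keep` instead of `to_drop`; for columns, move the index to the second position.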

Upvotes: 1

Manos Papadakis

Reputation: 593

Here is my way:

remove<-function(lets_to_be_removed,names){
    letters_with_names<-1:length(LETTERS) # each value corresponds to a letter
    names(letters_with_names)<-LETTERS # the letters, for example: letters_with_name["A"]==1 is TRUE
    result<-integer()
    for(letters in lets_to_be_removed){
        #check if it is only one letter
        res <- if(length(letters) == 1) letters_with_names[letters] else letters_with_names[letters[1]]:letters_with_names[letters[2]] 
        result<- c(result,res)
    }
    names(result)<-LETTERS[result]
    result #return the indices of the letters
}

You can call it like this:

letters <- list(c("B","D"),"F",c("G","I"))
letters
[[1]]
[1] "B" "D" # B:D sequence
[[2]]
[1] "F" # only one letter
[[3]]
[1] "G" "I" # G:I sequence

indices<-remove(letters,x)
indices # named vector
B C D F G H I 
2 3 4 6 7 8 9

x[ -indices ] # it is faster than [! x %in% indices] but if you want your method  then use [! x %in% names(indices)]
[1] "A" "E" "J"

In general, it is better and faster to index with integers than with character names.
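
One caveat with negative integer indexing is worth a small illustration: when the index vector is empty, `x[-indices]` drops everything rather than nothing, whereas the `%in%` form keeps everything. The snippet below is a sketch with made-up variable names (`empty`, `safe`), not part of the function above.

```r
x <- LETTERS[1:10]

# Normal case: both forms agree
indices <- match(c("B", "C"), x)
by_index <- x[-indices]
by_name  <- x[!x %in% c("B", "C")]

# Edge case: nothing to remove
empty <- integer(0)
dropped_all <- x[-empty]                       # zero-length result
kept_all    <- x[!x %in% character(0)]         # all ten names

# A simple guard when the index vector may be empty
safe <- if (length(empty) > 0) x[-empty] else x
```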

Upvotes: 0

hrbrmstr

Reputation: 78792

Intriguing use-case. We can craft a function that will help you do this in the generic way you seem to desire.

NOTE:

I used a data frame below b/c I don't think there was (or I just missed it) a matrix mention initially, and now various question edits are confusing column and row names. So the bits you should focus on from below are:

# get the terms of the formula
trms <- terms(remove_spec)

# get each element (will be each group separated by `+`)
elements <- attr(trms, "term.labels")

# adding in assertions to validate `col` is in `xdf` and that only
# the restricted syntax is used in the formula and that it's valid 
# is up to the OP

# now, find the positions of all those strings
unlist(lapply(elements, function(y) {
  if (grepl(":", y)) {
    rng <- strsplit(y, ":")[[1]]
    which(x[,col] == rng[1]) : which(x[,col] == rng[2])
  } else {
    which(x[,col] == y)
  }
}), use.names = FALSE) -> to_exclude

as I'm kinda now done with this q (and rownames are so 1980s :-). Note the caveats at the end of the answer.

Others should feel free to use this in an actual matrix answer for the OP's use-case.
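
To make the formula bits above a little more concrete, here is a small sketch of what `terms()` hands back for a spec like the one used in this answer. The variable names `spec`, `elements` and `pieces` are just for illustration. As far as I know, with the default `keep.order = FALSE` the single names are listed before the `:` terms, which does not matter here because all positions end up pooled into one exclusion vector.

```r
spec <- ~ B:C + F + G:I

# terms() parses the formula without evaluating B, C, F, ... as variables
elements <- attr(terms(spec), "term.labels")
elements

# Each element is either a single name or "start:end"
pieces <- strsplit(elements, ":", fixed = TRUE)
is_range <- lengths(pieces) == 2
```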

We'll craft some simulated data (that way I can make the example larger if you want a bigger example):

library(dplyr) # mostly for saner data frame constructor & printing

set.seed(2018-11-18)

tibble( # data_frame() is deprecated in favor of tibble(), which dplyr re-exports
  cat = LETTERS,
  val1 = sample(100, length(cat), replace = TRUE),
  val2 = sample(100, length(cat), replace = TRUE),
  val3 = sample(100, length(cat), replace = TRUE)
) -> xdf

xdf
## # A tibble: 26 x 4
##    cat    val1  val2  val3
##    <chr> <int> <int> <int>
##  1 A        87    98     5
##  2 B        30    69    39
##  3 C        87     1    32
##  4 D        65    46    87
##  5 E         4    69     6
##  6 F        53    20    31
##  7 G        43    51    84
##  8 H        27    43    65
##  9 I        27     9    10
## 10 J        10    94    11
## # ... with 16 more rows

(tibble printing is def >> base printing IMO, but I digress).

Now, you want to use strings to specify both individual elements and ranges and have something that figures out what to do under the covers. We'll need a function for that, and we can take advantage of a special R class, formula, to help with a more compact syntax. I.e., wouldn't it be nice to be able to call a function like this:

remove_rows(xdf, cat, ~B:C+F+G:I)

which would seek out the range of "B":"C" in the cat column in xdf, find the position of "F" and then the range of "G":"I" and return a data frame with those excluded? Yes, yes it would. So, let's build it!

#' @param x data frame
#' @param col bare column name to use for the comparison
#' @param remove_spec one-sided formula; restricted operators are `:` for range and `+` for adding selectors
remove_rows <- function(x, col, remove_spec) {

  # this is pure convenience we could just as easily have forced folks 
  # to pass in a string (and we can modify it to handle both)
  col <- as.character(substitute(col)) 

  # get the terms of the formula
  trms <- terms(remove_spec)

  # get each element (will be each group separated by `+`)
  elements <- attr(trms, "term.labels")

  # adding in assertions to validate `col` is in `xdf` and that only
  # the restricted syntax is used in the formula and that it's valid 
  # is up to the OP

  # now, find the positions of all those strings
  unlist(lapply(elements, function(y) {
    if (grepl(":", y)) {
      rng <- strsplit(y, ":")[[1]]
      which(x[,col] == rng[1]) : which(x[,col] == rng[2])
    } else {
      which(x[,col] == y)
    }
  }), use.names = FALSE) -> to_exclude

  # and get rid of those puppies
  x[-to_exclude,]

}

Now we can call it for reals:

remove_rows(xdf, cat, ~B:C+F+G:I)
## # A tibble: 20 x 4
##    cat    val1  val2  val3
##    <chr> <int> <int> <int>
##  1 A        87    98     5
##  2 D        65    46    87
##  3 E         4    69     6
##  4 J        10    94    11
##  5 K        37    86    52
##  6 L        89    64    44
##  7 M        61    10    28
##  8 N        79    52    89
##  9 O        71    33    77
## 10 P        45    33    77
## 11 Q        56    97    29
## 12 R        10    28    39
## 13 S        25     7    71
## 14 T        86    57    51
## 15 U        92     2    15
## 16 V        25    36    12
## 17 W        90    78    10
## 18 X        20    82    90
## 19 Y        39    84    13
## 20 Z        43    93    18

The function is named poorly so you may want to change that and you really should add in some parameter checking and validation, but I believe this does what you want (assuming you are really sure the data frame is in the order you believe it is).

Also, this is imperfect in that the strings must be valid in a formula (one of the constraints is that they can't begin with a number without backtick quoting). But you didn't provide a sample of the real strings.
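
As a rough idea of what the parameter checking mentioned above could look like, here is a sketch of a checker for a single element of the spec. `check_range` is a hypothetical helper, not part of `remove_rows()` as written; it assumes the column is a plain character vector and treats a repeated endpoint as an error, since the range would be ambiguous.

```r
# Hypothetical checker: returns the positions covered by one spec element,
# or stops with a message if the element cannot be resolved safely.
check_range <- function(values, from, to = from) {
  ends <- c(from, to)
  pos <- match(ends, values)
  if (anyNA(pos)) {
    stop("not found: ", paste(ends[is.na(pos)], collapse = ", "))
  }
  if (anyDuplicated(values[values %in% ends]) > 0) {
    stop("an endpoint appears more than once")
  }
  if (pos[1] > pos[2]) {
    stop("range runs backwards: ", from, ":", to)
  }
  pos[1]:pos[2]
}

range_pos  <- check_range(LETTERS, "B", "D")   # positions of B through D
single_pos <- check_range(LETTERS, "F")        # a single name
```

Inside `remove_rows()`, something like this could replace the two `which()` branches, but that wiring is left out here.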

Upvotes: 1
