zw324

Reputation: 27180

How could I make this R snippet faster and more R-ish?

Coming from various other languages, I find R powerful and intuitive, but I am not thrilled with its performance. So I decided to try to improve a snippet I wrote and learn how to write better R code.

Here's a function I wrote, trying to determine if a vector is binary-valued (two distinct values or just one value) or not:

isBinaryVector <- function(v) {
  if (length(v) == 0) {
    return (c(0, 1))
  }
  a <- v[1]
  b <- a
  lapply(v, function(x) { if (x != a && x != b) {if (a != b) { return (c()) } else { b = x }}})
  if (a < b) {
    return (c(a, b))
  } else {
    return (c(b, a))
  }
}

EDIT: This function is expected to look through a vector and return c() if it is not binary-valued, or c(a, b) if it is, where a is the smaller value and b is the larger one (if a == b, then just c(a, a)). E.g., for

  A B C
1 1 1 0
2 2 2 0
3 3 1 0

I will lapply this isBinaryVector and get:

$A
[1] 1 1

$B
[1] 1 1

$C
[1] 0 0

On a moderately sized dataset (about 1800 * 3500, 2/3 of them binary-valued) this takes about 15 seconds. The set contains only floating-point numbers.

Is there any way I could do this faster?

Thanks for any input!

Upvotes: 2

Views: 140

Answers (2)

Andrie

Reputation: 179428

You are essentially trying to write a function that returns TRUE if a vector has exactly two unique values, and FALSE otherwise.

Try this:

> dat <- data.frame(
+   A = 1:3,
+   B = c(1, 2, 1), 
+   C = 0
+ )
> 
> sapply(dat, function(x)length(unique(x))==2)
    A     B     C 
FALSE  TRUE FALSE 

Next, you want to get the min and max value. The function range does this. So:

> sapply(dat, range)
     A B C
[1,] 1 1 0
[2,] 3 2 0

And there you have all the ingredients to make a small function that is easy to understand and should be extremely quick, even on large amounts of data:

isBinary <- function(x)length(unique(x))==2

binaryValues <- function(x){
  if(isBinary(x)) range(x) else NA
}

sapply(dat, binaryValues)

$A
[1] NA

$B
[1] 1 2

$C
[1] NA
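
Note that the question also counts a single-valued column (such as C) as binary and expects an empty result rather than NA. Here is a minimal sketch of that variant. The name binaryValuesOrNull is hypothetical and not from the question, and an empty vector simply gives NULL here rather than the question's c(0, 1):

```r
# Hypothetical variant: accept one or two distinct values, return NULL otherwise.
binaryValuesOrNull <- function(x) {
  u <- unique(x)                 # distinct values only
  if (length(u) >= 1 && length(u) <= 2) {
    range(u)                     # c(min, max); c(a, a) when there is a single value
  } else {
    NULL                         # same as c(): not binary-valued (or empty input)
  }
}

dat <- data.frame(
  A = 1:3,
  B = c(1, 2, 1),
  C = 0
)

# lapply keeps the NULL entries, so every column still has a slot in the result.
res <- lapply(dat, binaryValuesOrNull)
```

With lapply the result stays a list, so columns that are not binary-valued can be dropped afterwards with something like Filter(Negate(is.null), res).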

Upvotes: 8

Justin

Reputation: 43255

This function returns TRUE or FALSE for vectors (or columns of a data frame):

is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}
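
The subtraction of sum(is.na(x)) means missing values are not counted as one of the two values. A small sketch of how that plays out, using made-up input vectors (the function body is repeated only so the snippet runs on its own):

```r
# Same definition as above, repeated so this snippet is self-contained.
is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}

# Illustrative inputs, not taken from the question's data.
with_na   <- is.binary(c(0, 1, NA, 1))  # two non-missing values plus an NA
one_value <- is.binary(c(1, NA, 1))     # only one non-missing value
three     <- is.binary(c(1, 2, 3))      # three distinct values
```

Here with_na should come out TRUE, while one_value and three should be FALSE, so unlike the function in the question a constant column is not treated as binary.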

Also take a look at this post

I'd use something like this to get the column indices:

bivalued <- apply(my.data.frame, 2, is.binary)

nominal <- my.data.frame[,!bivalued]
binary <- my.data.frame[,bivalued]

Sample data:

> my.data.frame <- data.frame(c(0,1), rnorm(100), c(5, 19), letters[1:5], c('a', 'b'))
> apply(my.data.frame, 2, is.binary)
     c.0..1.   rnorm.100.     c.5..19. letters.1.5.  c..a....b.. 
        TRUE        FALSE         TRUE        FALSE         TRUE 
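
One caveat worth hedging: apply first converts the data frame to a matrix, which turns every column into character strings when any column is non-numeric. A possible alternative, sketched here with a small made-up data frame (df and its column names are assumptions for illustration), is to go column by column with vapply and to subset with drop = FALSE so the result stays a data frame even when only one column is selected:

```r
# Same definition as above, repeated so this snippet is self-contained.
is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}

# Illustrative data, not from the question.
df <- data.frame(
  flag  = c(0, 1, 1, 0),
  score = c(1.5, 2.5, 3.5, 4.5),
  grp   = c("a", "b", "a", "b")
)

# vapply visits each column as-is (no matrix coercion) and
# guarantees one logical value per column.
bivalued <- vapply(df, is.binary, logical(1))

binary  <- df[, bivalued, drop = FALSE]   # columns with exactly two values
nominal <- df[, !bivalued, drop = FALSE]  # everything else
```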

Upvotes: 4
