s1x

Reputation: 574

Compare a value in a vector with all other values in the vector

Let's assume the following dataset:

+---------------+-----------+---------------------+
| flightCarrier | saleTotal | daysBeforeDeparture |
+---------------+-----------+---------------------+
| KL            | 477.99    |                   0 |
| AF            | 457.99    |                   0 |
| SQ            | 556.31    |                   0 |
+---------------+-----------+---------------------+

What I'd like to do is the following:

  1. Compare a value in a column to all other values in the same column.
  2. Is saleTotal(1) smaller than the values of saleTotal(2) and saleTotal(3)?
  3. If yes, by how much? E.g. saleTotal(3)/saleTotal(1) (see the small worked example below the list).
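
For example, comparing row 1 with row 3 of the table above:

556.31 / 477.99   # saleTotal(3) / saleTotal(1)
# [1] 1.163853    -> row 3 is about 16.4% more expensive than row 1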

What I've tried so far:

cal <- apply(df_matrix[1:2, 2], 1, function(x) {
  A <- x                                  # the value(s) handed in by apply
  x <- x[-1]                              # drop the first element before comparing
  ifelse(x > A, 1, ifelse(x < A, 0, NA))  # 1 = larger, 0 = smaller, NA = equal
})
cal

This didn't work out and prints "logical(0)", so I guess there are no results. I tried lots of approaches, with lapply and mapply, but all of them ended up comparing static numbers instead of the other rows.

What I've grasped from apply so far is that each x is the row it is currently iterating over. That's why I tried to compare x > A, where A is the whole vector of saleTotal values, thus iterating through each of them.


Expected business output: "Price is cheaper than XY other prices"

I guess this would be the best way to avoid large matrices and keep memory usage as low as possible. Might there be a way to "nrow()" the results directly rather than creating a matrix/list first?

+-----------+-------------+
| saleTotal | cheaperThan |
+-----------+-------------+
| 477.99    |           1 |
| 457.99    |           2 |
| 556.31    |           0 |
+-----------+-------------+

Any idea how to do this? And what about performance? I have 100,000+ rows.

EDIT: the table above is the expected output (one way to represent it).

Upvotes: 3

Views: 150

Answers (2)

talat

Reputation: 70256

You can use ?outer like this:

outer(df$saleTotal, df$saleTotal, "/")
#          [,1]     [,2]      [,3]
#[1,] 1.0000000 1.043669 0.8592152
#[2,] 0.9581581 1.000000 0.8232640
#[3,] 1.1638528 1.214677 1.0000000

Values greater than 1 indicate an increase, values less than 1 indicate a decrease, and the diagonal of the matrix is all 1s since each value is compared to itself.

Of course you could modify this to only show values greater than 1, for example by using:

res <- outer(df$saleTotal, df$saleTotal, "/")
res * as.integer(res > 1)
#         [,1]     [,2] [,3]
#[1,] 0.000000 1.043669    0
#[2,] 0.000000 0.000000    0
#[3,] 1.163853 1.214677    0

Or, if you just want a logical matrix:

res > 1
#      [,1]  [,2]  [,3]
#[1,] FALSE  TRUE FALSE
#[2,] FALSE FALSE FALSE
#[3,]  TRUE  TRUE FALSE
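
If you want the cheaperThan count from the question's expected output, a small follow-up sketch using the same df: count, per row, how many ratios are below 1:

res <- outer(df$saleTotal, df$saleTotal, "/")
df$cheaperThan <- rowSums(res < 1)  # how many other prices this value is cheaper than
df$cheaperThan
# [1] 1 2 0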

Upvotes: 5

Tensibai

Reputation: 15784

See the note at the end about efficiency.

With your expected output, you may iterate over each value, count (by summing the TRUE values) how many times this value is cheaper than the other values, and return a list to 'pair' each value with its count:

sapply(data[,2],function(x) {
  list(x, sum(x < data[,2]))
})

which gives, in matrix form (values in the first row, counts in the second):

     [,1]   [,2]   [,3]  
[1,] 477.99 457.99 556.31
[2,] 1      2      0     

In case you just wish to add a column to your existing dataset, this should do:

data$cheaperThan <- sapply(data[,2],function(x) sum(x < data[,2])) 
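
On the question's data this should reproduce the expected column:

data$cheaperThan
# [1] 1 2 0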

Data used:
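
A minimal reconstruction of the data frame from the question's table (column types are assumed):

data <- data.frame(
  flightCarrier = c("KL", "AF", "SQ"),
  saleTotal = c(477.99, 457.99, 556.31),
  daysBeforeDeparture = c(0, 0, 0),
  stringsAsFactors = FALSE
)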

A quick timing of the sapply approach against findInterval on a vector named large:

> system.time(sapply(large,function(x) sum(x < large)))
utilisateur     système      écoulé 
       1.08        0.22        1.30 
> system.time(length(large) - findInterval(large,sort(large)))
utilisateur     système      écoulé 
       0.01        0.00        0.01 

@alexis_laz's solution is really, really, really more efficient. Here is a fully reproducible comparison on 50,000 values:

> set.seed(123)
> test <- runif(50000)*100
> identical(sapply(test,function(x) sum(x < test)), (length(test) - findInterval(test,sort(test))))
[1] TRUE
> system.time(sapply(test,function(x) sum(x < test)))
utilisateur     système      écoulé 
      13.64        1.24       14.96 
> system.time(length(test) - findInterval(test,sort(test)))
utilisateur     système      écoulé 
       0.01        0.00        0.02
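
For reference on why this works: findInterval(v, sort(v)) returns, for each value, how many values in the sorted vector are less than or equal to it, so length(v) - findInterval(v, sort(v)) counts how many values are strictly greater. A quick check on the question's saleTotal values:

v <- c(477.99, 457.99, 556.31)
length(v) - findInterval(v, sort(v))  # how many other prices each value is cheaper than
# [1] 1 2 0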

Upvotes: 5
