Reputation: 574
Let's assume the following dataset:
+---------------+-----------+---------------------+
| flightCarrier | saleTotal | daysBeforeDeparture |
+---------------+-----------+---------------------+
| KL            | 477.99    | 0                   |
| AF            | 457.99    | 0                   |
| SQ            | 556.31    | 0                   |
+---------------+-----------+---------------------+
What I'd like to do is the following:
Workorder:
What I've tried so far:
cal <- apply(df_matrix[1:2, 2], 1, function(x) {
  A <- x
  x <- x[-1]
  ifelse(x > A, 1, ifelse(x < A, 0, NA))
})
cal
This didn't work out and prints "logical(0)", so I guess there are no results. I tried lots of ways, with lapply and mapply, but they all seemed to end up comparing static numbers instead of previous rows.
What I've grasped from apply so far is that each x is the row it is "iterating" over. That's why I tried to compare x > A, where A is the whole vector with all saleTotal values, thus iterating through each one.
Expected business output: "Price is cheaper than XY other prices".
(I guess this would be the best way to avoid large matrices and keep memory as low as possible.) Is there maybe a way to "nrow()" the results directly rather than creating a matrix/list first?
Any idea how to do this? And what about performance? I have 100,000+ rows.
EDIT: expected output (one way):
+-----------+-------------+
| saleTotal | cheaperThan |
+-----------+-------------+
| 477.99    | 1           |
| 457.99    | 2           |
| 556.31    | 0           |
+-----------+-------------+
Upvotes: 3
Views: 150
Reputation: 70256
You can use ?outer
like this:
outer(df$saleTotal, df$saleTotal, "/")
# [,1] [,2] [,3]
#[1,] 1.0000000 1.043669 0.8592152
#[2,] 0.9581581 1.000000 0.8232640
#[3,] 1.1638528 1.214677 1.0000000
Values greater than 1 indicate an increase, values less than 1 indicate a decrease, and the diagonal of the matrix is all 1s since it compares each value to itself.
Of course you could modify this to only show values greater than 1, for example by using:
res <- outer(df$saleTotal, df$saleTotal, "/")
res * as.integer(res > 1)
# [,1] [,2] [,3]
#[1,] 0.000000 1.043669 0
#[2,] 0.000000 0.000000 0
#[3,] 1.163853 1.214677 0
Or, if you just want a logical matrix:
res > 1
# [,1] [,2] [,3]
#[1,] FALSE TRUE FALSE
#[2,] FALSE FALSE FALSE
#[3,] TRUE TRUE FALSE
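To get the question's cheaperThan count from this, one option (a small sketch reusing the df$saleTotal column from above) is to compare with "<" instead of dividing and then sum each row of the logical matrix:
# for each price, count how many of the other prices it is strictly below
cheaper <- rowSums(outer(df$saleTotal, df$saleTotal, "<"))
cheaper
# [1] 1 2 0
Equivalently, rowSums(res < 1) on the ratio matrix above gives the same counts.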
Upvotes: 5
Reputation: 15784
See the note at the end about efficiency.
With your expected output, you may iterate over each value, count (sum the TRUE values) how many times this value is cheaper than all the other values, and return a list to 'pair' the value with the count:
sapply(data[, 2], function(x) {
  # pair each price with the count of prices it is cheaper than
  list(x, sum(x < data[, 2]))
})
which gives the value/count pairs as columns of a matrix:
[,1] [,2] [,3]
[1,] 477.99 457.99 556.31
[2,] 1 2 0
In case you just wish to add a column to your existing dataset, this should do:
data$cheaperThan <- sapply(data[,2],function(x) sum(x < data[,2]))
Data used:
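A minimal sketch of the question's example table as a data frame (assuming the object name data used above):
# example data taken from the question's table
data <- data.frame(
  flightCarrier       = c("KL", "AF", "SQ"),
  saleTotal           = c(477.99, 457.99, 556.31),
  daysBeforeDeparture = c(0, 0, 0)
)
The timings below compare the sapply counting approach with the findInterval alternative suggested by @alexis_laz, first on a vector named large and then on test.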
> system.time(sapply(large,function(x) sum(x < large)))
   user  system elapsed
1.08 0.22 1.30
> system.time(length(large) - findInterval(large,sort(large)))
   user  system elapsed
0.01 0.00 0.01
@alexis_laz's solution is really, really, really more efficient:
> set.seed(123)
> test <- runif(50000)*100
> identical(sapply(test,function(x) sum(x < test)), (length(test) - findInterval(test,sort(test))))
[1] TRUE
> system.time(sapply(test,function(x) sum(x < test)))
   user  system elapsed
13.64 1.24 14.96
> system.time(length(test) - findInterval(test,sort(test)))
   user  system elapsed
0.01 0.00 0.02
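Applied back to the dataset, the faster approach could look like this (a sketch; column and object names as used above):
# count of prices strictly greater than each price = how many prices it is cheaper than
data$cheaperThan <- length(data$saleTotal) - findInterval(data$saleTotal, sort(data$saleTotal))
data$cheaperThan
# [1] 1 2 0
Like the sapply version, this counts strictly greater prices, so tied prices are not counted.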
Upvotes: 5