Reputation: 2571

rank sum score calculation on a data.frame

I have a data.frame that looks like this:

 Name1    Name2    Name3   
   1        1         1    
  -1       -1         1
   1       -1         1   
   1       -1         1     
  -1       -1         1

I would like to perform a sort of rank-sum test on each column, as follows:

Starting from the first element of each column (that is, for each list in my data.frame), if the second element is equal to the first (e.g. 1 and 1), the score is increased by one unit because they are equal; otherwise (e.g. 1 and -1), the score is decreased by one unit because they are unequal.

Ex: column "Name1"
first element = 1 : score = 1 (starting position)
second element = -1: score = 0 (1 unit is removed from the previous score (1) because 1 != -1)
third element = 1 : score = 1 (the score reached 0, so it is reinitialized; every time it is reinitialized, the score is set to 1)
fourth element = 1 : score = 2 (previous score 1 plus 1 unit because the third and fourth elements are equal)
fifth element = -1: score = 1 (previous score 2 minus 1 unit because fourth element != fifth element)

column "Name2"
first element = 1 : score = 1 (starting position)
second element = -1: score = 0 (1 unit is removed from the previous score (1) because 1 != -1)
third element = -1: score = 1 (you are reinitializing the score)
fourth element = -1: score = 2 (third element is equal to the fourth so the previous score will be increased by 1 unit)
fifth element = -1: score = 3 (fourth element is equal to the fifth, so the previous score of 2 is increased by 1 unit)

So the counter increases or decreases the score by 1 depending on whether the element is equal to or different from the previous one, and the score is reinitialized to 1 every time it reaches 0.

The final goal is to give a higher score to equal, consecutive elements in the ranking than to randomly ordered ones.

Can anyone help me please?

Upvotes: 1

Answers (5)

DrDom

Reputation: 4123

Probably this will help.

dat <- read.table(header=TRUE, text="
 Name1    Name2    Name3   
   1        1         1    
  -1       -1         1
   1       -1         1   
   1       -1         1     
  -1       -1         1
")

f <- function(x) {
  tail(cumsum(x), 1)
}

sapply(dat, f)

#Name1 Name2 Name3 
#    1    -3     5

If you want to compare these results, you can take their absolute values.
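As a minimal sketch of that last step: the final value of `cumsum(x)` is just `sum(x)`, so the absolute values can be taken directly on the result of `sapply`. The data frame is rebuilt here only to keep the snippet self-contained, and `abs_scores` is a name chosen for illustration.

```r
# Same data as in the question, rebuilt so the snippet runs on its own
dat <- data.frame(Name1 = c(1, -1,  1,  1, -1),
                  Name2 = c(1, -1, -1, -1, -1),
                  Name3 = c(1,  1,  1,  1,  1))

# Last element of the cumulative sum (i.e. the column sum), then absolute value
abs_scores <- abs(sapply(dat, function(x) tail(cumsum(x), 1)))
abs_scores
# Name1 Name2 Name3
#     1     3     5
```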

Upvotes: 0

eddi

Reputation: 49448

This is an answer to your subsequent question and not the first one, which I believe Matthew Plourde has answered.

To get a measure of the rank you want, you could, for instance, sum the lengths of the runs in each column where the same number appears more than once in a row. E.g., in the example below you would add 3 and 2 and get a rank of 5.

x = c(1,-1,1,1,1,-1,-1)
rle(x)
#Run Length Encoding
#  lengths: int [1:4] 1 1 3 2
#  values : num [1:4] 1 -1 1 -1

To put it in a function:

rank = function(x) {
  x.rle = rle(x)
  sum(x.rle$lengths[x.rle$lengths > 1])
}

sapply(OP_dat, rank)
#Name1 Name2 Name3 
#    2     4     5

Upvotes: 1

IRTFM

Reputation: 263331

Add one to an equality test to construct an index of 1's and 2's, and use it to select from c(-1, 1):

func <- function(x) 1+                  # your "starting position"
                    sum( c(-1, 1)[1+    # convert from 0/1 to 1/2
                                  (x[-1] == x[-length(x)]) ])

> sapply(dat, func)
Name1 Name2 Name3 
   -1     3     5

Upvotes: 0

Matthew Plourde

Reputation: 44614

If I've understood you correctly...

d <- read.table(text="Name1    Name2    Name3   
   1        1         1    
  -1       -1         1
   1       -1         1   
   1       -1         1     
  -1       -1         1", header=TRUE)


f1 <- function(score, pair) {
    if (score == 0) pair[1]
    else if (as.logical(diff(pair))) score - 1
    else score + 1
}

f2 <- function(col) {
    lagged <- embed(col, 2)
    Reduce(f1, split(lagged, seq(nrow(lagged))), init=1)
}

lapply(d, f2)
# $Name1
# [1] 1
# 
# $Name2
# [1] -1
# 
# $Name3
# [1] 5
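Note that the result for Name2 is -1, whereas the worked example in the question ends at 3: when the score is 0, `f1` resets it to `pair[1]` (the current element, which may be -1) rather than to 1. Below is a minimal sketch that follows the question's worked examples step by step, assuming the reset always sets the score to 1 and that no comparison is made on the reset step. `score_with_reset` is a hypothetical helper name, not part of the answer above.

```r
# Hypothetical helper: running score that is reset to 1 after it reaches 0
score_with_reset <- function(x) {
  score <- 1                        # starting position
  for (i in seq_along(x)[-1]) {     # walk from the second element onwards
    if (score == 0) {
      score <- 1                    # reinitialize; assumed: no comparison here
    } else if (x[i] == x[i - 1]) {
      score <- score + 1            # equal to the previous element
    } else {
      score <- score - 1            # different from the previous element
    }
  }
  score
}

d <- data.frame(Name1 = c(1, -1,  1,  1, -1),
                Name2 = c(1, -1, -1, -1, -1),
                Name3 = c(1,  1,  1,  1,  1))

res <- sapply(d, score_with_reset)
res
# Name1 Name2 Name3
#     1     3     5
```

A plain loop is used instead of `Reduce` because the reset depends on the running score, which makes the step-by-step rule easier to read and check against the question.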

Upvotes: 2

Ferdinand.kraft

Reputation: 12819

Consider this function:

f <- function(x)
{
  2 * sum(tail(x, -1)==head(x, -1)) - length(x) + 1
}

It computes the score you propose as the number of elements that are equal to the previous one minus the number of elements that differ. Since this last number is complementary to the first, the function can be written in the simplified form above.
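The simplification can be checked on the first column of the question. There are `length(x) - 1` comparisons in total, so the number of differing neighbours is `length(x) - 1` minus the number of equal ones. The variable names `same`, `n_equal` and `n_differ` are introduced here for illustration only.

```r
x <- c(1, -1, 1, 1, -1)               # column Name1 from the question

same     <- tail(x, -1) == head(x, -1)  # TRUE where an element equals its predecessor
n_equal  <- sum(same)                   # 1 equal neighbour pair
n_differ <- sum(!same)                  # 3 differing neighbour pairs

n_equal - n_differ                      # -2, the score written out in full
2 * n_equal - (length(x) - 1)           # -2, the simplified form used in f
```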

Now if you want to apply that to all columns of a dataframe, simply use sapply:

dat <- read.table(header=TRUE, text="
 Name1    Name2    Name3   
   1        1         1    
  -1       -1         1
   1       -1         1   
   1       -1         1     
  -1       -1         1
")
sapply(dat, f)
# Name1 Name2 Name3 
#    -2     2     4

Upvotes: 0
