Reputation: 2070
I have a data frame that holds two variables for a list of users: the number of posts and the number of threads opened by each user.
I'd like to test for a correlation between the two variables. Since the point is to test whether the more you post, the more threads you also open, and since the variables are not normally distributed, I opted for the Spearman correlation to assess the relationship.
To do this, I need to rank my users according to how many posts and threads they have made, and I am stuck at this point. My dataset is a data frame like:
> data
USER SUM(POSTS) SUM(THREADS)
u0 2 2
u1 4 2
u10 212 25
u100 7 1
u102 226 23
u103 1 1
u104 3 1
u105 7 1
u107 234 28
What I have tried so far is to order and find the mean for repeated values with:
p<-ave(order(data[,2]), data[,2])
t<-ave(order(data[,3]), data[,3])
If I got the procedure right, which I may not, I expect threads to be ranked like:
4.5 4.5 2 7.5 3 7.5 7.5 7.5 1
but my code produces this ranking:
5.500000 5.500000 6.000000 4.333333 1.000000 4.333333 5.000000 4.333333 9.000000
Any help more than welcome!
Best, Simone
Upvotes: 2
Views: 2458
Reputation: 2144
Per droopy's comments, you can try something like:
data[,-1] <- apply(data[,-1], 2, function (x) {rank(1/rank(x))})
data
# USER SUM.POSTS SUM.THREADS
# 1 u0 8.0 4.5
# 2 u1 6.0 4.5
# 3 u10 3.0 2.0
# 4 u100 4.5 7.5
# 5 u102 2.0 3.0
# 6 u103 9.0 7.5
# 7 u104 7.0 7.5
# 8 u105 4.5 7.5
# 9 u107 1.0 1.0
As you can see, rank() creates golf-style ranks, where the lowest value gets rank 1. I ranked the inverse, which appears to give the result you request. Hope this helps.
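Since rank(1/rank(x)) only reverses the direction of the ranks while keeping ties averaged, the same descending ranks can likely be obtained more directly by negating the values. Here is a minimal sketch, assuming the example data from the question and the column names R prints above (SUM.POSTS, SUM.THREADS); the variable names posts_rank, threads_rank, rho_raw and rho_ranks are only illustrative:

```r
# Rebuild the example data from the question.
data <- data.frame(
  USER        = c("u0", "u1", "u10", "u100", "u102", "u103", "u104", "u105", "u107"),
  SUM.POSTS   = c(2, 4, 212, 7, 226, 1, 3, 7, 234),
  SUM.THREADS = c(2, 2, 25, 1, 23, 1, 1, 1, 28)
)

# Descending ranks with ties averaged: negate the values, then rank.
# rank() uses ties.method = "average" by default.
posts_rank   <- rank(-data$SUM.POSTS)
threads_rank <- rank(-data$SUM.THREADS)
threads_rank
# 4.5 4.5 2.0 7.5 3.0 7.5 7.5 7.5 1.0

# Spearman's rho computed straight from the raw counts ...
rho_raw <- cor(data$SUM.POSTS, data$SUM.THREADS, method = "spearman")

# ... matches Pearson's correlation of the manually computed ranks.
rho_ranks <- cor(posts_rank, threads_rank)
isTRUE(all.equal(rho_raw, rho_ranks))
```

Note that cor(..., method = "spearman") ranks the values internally, so the manual ranking step is not needed just to get the coefficient. The direction of ranking does not change the result either, as long as both variables are ranked the same way.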
Upvotes: 2