Reputation: 42080
Suppose I have a data frame with a column for values and another column for the number of times that value was observed:
x <- data.frame(value=c(1,2,3), count=c(4,2,1))
x
# value count
# 1 1 4
# 2 2 2
# 3 3 1
I know that I can get the weighted mean of the data using weighted.mean and the weighted median using the weighted.median function provided by several packages (e.g. limma), but how can I get other weighted statistics on my data, such as the 1st and 3rd quartiles, and maybe the standard deviation? "Expanding" the data using rep is not an option because sum(x$count) is about 3 billion (the size of the human genome).
Upvotes: 4
Views: 5844
Reputation: 42080
For completeness, I'll note that the S4Vectors package in Bioconductor provides an answer in the form of the "Rle" class, which lets you construct a run-length encoded vector that supports all the usual operations:
library(S4Vectors)
x <- data.frame(value=c(1,2,3), count=c(4,2,1))
y <- Rle(x$value, x$count)
mean(y)
median(y)
quantile(y)
Upvotes: 0
Reputation: 7251
To complete the answer by Prasad Chalasani, here is the code to compute the weighted median given a column for values and another column for the number of times that value was observed. Note that it uses the wtd.quantile function from the Hmisc package.
library(Hmisc)
x <- data.frame(value=c(1,2,3), count=c(4,2,1))
## value count
## 1 1 4
## 2 2 2
## 3 3 1
wtd.quantile(x$value, x$count, probs = 0.5)
## 50%
## 1
Upvotes: 1
Reputation: 36110
Or back-transform the counts into the full vector with rep and run the analysis the usual way:
dtf <- data.frame(value = 1:3, count = c(4, 2, 1))
x <- with(dtf, rep(value, count))
summary(x)
#  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
# 1.000   1.000  1.000 1.571   2.000 3.000
fivenum(x)
# [1] 1 1 1 2 3
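When the total count is too large to expand with rep (as in the question), the same numbers can be computed from the counts directly. Below is a minimal base R sketch, not a vetted implementation: the function name weighted_summary and its arguments are made up for illustration, and it assumes the counts are plain frequency weights and that you want the same interpolation rule as R's default quantile (type 7).

```r
# Summary statistics computed directly from (value, count) pairs,
# without materialising the expanded vector rep(value, count).
# NOTE: weighted_summary is a hypothetical helper, not a library function.
weighted_summary <- function(value, count, probs = c(0.25, 0.5, 0.75)) {
  ord   <- order(value)
  value <- value[ord]
  count <- as.numeric(count[ord])  # doubles avoid integer overflow in sum()
  n     <- sum(count)

  m <- sum(value * count) / n
  # Sample variance with the usual n - 1 denominator (counts as frequencies)
  v <- sum(count * (value - m)^2) / (n - 1)

  # k-th smallest observation of the expanded vector, looked up via
  # the cumulative counts instead of indexing into the vector itself
  cum <- cumsum(count)
  order_stat <- function(k) value[findInterval(k - 1, cum) + 1]

  # Linear interpolation between neighbouring order statistics,
  # following the rule used by quantile(..., type = 7)
  h  <- (n - 1) * probs + 1
  lo <- floor(h)
  hi <- ceiling(h)
  q  <- order_stat(lo) + (h - lo) * (order_stat(hi) - order_stat(lo))
  names(q) <- paste0(probs * 100, "%")

  list(mean = m, sd = sqrt(v), quantiles = q)
}

dtf <- data.frame(value = 1:3, count = c(4, 2, 1))
res <- with(dtf, weighted_summary(value, count))
res
```

On this small example the result can be checked against mean, sd and quantile applied to rep(value, count). The memory cost depends only on the number of distinct values, not on sum(count).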
Upvotes: 1
Reputation: 20282
Have you tried these packages:
Hmisc: it has several weighted statistics, including weighted quantiles
laeken: it has weighted quantiles
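For the data frame in the question, a short sketch with Hmisc might look like the following. It assumes Hmisc is installed and that the counts should be treated as frequency weights (the default, normwt = FALSE); wtd.quantile and wtd.var are the Hmisc functions, the variable names are just for illustration.

```r
library(Hmisc)

x <- data.frame(value = c(1, 2, 3), count = c(4, 2, 1))

# Weighted 1st quartile, median and 3rd quartile
wq <- wtd.quantile(x$value, weights = x$count, probs = c(0.25, 0.5, 0.75))

# Weighted standard deviation as the square root of the weighted variance;
# with normwt = FALSE the weights are taken as observation counts
wsd <- sqrt(wtd.var(x$value, weights = x$count))

wq
wsd
```

Neither call expands the data, so the cost scales with the number of rows in x rather than with sum(x$count).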
Upvotes: 7