Reputation: 20107
My data is count/histogram data, i.e. a list of vectors, where each vector holds a "value" followed by its "count":
[[1]]
[1] 1 34
[[2]]
[1] 2 233799
[[3]]
[1] 3 28129
[[4]]
[1] 4 42549
[[5]]
[1] 5 52264
This means, for example, that a score of 1 happened 34 times, a score of 2 happened 233799 times etc.
Now, I want to be able to calculate arbitrary summary statistics on this data, like mean, median, sd, etc. However, with the current format of the data that isn't possible. I would like to know if there is a function, datatype, or package for handling this kind of data that would allow me to apply common statistics to it.
My current hacky workaround is to duplicate the first number n times, where n is the count:
purrr::flatten_dbl(purrr::map(data, function(datum) {
  # A histogram entry of (10, 20) means the value 10 was seen 20 times,
  # so repeat the value 10 a total of 20 times
  rep(datum[[1]], datum[[2]])
}))
While this works and returns a single vector that I can apply statistics to, it's extremely slow and memory-intensive because the vector ends up with millions of values. I would like a solution that doesn't require this.
Upvotes: 1
Views: 165
Reputation: 20107
I wasn't able to find an existing core library or CRAN package to solve this problem, so I developed HistDat, which is available on CRAN: https://cran.r-project.org/package=HistDat
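For reference, the most common statistics can also be computed directly from the value/count pairs in base R, without ever expanding them into a long vector. The following is only a minimal sketch of that idea and is not HistDat's own implementation; the helper names (hist_mean, hist_var, hist_sd, hist_median) are made up for illustration.

```r
# The value/count pairs from the question, as a list of two-element vectors
data <- list(c(1, 34), c(2, 233799), c(3, 28129), c(4, 42549), c(5, 52264))

# Split the pairs into parallel vectors of values and counts
vals   <- vapply(data, function(d) d[[1]], numeric(1))
counts <- vapply(data, function(d) d[[2]], numeric(1))

# Hypothetical helpers (illustrative names, not part of any package)

# Weighted mean: each value contributes once per observation
hist_mean <- function(vals, counts) sum(vals * counts) / sum(counts)

hist_var <- function(vals, counts) {
  n <- sum(counts)
  m <- hist_mean(vals, counts)
  # Sample variance with the n - 1 denominator, matching stats::var()
  sum(counts * (vals - m)^2) / (n - 1)
}

hist_sd <- function(vals, counts) sqrt(hist_var(vals, counts))

hist_median <- function(vals, counts) {
  ord  <- order(vals)
  vals <- vals[ord]
  cum  <- cumsum(counts[ord])
  n    <- sum(counts)
  # Value at 1-based position k of the (virtual) sorted expanded vector
  at <- function(k) vals[which(cum >= k)[1]]
  # Average the two middle positions, as stats::median() does for even n
  (at(floor((n + 1) / 2)) + at(ceiling((n + 1) / 2))) / 2
}

hist_mean(vals, counts)
hist_sd(vals, counts)
hist_median(vals, counts)
```

Memory use here scales with the number of distinct values rather than the number of observations, which is the main point of avoiding rep().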
Upvotes: 1