Calculate summary statistics on histogram-like data in R

Question

My data is in the form of count/histogram data, i.e. a list of vectors, where each vector holds a "value" followed by its "count":

[[1]]
[1]  1 34

[[2]]
[1]      2 233799

[[3]]
[1]     3 28129

[[4]]
[1]     4 42549

[[5]]
[1]     5 52264

This means, for example, that a score of 1 occurred 34 times, a score of 2 occurred 233799 times, and so on.

Now, I want to be able to calculate arbitrary summary statistics on this data, such as mean, median, and sd. However, the standard functions expect a plain vector of observations, so they can't be applied to the data in its current format. Is there a function, datatype, or package designed for this kind of data that would let me apply common statistics to it?

My current hacky workaround is to duplicate the first number n times, where n is the count:

  purrr::flatten_dbl(purrr::map(data, function(datum) {
    # Each element is c(value, count): c(10, 20) means the value 10 was
    # seen 20 times, so we repeat 10 twenty times
    rep(datum[[1]], datum[[2]])
  }))

While this works and returns a single vector that I can apply statistics to, it's extremely slow and memory-intensive because the vector ends up with millions of values. I would like a solution that doesn't require this expansion.

Migwell · Accepted Answer

I wasn't able to find an existing core library or CRAN package to solve this problem, so I developed HistDat, which is available on CRAN: https://cran.r-project.org/package=HistDat
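
For anyone who only needs a few statistics and would rather not add a dependency, the same idea can be sketched in base R by working on the values and counts directly. This is a minimal illustration, not HistDat's implementation; the variable names (`vals`, `counts`, `hist_mean`, `hist_sd`, `hist_median`, `kth_value`) are made up for the example, and it assumes the counts are non-negative whole numbers.

```r
# Data in the question's format: a list of c(value, count) vectors
data <- list(c(1, 34), c(2, 233799), c(3, 28129), c(4, 42549), c(5, 52264))

# Split into parallel vectors, sorted by value
vals   <- vapply(data, function(d) d[[1]], numeric(1))
counts <- vapply(data, function(d) d[[2]], numeric(1))
ord    <- order(vals)
vals   <- vals[ord]
counts <- counts[ord]

n <- sum(counts)  # total number of observations

# Mean: count-weighted average of the values
hist_mean <- sum(vals * counts) / n

# Standard deviation with the n - 1 denominator, as sd() uses
hist_sd <- sqrt(sum(counts * (vals - hist_mean)^2) / (n - 1))

# k-th smallest observation, found from the cumulative counts
cum <- cumsum(counts)
kth_value <- function(k) vals[which(cum >= k)[1]]

# Median: average of the two middle order statistics
# (they coincide when n is odd), as median() does
hist_median <- (kth_value(floor((n + 1) / 2)) +
                kth_value(ceiling((n + 1) / 2))) / 2
```

Memory use here scales with the number of distinct values rather than the number of observations, which is the point of avoiding `rep()`. Other quantiles can be read off `cum` in the same way.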
