Reputation: 1309
I have a numeric vector v (with already omitted NAs) and want to get the nth largest values and their respective frequencies.
I found http://gallery.rcpp.org/articles/top-elements-from-vectors-using-priority-queue/ to be quite fast.
// [[Rcpp::export]]
std::vector<int> top_i_pq(NumericVector v, unsigned int n)
{
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
return result ;
}
However ties will not be respected. In fact I don't need the indices, returning the values would also be ok.
What I would like to get is a list containing the values and the frequencies, say something like:
numv <- c(4.2, 4.2, 4.5, 0.1, 4.4, 2.0, 0.9, 4.4, 3.3, 2.4, 0.1)
top_i_pq(numv, 3)
$lengths
[1] 2 2 1
$values
[1] 4.2 4.4 4.5
Neither getting a unique vector, nor a table, nor a (full) sort is a good idea, as n is usually small compared to the length of v (which might easily be >1e6).
Solutions so far are:
library(microbenchmark)
library(data.table)
library(DescTools)
set.seed(1789)
x <- sample(round(rnorm(1000), 3), 1e5, replace = TRUE)
n <- 5
microbenchmark(
BaseR = tail(table(x), n),
data.table = data.table(x)[, .N, keyby = x][(.N - n + 1):.N],
DescTools = Large(x, n, unique=TRUE),
Coatless = ...
)
Unit: milliseconds
expr min lq mean median uq max neval
BaseR 188.09662 190.830975 193.189422 192.306297 194.02815 253.72304 100
data.table 11.23986 11.553478 12.294456 11.768114 12.25475 15.68544 100
DescTools 4.01374 4.174854 5.796414 4.410935 6.70704 64.79134 100
Hmm, DescTools still fastest, but I'm sure it can be significantly improved by Rcpp (as it's pure R)!
Upvotes: 5
Views: 962
Reputation: 20746
Note: The previous version replicated functionality for table()
and not the target. This version has been removed and will be available off-site.
Below is a solution using a map
.
First of all, we need to find the "unique" values for the vector of numbers.
To do this, we opt to store the number being counted as a key
within a std::map
and increment the value
each time we observe that number.
Using the ordering structure of the std::map
, we know that the top n
numbers are at the back of the std::map
. Thus, we use an iterator to pop those elements and export them in an array.
If one has access to a C++11 compiler, an alternative is to use std::unordered_map
, which has a Big O of O(1)
for insertion and retrieval ( O(n)
if bad hashes) vs. std::map
which has a Big O of O(log(n))
.
To obtain the correct top n
, one would then use std::partial_sort()
to do so.
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List top_n_map(const Rcpp::NumericVector & v, int n)
{
// Initialize a map
std::map<double, int> Elt;
Elt.clear();
// Count each element
for (int i = 0; i != v.size(); ++i) {
Elt[ v[i] ] += 1;
}
// Find out how many unique elements exist...
int n_obs = Elt.size();
// If the top number, n, is greater than the number of observations,
// then drop it.
if(n > n_obs ) { n = n_obs; }
// Pop the last n elements as they are already sorted.
// Make an iterator to access map info
std::map<double,int>::iterator itb = Elt.end();
// Advance the end of the iterator up to 5.
std::advance(itb, -n);
// Recast for R
Rcpp::NumericVector result_vals(n);
Rcpp::NumericVector result_keys(n);
unsigned int count = 0;
// Start at the nth element and move to the last element in the map.
for( std::map<double,int>::iterator it = itb; it != Elt.end(); ++it )
{
// Move them into split vectors
result_keys(count) = it->first;
result_vals(count) = it->second;
count++;
}
return Rcpp::List::create(Rcpp::Named("lengths") = result_vals,
Rcpp::Named("values") = result_keys);
}
Let's verify that it works by running over some data:
# Set seed for reproducibility
set.seed(1789)
x <- sample(round(rnorm(1000), 3), 1e5, replace = TRUE)
n <- 5
And now we seek to obtain the occurrence information:
# Call our function
top_n_map(a)
Gives us:
$lengths
[1] 101 104 101 103 103
$values
[1] 2.468 2.638 2.819 3.099 3.509
Unit: microseconds
expr min lq mean median uq max neval
BaseR 112750.403 115946.7175 119493.4501 117676.2840 120712.595 166067.530 100
data.table 6583.851 6994.3665 8311.8631 7260.9385 7972.548 47482.559 100
DescTools 3291.626 3503.5620 5047.5074 3885.4090 5057.666 43597.451 100
Coatless 6097.237 6240.1295 6421.1313 6365.7605 6528.315 7543.271 100
nrussel_c98 513.932 540.6495 571.5362 560.0115 584.628 797.315 100
nrussel_c11 489.616 512.2810 549.6581 533.2950 553.107 961.221 100
As we can see, this implementation beats out data.table
, but falls victim to DescTools and @nrussel's attempts.
Upvotes: 1
Reputation: 18612
I'd like to throw my hat in the ring with another Rcpp-based solution, which is ~7x faster than the DescTools
approach and ~13x faster than the data.table
approach, using the 1e5-length x
and n = 5
sample data above. The implementation is a bit lengthy, so I'll lead with the benchmark:
fn.dt <- function(v, n) {
data.table(v = v)[
,.N, keyby = v
][(.N - n + 1):.N]
}
microbenchmark(
"DescTools" = Large(x, n, unique=TRUE),
"top_n" = top_n(x, 5),
"data.table" = fn.dt(x, n),
times = 500L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# DescTools 3330.527 3790.035 4832.7819 4070.573 5323.155 54921.615 500
# top_n 566.207 587.590 633.3096 593.577 640.832 3568.299 500
# data.table 6920.636 7380.786 8072.2733 7764.601 8585.472 14443.401 500
Update
If your compiler supports C++11, you can take advantage of std::priority_queue::emplace
for a (surprisingly) noticeable performance boost (compared to the C++98 version below). I won't post this version as it is mostly identical, save for a few calls to std::move
and emplace
, but here's a link to it.
Testing this against the previous three functions, and using data.table
1.9.7 (which is a bit faster than 1.9.6) yields
print(res2, order = "median", signif = 3)
# Unit: relative
# expr min lq mean median uq max neval cld
# top_n2 1.0 1.00 1.000000 1.00 1.00 1.00 1000 a
# top_n 1.6 1.58 1.666523 1.58 1.75 2.75 1000 b
# DescTools 10.4 10.10 8.512887 9.68 7.19 12.30 1000 c
# data.table-1.9.7 16.9 16.80 14.164139 15.50 10.50 43.70 1000 d
where top_n2
is the C++11 version.
The top_n
function is implemented as follows:
#include <Rcpp.h>
#include <utility>
#include <queue>
class histogram {
private:
struct paired {
typedef std::pair<double, unsigned int> pair_t;
pair_t pair;
unsigned int is_set;
paired()
: pair(pair_t()),
is_set(0)
{}
paired(double x)
: pair(std::make_pair(x, 1)),
is_set(1)
{}
bool operator==(const paired& other) const {
return pair.first == other.pair.first;
}
bool operator==(double other) const {
return is_set && (pair.first == other);
}
bool operator>(double other) const {
return is_set && (pair.first > other);
}
bool operator<(double other) const {
return is_set && (pair.first < other);
}
paired& operator++() {
++pair.second;
return *this;
}
paired operator++(int) {
paired tmp(*this);
++(*this);
return tmp;
}
};
struct greater {
bool operator()(const paired& lhs, const paired& rhs) const {
if (!lhs.is_set) return false;
if (!rhs.is_set) return true;
return lhs.pair.first > rhs.pair.first;
}
};
typedef std::priority_queue<
paired,
std::vector<paired>,
greater
> queue_t;
unsigned int sz;
queue_t queue;
void insert(double x) {
if (queue.empty()) {
queue.push(paired(x));
return;
}
if (queue.top() > x && queue.size() >= sz) return;
queue_t qtmp;
bool matched = false;
while (queue.size()) {
paired elem = queue.top();
if (elem == x) {
qtmp.push(++elem);
matched = true;
} else {
qtmp.push(elem);
}
queue.pop();
}
if (!matched) {
if (qtmp.size() >= sz) qtmp.pop();
qtmp.push(paired(x));
}
std::swap(queue, qtmp);
}
public:
histogram(unsigned int sz_)
: sz(sz_),
queue(queue_t())
{}
template <typename InputIt>
void insert(InputIt first, InputIt last) {
for ( ; first != last; ++first) {
insert(*first);
}
}
Rcpp::List get() const {
Rcpp::NumericVector values(sz);
Rcpp::IntegerVector freq(sz);
R_xlen_t i = 0;
queue_t tmp(queue);
while (tmp.size()) {
values[i] = tmp.top().pair.first;
freq[i] = tmp.top().pair.second;
++i;
tmp.pop();
}
return Rcpp::List::create(
Rcpp::Named("value") = values,
Rcpp::Named("frequency") = freq);
}
};
// [[Rcpp::export]]
Rcpp::List top_n(Rcpp::NumericVector x, int n = 5) {
histogram h(n);
h.insert(x.begin(), x.end());
return h.get();
}
There's a lot going on in the histogram
class above, but just to touch on some of the key points:
paired
type is essentially a wrapper class around an std::pair<double, unsigned int>
, which associates a value with a count, providing some convenience features such as operator++()
/ operator++(int)
for direct pre-/post-increment of the count, and modified comparison operators. histogram
class wraps a sort of "managed" priority queue, in the sense that the size of std::priority_queue
is capped at a particular value sz
. std::less
ordering of std::priority_queue
, I'm using a greater-than comparator so that candidate values can be checked against std::priority_queue::top()
to quickly determine whether they should (a) be discarded, (b) replace the current minimum value in the queue, or (c) update the count of one of the existing values in the queue. This is only possible because the size of the queue is being restricted to <= sz
. Upvotes: 5
Reputation: 34763
I'd wager data.table
is competitive:
library(data.table)
data <- data.table(v)
data[ , .N, keyby = v][(.N - n + 1):.N]
where n
is the number you want to get
Upvotes: 4