Reputation: 1532
I have a large, strictly increasing array (10 million integers) of offsets for another, larger, data array. No element in data is greater than 50. For example,
unsigned char data[70*1000*1000] = {0,2,1,1,0,2,1,4,2, ...};
unsigned int offsets[10*1000*1000] = {0,1,2,4,6,7,8, ...};
Then I would like to find the count of each value in a series of ranges that are not known until runtime, counting only elements whose indices appear in the offsets array. The endpoints of each range refer to indices of the data array, not to the offsets. For example, the data for the range [1,4] would be:
1 zero
1 one
1 two
The results include only one "one" because, while both data[3] and data[2] are equal to one, 3 is not included in offsets.
I need to compute these binned counts for several hundred ranges, some of which span the entire array. I considered iterating through the data array and storing a cumulative count for each bin at every element, but the memory requirements would have been prohibitive (on the order of 70 million positions × 51 bins). Here is a simple version of my implementation:
for(int i=0; i<range_count; i++){
    unsigned int j=0;
    // Advance past offsets that fall before the start of this range.
    while(j < 10000000 and offsets[j] < range_starts[i]) j++;
    // Count the data value at each listed offset up to the end of the range.
    while(j < 10000000 and offsets[j] <= range_ends[i]) bins[i][data[offsets[j++]]]++;
}
Is there any more efficient way to compute these counts?
Upvotes: 2
Views: 1124
Reputation: 1532
While Ruben's answer did cut the time of the counts roughly in half, it remained too slow for my application. I include my solution here for the curious.
First, I optimized by setting every element of the data array that is not indexed by offsets to an unused value (51, for example). This removed the need to track offsets, because I could simply ignore the contents of bin 51 when reporting results.
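In outline, that marking pass is a single merge-style walk over the two arrays; a minimal sketch, with the array lengths written out as in the declarations above:
unsigned int k = 0;                          // current position in offsets
for (unsigned int i = 0; i < 70*1000*1000; i++) {
    // Overwrite every data element whose index is absent from offsets with
    // the sentinel value 51, so it falls into a bin that is never reported.
    if (k < 10*1000*1000 && offsets[k] == i)
        k++;                                 // index i is listed: keep data[i]
    else
        data[i] = 51;                        // index i is not listed: mark it
}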
While I mentioned in the question that storing cumulative counts for each bin at every element would require too much memory, I was able to store the cumulative counts for each bin at every range endpoint in linear time. Then, for each range, I calculated the occurrences of each element by subtracting the cumulative count for that element at the left endpoint of the range from the count at the right endpoint. Here is what I used:
struct range{
    unsigned int lowerbound;
    unsigned int upperbound;
    unsigned int bins[52];
};
struct endpoint{
    int n;                        // index into data[]
    unsigned int counts[52];      // cumulative counts of data[0..n)
};
range ranges[N_RANGES];
endpoint endpoints[N_RANGES*2];   // all range endpoints, sorted by n
unsigned int cumulative_counts[52] = {0};
// ... < data manipulation > ...
endpoint* first_ep = &endpoints[0];
endpoint* last_ep = &endpoints[N_RANGES*2-1];
endpoint* next_ep;
// One pass over data: snapshot the running counts at every endpoint.
for(next_ep=&endpoints[0];next_ep<last_ep;next_ep++){
    unsigned char* i = &data[next_ep->n];
    unsigned char* i_end = &data[(next_ep+1)->n];
    for(int j=0;j<51;j++) next_ep->counts[j] = cumulative_counts[j];
    while(i<i_end) cumulative_counts[*(i++)]++;
}
for(int i=0;i<51;i++) last_ep->counts[i] = cumulative_counts[i];
// Per range: subtract the snapshot at the lower endpoint from the snapshot
// at the upper endpoint (ranges are processed in order of lowerbound).
for(int i=0;i<N_RANGES;i++){
    while(first_ep->n != ranges[i].lowerbound) first_ep++;
    last_ep = first_ep+1;
    while(last_ep->n != ranges[i].upperbound) last_ep++;
    for(int j=0;j<51;j++) ranges[i].bins[j] = last_ep->counts[j]-first_ep->counts[j];
    ranges[i].bins[data[last_ep->n]]++;   // make the upper bound inclusive
}
Upvotes: 2
Reputation: 393537
Could this work?
(Demo live at http://ideone.com/6rAj7k)
#include <algorithm>
#include <iostream>
#include <iterator>

unsigned char data[/*70*1000*1000*/] = {0,2,1,1,0,2,1,4,2};
unsigned int offsets[/*10*1000*1000*/] = {0,1,2,4,6,7,8};

using namespace std;

void do_something_for_data_index(unsigned int data_index)
{
    std::cout << "visited: " << (int) data[data_index] << " (at index " << data_index << ")\n";
}

void foo(size_t first_data_index, size_t high_data_index)
{
    // Binary-search the sorted offsets array for the sub-range of offsets
    // that fall inside [first_data_index, high_data_index].
    const auto low  = lower_bound(begin(offsets), end(offsets), first_data_index);
    const auto high = upper_bound(low, end(offsets), high_data_index);

    for(auto offset_it = low; offset_it != high; ++offset_it)
    {
        do_something_for_data_index(*offset_it);
    }
}

int main()
{
    foo(1,4);
}
Output:
visited: 2 (at index 1)
visited: 1 (at index 2)
visited: 0 (at index 4)
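To plug this into the binned counts from the question, the visitor can increment a bin instead of printing. A sketch under that assumption, reusing the arrays above (the function name and bin parameter are mine):
// Count the data values at every listed offset inside
// [first_data_index, high_data_index], accumulating into bin[0..50].
void count_range(size_t first_data_index, size_t high_data_index, unsigned int* bin)
{
    const auto low  = lower_bound(begin(offsets), end(offsets), first_data_index);
    const auto high = upper_bound(low, end(offsets), high_data_index);
    for(auto offset_it = low; offset_it != high; ++offset_it)
        bin[data[*offset_it]]++;
}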
Upvotes: 1
Reputation: 14778
It sounds like you already had your answer when you said your data values are limited by 50 -- and they seem to be non-negative integers.
What about building a vector of offsets for each data value, from 0 to 50? Doing the other calculations on top of that would be way cheaper. This would be a sort of reverse index, from data values to database entries.
So, you would have:
data[50][...] = {offsets related to the given data value}
The calculations would be performed by checking, in each per-value array, the first element that falls in the range, and then skipping from array to array while maintaining the position of the last element verified.
This would be linear in the number of elements of your entire array, times your search range, times the number of distinct data values (0 to 50), which, considering you need to do this lots of times, wouldn't be the best approach.
Another approach, then, would be to use, for each data value from 0 to 50, a binary tree -- or even a hash structure -- so that you can know whether a database entry identifier belongs to the set of ids of the current data value. This would be, in the best case, linear in your search range for each iteration.
I've treated 50 as a constant in the analysis, so that searching only the first per-value array, or all of them, would cost the same. I'm not sure whether that is a valid assumption, so the complexity would be O(nr), with n equal to the number of distinct data values (0 to 50) and r equal to your search range (the number of entries in your database). This holds for each calculation, so, taking i as the number of calculations, the overall complexity would be O(nri).
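A rough sketch of that reverse index, with one sorted vector of offsets per data value and two binary searches per (value, range) query; the names are illustrative, not from the question:
#include <algorithm>
#include <vector>

// index_by_value[v] lists, in increasing order, every offset whose data value is v.
std::vector<unsigned int> index_by_value[51];

void build_index(const unsigned char* data, const unsigned int* offsets, unsigned int n_offsets)
{
    // offsets is already sorted, so every per-value list comes out sorted too.
    for (unsigned int j = 0; j < n_offsets; j++)
        index_by_value[data[offsets[j]]].push_back(offsets[j]);
}

// Occurrences of value v among the listed offsets inside [lo, hi] (inclusive),
// found with two binary searches.
unsigned int count_value_in_range(unsigned char v, unsigned int lo, unsigned int hi)
{
    const std::vector<unsigned int>& idx = index_by_value[v];
    return std::upper_bound(idx.begin(), idx.end(), hi)
         - std::lower_bound(idx.begin(), idx.end(), lo);
}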
Upvotes: 1