Reputation: 107
I am trying to solve the following task:
1) Given an array A of size N.
2) Given a set of range update queries, i.e. triples (L, R, val), each of which does A[i] += val for L <= i <= R.
3) Given a set of range sum queries, i.e. pairs (L, R), each of which returns sum(A[i]) for L <= i <= R.
Constraints:
1) Size of A and of the update and sum query sets: N, N1, N2 <= 2^24.
2) 0 <= L <= 2^24, 0 <= R <= 2^24, 0 <= val <= 2^24.
The problem is to calculate the sum S of the results of all range sum queries, modulo 2^32.
It seems one could implement a segment tree to get the desired sum in O(N log N) time, but we actually don't need that data structure. Instead, we can somehow calculate S in O(N) time using just 2 or 3 arrays. What is the general idea here?
I recently wrote an algorithm in C++ for this problem, but it is not optimal. In outline (the step numbers match the comments in the code below): step 2 records every update (L, R, val) in two marker arrays, and step 5 expands those markers into a Partial_sum array, after which any segment sum (L, R) can be computed in O(1) time as Partial_sum[R+1] - Partial_sum[L]. A minimal sketch of this difference-array formulation follows.
The problem is that step 2 is too slow, and the loop in step 5 is hard to understand. It is a linear-time solution, but the constant is too high. I suspect there is a way to improve step 5, but I don't see how.
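For illustration, here is a minimal self-contained sketch of this difference-array formulation, using a made-up tiny array (8 elements) and a single update; the names diff and curr and all the example values are just for the sketch:
#include <cstddef>
#include <iostream>
typedef unsigned int UINT;
int main()
{
UINT diff[9] = {}; //difference array for 8 elements, zero-initialized
UINT part_sum[9] = {}; //Partial_sum, part_sum[0] = 0
//one update (l, r, val): A[l..r] += val
size_t l = 2, r = 5;
UINT val = 10;
diff[l] += val; //val starts taking effect at index l
diff[r + 1] -= val; //and stops after index r (UINT wraps mod 2^32, as required)
//one pass: a running sum of diff reconstructs A[i - 1],
//and a second running sum builds Partial_sum
UINT curr = 0;
for (size_t i = 1; i <= 8; ++i)
{
curr += diff[i - 1]; //curr == A[i - 1]
part_sum[i] = part_sum[i - 1] + curr;
}
//any segment sum (L, R) in O(1): part_sum[R + 1] - part_sum[L]
std::cout << part_sum[r + 1] - part_sum[l] << std::endl; //prints 40
return 0;
}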
Could someone give some ideas or even suggest their own algorithm to solve this problem?
Thank you.
My algorithm implementation:
#include <cstring>
#include <iostream>
#include <stdio.h>
typedef unsigned int UINT;
typedef unsigned long long ULL;
//MOD and size of A
const ULL MOD = 4294967296LL; // 2^32 (unused below: UINT arithmetic already wraps mod 2^32)
const size_t N = 16777216; // 2^24
//params for next_rand()
UINT seed = 0;
UINT a;
UINT b;
//get random segment
UINT next_rand()
{
seed = seed * a + b;
return seed >> 8;
}
int main()
{
UINT N1, N2;
std::cin >> N1 >> N2;
std::cin >> a >> b;
UINT* add = new UINT[N]; //Add array
UINT* subs = new UINT[N]; //Subtraction array
UINT* part_sum = new UINT[N + 1]; //Partial sums array
memset(add, 0, sizeof(UINT) * N);
memset(subs, 0, sizeof(UINT) * N);
memset(part_sum, 0, sizeof(UINT) * (N + 1)); //Initialize arrays
//step 2: record each update's endpoints in the marker arrays
for (size_t i = 0; i < N1; ++i)
{
UINT val = next_rand();
UINT l = next_rand();
UINT r = next_rand();
if (l > r)
{
std::swap(l, r);
}
add[l] = (add[l] + val); //val starts taking effect at index l
subs[r] = (subs[r] + val); //val stops taking effect after index r (see step 5)
}
part_sum[0] = 0;
UINT curr_add = 0;
//step 5: expand the markers into Partial_sum in one pass
for (size_t i = 1; i <= N; ++i)
{
//curr_add is the total of all updates still active before index i - 1,
//so A[i - 1] == curr_add + add[i - 1]
part_sum[i] = (part_sum[i - 1] + curr_add + add[i - 1]);
curr_add = (curr_add + add[i - 1] - subs[i - 1]); //drop updates ending at i - 1
}
UINT res_sum = 0;
//Get any segment sum in O(1)
for (size_t i = 0; i < N2; ++i)
{
UINT l = next_rand();
UINT r = next_rand();
if (l > r)
{
std::swap(l, r);
}
res_sum = (res_sum + part_sum[r + 1] - part_sum[l]); //UINT wrap-around keeps this mod 2^32
}
std::cout << res_sum;
delete []add;
delete []subs;
delete []part_sum;
return 0;
}
Upvotes: 1
Views: 1123
Reputation: 107
I've implemented the described algorithm in a different way, and it should work faster than before at the maximum update and sum query counts. The key change: instead of building the partial-sum array and then evaluating every sum query, I keep a second difference array that counts how many sum queries cover each index, and a single pass over the array accumulates res_sum += A[i] * times[i]. A tiny stand-alone check of that identity follows the code.
#include <iostream>
#include <stdio.h>
#include <vector>
typedef unsigned int UINT;
typedef unsigned long long ULL;
const ULL MOD = 4294967296LL; // 2^32
const size_t N = 16777216; // 2^24
UINT seed = 0;
UINT a;
UINT b;
UINT next_rand()
{
seed = seed * a + b;
return seed >> 8;
}
std::vector <std::pair<UINT, UINT> > add;
int main()
{
UINT upd_query_count;
UINT sum_query_count;
// freopen("fastadd.in", "r", stdin);
// freopen("fastadd.out", "w", stdout);
scanf("%u", &upd_query_count);
scanf("%u", &sum_query_count);
scanf("%u", &a);
scanf("%u", &b);
add.resize(N + 1); //resize, not reserve: add[i] below requires the elements to actually exist
for (size_t i = 0; i < upd_query_count; ++i)
{
UINT val = next_rand();
UINT l = next_rand();
UINT r = next_rand();
if (l > r)
{
add[r].first += val;
add[l + 1].first -= val;
}
else
{
add[l].first += val;
add[r + 1].first -= val;
}
}
for (size_t i = 0; i < sum_query_count; ++i)
{
UINT l = next_rand();
UINT r = next_rand();
if (l > r)
{
++add[r].second;
--add[l + 1].second;
}
else
{
++add[l].second;
--add[r + 1].second;
}
}
UINT curr_add = 0; //running A[i]: prefix sum of add[.].first
UINT res_sum = 0; //total over all sum queries, mod 2^32 via UINT wrap-around
UINT times = 0; //number of sum queries covering index i: prefix sum of add[.].second
for (size_t i = 0; i < N; ++i )
{
curr_add += add[i].first;
times += add[i].second;
res_sum += curr_add * times; //index i contributes A[i] once per covering query
}
printf("%u\n", res_sum);
return 0;
}
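To see why that final loop gives the right answer, here is a tiny stand-alone check of the counting identity; the values of A and the two queries are made up for the example:
#include <iostream>
typedef unsigned int UINT;
int main()
{
//A = {5, 7, 9}; sum queries (0, 1) and (1, 2)
//direct evaluation: (5 + 7) + (7 + 9) = 28
//counting: index i contributes A[i] once per covering query,
//so times = {1, 2, 1} and 5*1 + 7*2 + 9*1 = 28
UINT A[3] = { 5, 7, 9 };
UINT times[3] = { 1, 2, 1 };
UINT s = 0;
for (int i = 0; i < 3; ++i)
{
s += A[i] * times[i]; //wraps mod 2^32, as the task requires
}
std::cout << s << std::endl; //prints 28
return 0;
}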
Upvotes: 1
Reputation: 38919
So add and subs are very large arrays.
The first place you should look for a speedup here is in memory access. As N1 becomes large you will end up with a tremendous number of cache misses. A full explanation is probably beyond the scope of an answer, so I'll link: http://en.wikipedia.org/wiki/CPU_cache
As for a way to speed this up: let's try to improve spatial locality by ordering our accesses.
std::vector<std::pair<UINT, UINT>> l(N1); // N1 default-constructed (position, val) pairs
std::vector<std::pair<UINT, UINT>> r(N1);
for(size_t i = 0; i < N1; ++i){
const UINT val = next_rand();
const UINT first = next_rand();
const UINT second = next_rand();
if(first > second){
l[i] = std::make_pair(second, val);
r[i] = std::make_pair(first, val);
}else{
l[i] = std::make_pair(first, val);
r[i] = std::make_pair(second, val);
}
}
std::sort(l.begin(), l.end());
std::sort(r.begin(), r.end());
for(size_t i = 0; i < N1; ++i){
add[l[i].first] += l[i].second;
subs[r[i].first] += r[i].second;
}
Keep a couple of things in mind: std::pair's operator< compares the first element and, if those are equal, compares the second. That's how I'm able to use std::sort without writing any more code. However, if first is equal for two elements, the highest val will always end up as the second one added. It doesn't seem like that would be a problem in your current code, but if it becomes one you can solve it by writing your own sort loop rather than relying on std::sort; one such alternative is sketched below.
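If it does become a problem, one alternative to a hand-written sort loop (a sketch, assuming C++11 and <algorithm> are available) is std::stable_sort with a comparator that looks at first only, so equal positions keep their insertion order:
#include <algorithm>
#include <utility>
#include <vector>
typedef unsigned int UINT;
// Sort by position only; std::stable_sort keeps elements with equal
// positions in their original (insertion) order instead of ordering them by val.
void sort_by_position(std::vector<std::pair<UINT, UINT>>& v){
std::stable_sort(v.begin(), v.end(),
[](const std::pair<UINT, UINT>& x, const std::pair<UINT, UINT>& y){
return x.first < y.first;
});
}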
Also, depending on how sparse the accesses to each cache block are, it may be faster to do the additions in separate loops, as sketched below.
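For example, the last loop above could be split into two passes over the same l and r vectors, so each pass writes to only one of the big arrays:
// First pass: touch only add.
for(size_t i = 0; i < N1; ++i){
add[l[i].first] += l[i].second;
}
// Second pass: touch only subs.
for(size_t i = 0; i < N1; ++i){
subs[r[i].first] += r[i].second;
}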
As always, the only way to really improve performance is by working with actual numbers, so be sure to do your own benchmarking as you're comparing methods.
Upvotes: 0