Reputation: 697

c++ array sorting with some specifications

I'm using C++. Using sort from STL is allowed.

I have an array of int, like this :

1 4 1 5 145 345 14 4

The numbers are stored in a char* (i read them from a binary file, 4 bytes per numbers)

I want to do two things with this array :

swap each number with the one after that

4 1 5 1 345 145 4 14
sort it by group of 2

4 1 4 14 5 1 345 145

I could code it step by step, but it wouldn't be efficient. What I'm looking for is speed. O(n log n) would be great.

Also, this array can be bigger than 500MB, so memory usage is an issue.

My first idea was to sort the array starting from the end (to swap the numbers 2 by 2) and treating it as a long* (to force the sorting to take 2 int each time). But I couldn't manage to code it, and I'm not even sure it would work.

I hope I was clear enough, thanks for your help : )

Upvotes: 1

Answers (3)

WhozCraig

Reputation: 66254

This is the most memory efficient layout I could come up with. Obviously the vector I'm using would be replaced by the data blob you're using, assuming endian-ness is all handled well enough. The premise of the code below is simple.

Generate 1024 random values in pairs, each pair consisting of the first number between 1 and 500, the second number between 1 and 50.
Iterate the entire list, flipping all even-index values with their following odd-index brethren.
Send the entire thing to std::qsort with an item width of two (2) int32_t values and a count of half the original vector.
The comparator function simply sorts on the immediate value first, and on the second value if the first is equal.

The sample below does this for 1024 items. I've tested it without output for 134217728 items (exactly 536870912 bytes) and the results were pretty impressive for a measly macbook air laptop, about 15 seconds, only about 10 of that on the actual sort. What is ideally most important is no additional memory allocation is required beyond the data vector. Yes, to the purists, I do use call-stack space, but only because q-sort does.

I hope you get something out of it.

Note: I only show the first part of the output, but I hope it shows what you're looking for.

#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <cstdint>


// a most-wacked-out random generator. every other call will
//  pull from a rand modulo either the first, or second template
//  parameter, in alternation.
template<int N,int M>
struct randN
{
    int i = 0;
    int32_t operator ()()
    {
        i = (i+1)%2;
        return (i ? rand() % N : rand() % M) + 1;
    }
};

// compare to integer values by address.
int pair_cmp(const void* arg1, const void* arg2)
{
    const int32_t *left = (const int32_t*)arg1;
    const int32_t *right = (const int32_t *)arg2;
    return (left[0] == right[0]) ? left[1] - right[1] : left[0] - right[0];
}

int main(int argc, char *argv[])
{
    // a crapload of int values
    static const size_t N = 1024;

    // seed rand()
    srand((unsigned)time(0));

    // get a huge array of random crap from 1..50
    vector<int32_t> data;
    data.reserve(N);
    std::generate_n(back_inserter(data), N, randN<500,50>());

    // flip all the values
    for (size_t i=0;i<data.size();i+=2)
    {
        int32_t tmp = data[i];
        data[i] = data[i+1];
        data[i+1] = tmp;
    }

    // now sort in pairs. using qsort only because it lends itself
    //  *very* nicely to performing block-based sorting.
    std::qsort(&data[0], data.size()/2, sizeof(data[0])*2, pair_cmp);
    cout << "After sorting..." << endl;
    std::copy(data.begin(), data.end(), ostream_iterator<int32_t>(cout,"\n"));
    cout << endl << endl;

    return EXIT_SUCCESS;
}

Output

After sorting...
1
69
1
83
1
198
1
343
1
367
2
12
2
30
2
135
2
169
2
185
2
284
2
323
2
325
2
347
2
367
2
373
2
382
2
422
2
492
3
286
3
321
3
364
3
377
3
400
3
418
3
441
4
24
4
97
4
153
4
210
4
224
4
250
4
354
4
356
4
386
4
430
5
14
5
26
5
95
5
145
5
302
5
379
5
435
5
436
5
499
6
67
6
104
6
135
6
164
6
179
6
310
6
321
6
399
6
409
6
425
6
467
6
496
7
18
7
65
7
71
7
84
7
116
7
201
7
242
7
251
7
256
7
324
7
325
7
485
8
52
8
93
8
156
8
193
8
285
8
307
8
410
8
456
8
471
9
27
9
116
9
137
9
143
9
190
9
190
9
293
9
419
9
453

Upvotes: 2

JoergB

Reputation: 4463

With some additional constraints on both your input and your platform, you can probably use an approach like the one you are thinking of. These constraints would include

Your input contains only positive numbers (i.e. can be treated as unsigned)
Your platform provides uint8_t and uint64_t in <cstdint>
You address a single platform with known endianness.

In that case you can divide your input into groups of 8 bytes, do some byte shuffling to arrange each groups as one uint64_t with the "first" number from the input in the lower-valued half and run std::sort on the resulting array. Depending on endianness you may need to do more byte shuffling to rearrange each sorted 8-byte group as a pair of uint32_t in the expected order.

If you can't code this on your own, I'd strongly advise you not to take this approach.

A better and more portable approach (you have some inherent non-portability by starting from a not clearly specified binary file format), would be:

std::vector<int> swap_and_sort_int_pairs(const unsigned char buffer[], size_t buflen) {
   const size_t intsz = sizeof(int);
   // We have to assume that the binary format in buffer is compatible with our int representation
   // we also require an even number of integers
   assert(buflen % (2*intsz) == 0);

   // load pairwise
   std::vector< std::pair<int,int> > pairs;
   pairs.reserve(buflen/(2*intsz));
   for (const unsigned char* bufp=buffer; bufp<buffer+buflen; bufp+= 2*intsz) {
      // It would be better to have a more portable binary -> int conversion
      int first_value = *reinterpret_cast<int*>(bufp);
      int second_value = *reinterpret_cast<int*>(bufp + intsz);
      // swap each pair here
      pairs.emplace_back( second_value, firstvalue );
   }
   // less<pair<..>> does lexicographical ordering, which is what you are looking ofr
   std::sort(pairs.begin(), pairs.end());

   // convert back to linear vector 
   std::vector<int> result;
   result.reserve(2*pairs.size());
   for (auto& entry : pairs) {
      result.push_back(entry.first);
      result.push_back(entry.second);
   }
   return result;
}

Both the inital parse/swap pass (which you need anyway) and the final conversion are O(N), so the total complexity is still (O(N log(N)).

If you can continue to work with pairs, you can save the final conversion. The other way to save that conversion would be to use a hand-coded sort with two-int strides and two-int swap: much more work - and possibly still hard to get as efficient as a well-tuned library sort.

Upvotes: 2

thiton

Reputation: 36059

Do one thing at a time. First, give your data some *struct*ure. It seems that each 8 byte form a unit of the form

struct unit {
    int key;
    int value;
}

If the endianness is right, you can do this in O(1) with a reinterpret_cast. If it isn't, you'll have to live with a O(n) conversion effort. Both vanish compared to the O(n log n) search effort.

When you have an array of these units, you can use std::sort like:

bool compare_units(const unit& a, const unit& b) {
    return a.key < b.key;
}

std::sort(array, length, compare_units);

The key to this solution is that you do the "swapping" and byte-interpretation first and then do the sorting.

Upvotes: 0

c++ array sorting with some specifications

Answers (3)

Related Questions