Reputation: 64997
This isn't the classic "merging two sorted lists" question, which is fairly trivial to do in linear time.
What I'm trying to do is merge two lists of (key, value) pairs, already sorted by value, where there are objects with the same key in both lists: such objects should have their values merged (added), which may change their sort order. I'm primarily interested in how the sort can be efficiently performed using information from the already sorted lists, since the sort is the slowest part of this algorithm.
Let's take a concrete example. Imagine a List of Student objects:
class Student {
final String name;
final int score;
...
}
Given as input two List<Student> sorted by score, I'd like to create a new merged list of students, where any student (identified by Student.name) appearing in both lists appears once in the final list, with a score equal to the sum of their scores in both lists. The original lists should be left unmodified.
E.g.,
List 1:
{"bob", 20}
{"john", 15}
{"mark", 14}
List 2:
{"bill", 11}
{"mark", 9}
{"john", 1}
Result:
{"mark", 23}
{"bob", 20}
{"john", 16}
{"bill", 11}
The merging itself (identifying students that appear in both lists) can be done in expected O(1) time per student using any O(1) lookup/insert structure such as HashMap. What I'm most interested in is the sort step (although I don't exclude solutions that do the merging and the sorting at the same time).
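For reference, a minimal sketch of just the merge step (list1 and list2 stand for the two inputs), using HashMap.merge; the interesting part is what happens after this:

Map<String, Integer> combined = new HashMap<>();            // name -> summed score
for (Student s : list1) combined.merge(s.name, s.score, Integer::sum);
for (Student s : list2) combined.merge(s.name, s.score, Integer::sum);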
The question, though, is how do I efficiently re-sort such a list? The ordering of the existing lists clearly puts some constraints on the final position of elements in the merged list. For example, if a student is at position i in the first list and j in the second, he must appear among the first i + j students in the merged list, by a simple argument analyzing the maximum number of students that could have a higher score. It's not immediately clear if this information would be useful in sorting the list, however.
You can assume that in many cases students that score highly in one list score highly in the other. The algorithm should work when that is not the case, but it gives you some additional information about the distribution that may be useful, in addition to the fact that the lists are already sorted.
It seems like this type of operation would be common for any type of distributed query + sorting implementation. For example, imagine a "select state,count(*) group by state" type of query issued against a distributed system (to count the number of records in each state) - naturally you'd get a sorted list of (state, count) objects back from each node, and then you'd want to merge and re-sort those during the reduce operation. It seems silly to throw away all the work already done on the distributed nodes.
I'm interested in the case where the lists to be merged and re-sorted are small: usually around 256 entries. The range of scores varies, from 0 to 100 in some cases, up to about 0 to 10,000,000 in others. Of course, given the small number of elements, each operation will be fast in absolute time, even with naive algorithms - but performed billions of times, it adds up.
In fact, one of the answers below has proven that you can't, in general, do this better than a plain sort for increasing list sizes (i.e., taking n to be the combined list size) - but I'm actually more interested in doing this many times, for fixed size lists, with good empirical performance.
Upvotes: 13
Views: 3821
Reputation: 2497
(Setting aside the option of first merging and then re-sorting,) my first stab would be to treat the sorted input lists as (semi-static) priority queues and proceed in two phases. To avoid ambiguity in the term merge, I will call creating/altering an object to represent the values of "common objects" a combination (combine); to reduce clutter, I'll abbreviate priority queue as PQ.
This should take time linear in the number n of objects, plus O(c log c) for the c "common" objects whose combined object would be out of sequence in place of either object it was combined from (given expected constant time to identify and combine one set of common objects; see the remark about expected O(1) in the question).
That said, I'm afraid this doesn't properly address the main point:
Is there a way to capitalise on the final key being a (linear, monotone) combination of at least one ordered sequence and "other values"?
(With lots of common entries - think all of them.)
If combination decreased priority monotonically (in the example, addition of (positive) score values increases priority), do without a combine phase and combine objects when merging PQs, potentially reducing memory as well as time required.
Otherwise, choose one PQ to take objects from (decreasing in priority), to potentially combine with other objects.
The "worst case" would seem priority of the combined objects showing no correlation: I'm afraid the answer is
generally, no. (see user2570465's answer for an explicit argument)
(as BeeOnRope points out, the (sequence of) objects picked being dominated in combination (disadvantageous choice) may actually turn into a good case if that can be detected and exploited.)
Then again, (linear, monotone) combination can be expected to skew the distribution of keys even without (positive) correlation (assumed in the question): be sure to use a (dynamic) PQ implementation where insertion in order is the best case rather than the worst:
For one, take an implicit heap in an array (children of the element at index i are at 2i and 2i+1; or at 2i+1 and 2i+2 if you don't want to "waste" element 0, at the cost of a bit more index manipulation):
just append items (with a distribution skewed towards decreasing priority) to the end:
the expected number of exchanges with the parent is below 1 (it would be almost 1 without the skew).
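To make that concrete, here is a minimal sketch of such an implicit max-heap (1-based indexing, children of i at 2i and 2i+1); the class and method names are mine, not from any library. Note how the sift-up loop barely runs when items are pushed in roughly decreasing priority:

import java.util.ArrayList;
import java.util.List;

class ImplicitHeap {
    private final List<Integer> a = new ArrayList<>();

    ImplicitHeap() {
        a.add(0); // dummy entry so the root lives at index 1
    }

    // Append at the end and sift up. If pushes arrive in roughly decreasing
    // priority, the while loop almost never iterates: the expected number of
    // exchanges with the parent stays below 1.
    void push(int priority) {
        a.add(priority);
        int i = a.size() - 1;
        while (i > 1 && a.get(i) > a.get(i / 2)) {
            int tmp = a.get(i);
            a.set(i, a.get(i / 2));
            a.set(i / 2, tmp);
            i /= 2;
        }
    }
}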
Upvotes: 5
Reputation: 1188
Try it:
//Class Student modified.
import java.util.Objects;

public class Student {
String name = "";
int score = 0;
public Student(String name, int score) {
this.name = name;
this.score = score;
}
@Override
public boolean equals(Object v) {
if (v instanceof Student) {
return this.name.equals(((Student) v).name);
} else if (v instanceof String) {
return this.name.equals(String.valueOf(v));
} else {
return false;
}
}
@Override
public int hashCode() {
int hash = 7;
hash = 67 * hash + Objects.hashCode(this.name);
return hash;
}
}
//Class CustomComparator to sort a list by name or by score.
import java.util.Comparator;

public class CustomComparator implements Comparator<Object> {
public int orderby = 0;
@Override
public int compare(Object o1, Object o2) {
Student st1 = (Student)o1;
Student st2 = (Student)o2;
if (orderby==0){
//order by name.
return st1.name.compareTo(st2.name);
}else{
//order by score.
Integer a=st1.score;
Integer b = st2.score;
return a.compareTo(b);
}
}
}
//Example
List<Student> A = new ArrayList<Student>();
A.add(new Student("bob", 20));
A.add(new Student("john", 15));
A.add(new Student("mark", 14));
List<Student> B = new ArrayList<Student>();
B.add(new Student("bill", 11));
B.add(new Student("mark", 9));
B.add(new Student("john", 1));
List<Student> merge = new ArrayList<Student>();
merge.addAll(A);
merge.addAll(B);
//Copy into result, merging duplicates.
List<Student> result = new ArrayList<Student>();
for (Student st : merge) {
if (result.contains(st)) {
for (Student r : result) {
if (r.equals(st)) {
System.out.println(st.score + " > " +r.score);
//Keep the higher of the two scores
if (st.score > r.score) {
r.score = st.score;
break;
}
}
}
} else {
result.add(st);
}
}
//Sort result by name.
CustomComparator comparator = new CustomComparator();
comparator.orderby = 0; //use 1 to sort by score instead.
Collections.sort(result, comparator);
for (Student r : result) {
System.out.println(r.name + " = " + r.score);
}
//The result example:
bill = 11 | bob = 20 | john = 15 | mark = 14
Upvotes: 0
Reputation: 2608
As I see it, the fact that the lists are already sorted by score does not help, since we first need to merge the scores.
Also, while a hash map may seem to provide an O(1) lookup, as per my understanding the overall throughput, which includes the creation of the hash map, will still not be as good (when compared to the approach below).
The approach would be as follows:
Update #1: The sort in step 1 is on the student name.
Upvotes: 0
Reputation: 2497
It looks like you want a O(n) merge like they do with merge sort. I think I may have some bad news for you. I'm going to (hopefully) prove that you cannot do better than O(nlog(n)) for the generalized problem: (so consequently, you should just use any of the optimal O(nlog(n)) solutions presented by others). First, I'll start with the intuition as to why this is the case, and then I'll write an informal proof.
The idea is to turn the problem of sorting a list into your problem and show that if you can solve your problem faster than O(nlog(n)), then I can sort any list faster than O(nlog(n)), which we know to be false. We'll just work with integers to keep things simple.
Suppose you have some strange sequence to be sorted: X = 1, 3, 2, -10, 5, 4, 7, 25. I will now construct two lists, Dec and Inc. I start with 1 = 1 + 0 (i.e. x_1 = x_1 + 0). After that, if x_{i-1} -> x_i is an increase, I subtract 1 from my value in Dec and compute the necessary value in Inc to sum to x_i. If x_{i-1} -> x_i is a decrease, then I add 1 to my value in Inc and compute the necessary value in Dec to sum to x_i. We apply this algorithm to the sequence in the following table:
idx x Dec Inc
----------------------
1 | 1 = 1 + 0
2 | 3 = 0 + 3
3 | 2 = -2 + 4
4 | -10 = -15 + 5
5 | 5 = -16 + 21
6 | 4 = -18 + 22
7 | 7 = -19 + 26
8 | 25 = -20 + 45
Notice that I can convert from sorting to your problem in O(n) - note: reverse Inc in O(n) time to get two decreasing sequences. We can then input to your problem
A = {(1, 1), (2, 0), (3, -2), (4, -15), (5, -16), (6, -18), (7, -19), (8, -20)}
B = {(8, 45), (7, 26), (6, 22), (5, 21), (4, 5), (3, 4), (2, 3), (1, 0)}
Now if you can combine A and B into sorted order by the sum of their values (second element in the ordered pairs), and get something like
C = {(8, 25), (7, 7), (5, 5), (6, 4), (2, 3), (3, 2), (1, 1), (4, -10)}
then you've essentially done an argsort (sort by index) of the initial sequence x_i. So if you solve your problem faster than O(nlog(n)), then I can sort faster than O(nlog(n)) by solving your problem first and then converting the solution to my problem of sorting a list. In particular, I would be sorting with complexity O(n) + O(complexity to solve your problem).
Let your two key-value lists be
A = [(ka_i, va_i) | i = 1..n]
B = [(kb_i, vb_i) | i = 1..m]
sorted in decreasing order of value. You cannot find the combined list
C = [(ka_i, va_i + vb_j) | ka_i = kb_j]
in faster than O(nlog(n)) time.
The only assumption this proof makes is that you cannot sort a list faster than O(nlog(n)) time and this proof will proceed by providing a reduction that runs in O(n) time from sorting any arbitrary list to your problem.
In essence, we'll show that if we solve your problem faster than O(nlog(n)), then we can also sort any arbitrary list faster than O(nlog(n)). And we already know it is impossible to sort a list faster than nlog(n), so your desired solution must also be impossible.
For simplicity, we'll take sorting a list of integers. Let S = x_1, x_2, ..., x_n be any sequence of integers. We will now construct two lists, Dec and Inc.
We have three constraints:
1. Dec will be strictly decreasing
2. Inc will be strictly increasing
3. Dec[i] + Inc[i] = x_i for all i = 1..n
As their names imply, the first two hold by construction; the third is the invariant we maintain throughout.
Here is the reduction:
# (Assume 1-indexed lists)
1. Initialize Dec = [x_1] and Inc = [0]
2. For i = 2..n:
   a. if x[i] > x[i-1] then
        Dec.append(Dec[i-1] - 1)
        Inc.append(x[i] - Dec[i])
      else # We must have x[i] <= x[i-1]
        Inc.append(Inc[i-1] + 1)
        Dec.append(x[i] - Inc[i])
3. Create list A and B:
A = [(i, Dec[i]) | i = 1..n]
B = [(i, Inc[i]) | i = 1..n]
4. B = reverse(B) # Reverse B because B was in increasing order and we
# need both lists to be in decreasing order
5. A and B are inputs to your algorithm.
If your algorithm can combine A and B into sorted order,
then we have also sorted S (via argsort on the keys).
You're probably also hungry for a proof that my ad hoc method of choosing to increase Inc by 1 or decrease Dec by 1 works. Well here's an informal "proof" (you can formalize it by using induction):
Case 1: x_{i} > x_{i-1}. Recall that in this case, we choose to decrement Dec by 1. We are given that x_{i} > x_{i-1} and we know that Dec_{i-1} + Inc_{i-1} = x_{i-1}. We can also say that (Dec_{i-1} - 1) + (Inc_{i-1} + 1) = x_{i-1}.
Since x_{i} > x_{i-1}, we must have x_{i} >= x_{i-1} + 1. Therefore, x_{i} > (Dec_{i-1} - 1) + (Inc_{i-1} + 1). Therefore, if we only decrement Dec by 1, we will be forced to add at least 1 to Inc, so Inc remains strictly increasing.
Case 2: x_{i} <= x_{i-1}. Recall that in this case, we choose to increment Inc by 1. We are given that x_{i} <= x_{i-1} and we know that Dec_{i-1} + Inc_{i-1} = x_{i-1}. We can also say that (Dec_{i-1} - 1) + (Inc_{i-1} + 1) = x_{i-1}, and since x_{i} <= x_{i-1}, it must be the case that x_{i} <= (Dec_{i-1} - 1) + (Inc_{i-1} + 1). Therefore, if we add 1 to Inc, we are sure that we must subtract at least 1 from Dec, so Dec remains strictly decreasing.
Your problem cannot be done faster than O(nlog(n)). You are better off just combining into a HashMap and then sorting its elements in O(nlog(n)) because it is impossible to find a faster solution.
Feel free to comment, though, if you find a problem with the reduction or have questions. I'm pretty sure it's correct. Of course, if I'm mistaken about sorting being no faster than O(nlog(n)), this whole proof falls apart, but last I checked, someone already proved that O(nlog(n)) was the fastest complexity for comparison-based sorting. Comment if you prefer a formal reduction. It's getting late right now for me and I skipped some "formalizations", but I can edit them in when I get a chance.
If you code up the algorithm for creating the reduction, you may gain a better understanding.
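For instance, here is a minimal Java sketch of the reduction (class and method names are mine); it follows the table above, starting from Dec = [x_1] and Inc = [0]:

import java.util.ArrayList;
import java.util.List;

class Reduction {
    // Build Dec (strictly decreasing) and Inc (strictly increasing) so that
    // dec.get(i) + inc.get(i) == x[i] for every index i.
    static void build(int[] x, List<Integer> dec, List<Integer> inc) {
        dec.add(x[0]);                          // x[0] = x[0] + 0
        inc.add(0);
        for (int i = 1; i < x.length; i++) {
            if (x[i] > x[i - 1]) {              // increase: force Dec down by 1
                dec.add(dec.get(i - 1) - 1);
                inc.add(x[i] - dec.get(i));     // Inc picks up the difference
            } else {                            // decrease (or equal): force Inc up by 1
                inc.add(inc.get(i - 1) + 1);
                dec.add(x[i] - inc.get(i));     // Dec picks up the difference
            }
        }
    }

    public static void main(String[] args) {
        int[] x = {1, 3, 2, -10, 5, 4, 7, 25};  // the example sequence X
        List<Integer> dec = new ArrayList<>(), inc = new ArrayList<>();
        build(x, dec, inc);
        System.out.println("Dec: " + dec);      // strictly decreasing
        System.out.println("Inc: " + inc);      // strictly increasing
    }
}

Reversing Inc and pairing each value with its index then gives the two decreasing input lists A and B used in the proof.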
Also, see this post if you want an explanation of the O(nlog(n)) bound on sorting: What are the rules for the "Ω(n log n) barrier" for sorting algorithms?
Upvotes: 5
Reputation: 1891
It seems to me that any solution should generally fall into the category of O(n*log(n)) complexity (with n = length(L1) + length(L2), or n = max(length(L1), length(L2))).
My basic algorithm would be as follows.
Let's use two intermediate structures:
- a TreeSet R, which guarantees ordering by rank,
- a HashMap M, which guarantees constant-time insertion and retrieval.
Call R's size n.
1 for each student in each list
1.1 find the student in M by name (O(1))
1.2 if the student is found
1.2.1 find the student in R by its rank (O(log(n)))
1.2.2 remove the student from R (O(log(n)))
1.2.3 update the student's rank
1.3 else
1.3.1 put the student in M (O(1))
1.4 put the student in R (O(log(n)))
2 at the end (if needed) transform the TreeSet into a list
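A minimal Java sketch of this approach (assuming the Student(name, score) constructor from the other answer's example; the tie-break on name in the comparator is my addition, so that distinct students with equal scores can coexist in the TreeSet):

import java.util.*;

class TreeSetMerge {
    static List<Student> merge(List<Student> l1, List<Student> l2) {
        Map<String, Student> m = new HashMap<>();                 // name -> current entry
        TreeSet<Student> r = new TreeSet<>(
                Comparator.comparingInt((Student s) -> s.score).reversed()
                          .thenComparing((Student s) -> s.name)); // descending score, ties by name
        for (List<Student> list : Arrays.asList(l1, l2)) {
            for (Student s : list) {
                Student existing = m.get(s.name);                 // 1.1: O(1) lookup
                Student merged;
                if (existing != null) {                           // 1.2: seen before, combine
                    r.remove(existing);                           // 1.2.2: O(log n)
                    merged = new Student(s.name, existing.score + s.score);
                } else {
                    merged = s;                                   // 1.3: first sighting
                }
                m.put(s.name, merged);
                r.add(merged);                                    // 1.4: O(log n)
            }
        }
        return new ArrayList<>(r);                                // 2: already in score order
    }
}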
The overall complexity is O(n*log(n)).
Assuming L1 is the longer of the two lists, a small optimization would be to skip the lookup in M while traversing L1 (nothing colliding can have been inserted yet); the complexity is the same, but you'll do fewer operations in absolute terms. The best case is of course when Len(L1) >> Len(L2).
There may be more complex solutions or better data structures to reduce the number of operations, but I don't think there can be a better O complexity, as you basically have two possibilities:
1- maintaining the result list ordered, so you scan the lists, find matches and recompute the position each time
2- using an intermediate map to lower the match-finding complexity, then sorting the result
Both possibilities usually come out at O(n*log(n)).
Upvotes: 0
Reputation: 719249
It sounds like you need to use an adaptive sort algorithm.
"A sorting algorithm falls into the adaptive sort family if it takes advantage of existing order in its input. It benefits from the presortedness in the input sequence – or a limited amount of disorder for various definitions of measures of disorder – and sorts faster. Adaptive sorting is usually performed by modifying existing sorting algorithms." - Wikipedia article linked above.
Examples include insertion sort and Timsort; see the article above for more. Note that in Java 8, the Arrays.sort(Object[]) library method uses a modified Timsort.
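As a small illustration of why an adaptive sort helps here (my sketch, assuming the Student class above): insertion sort does O(n + I) work, where I is the number of inversions, so a merged list that is already nearly in score order is cheap to finish off.

// Descending insertion sort: the inner loop runs once per inversion,
// so nearly sorted input is handled in close to linear time.
static void insertionSortByScoreDesc(Student[] a) {
    for (int i = 1; i < a.length; i++) {
        Student key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j].score < key.score) {
            a[j + 1] = a[j];   // shift lower-scoring students to the right
            j--;
        }
        a[j + 1] = key;
    }
}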
I am not aware of any published algorithm that deals with the specific requirements of your example, but here is an idea:
Perform a classic merge on the two input lists L1 and L2, putting the records that get combined (matched by name) into a temporary list A and everything else into a list B (which stays sorted).
Sort the temporary list A.
Merge lists A and B.
Assuming that M and N are the lengths of the input lists and R is the number of matched (combined) records, the overall complexity is O(M + N + RlogR). If R is small relative to M + N, then this should be an improvement.
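Here is a hedged sketch of that idea (assuming both inputs are sorted descending by score, each name appears at most once per list, and the Student(name, score) constructor from the example answer exists); the moved set is one possible way of handling a record that was already emitted into B before its match was found:

import java.util.*;

class MergeThenPatch {
    static List<Student> mergeAndResort(List<Student> l1, List<Student> l2) {
        Map<String, Student> seen = new HashMap<>();   // name -> record already placed in B
        List<Student> a = new ArrayList<>();           // combined records, possibly out of order
        List<Student> b = new ArrayList<>();           // everything else, still sorted
        Set<String> moved = new HashSet<>();           // names whose entry in B became stale

        int i = 0, j = 0;                              // classic merge of L1 and L2: O(M + N)
        while (i < l1.size() || j < l2.size()) {
            Student next;
            if (j >= l2.size() || (i < l1.size() && l1.get(i).score >= l2.get(j).score)) {
                next = l1.get(i++);
            } else {
                next = l2.get(j++);
            }
            Student prev = seen.get(next.name);
            if (prev != null) {                        // second occurrence: combine the scores
                a.add(new Student(next.name, prev.score + next.score));
                moved.add(next.name);                  // the copy already in B is now stale
            } else {
                seen.put(next.name, next);
                b.add(next);
            }
        }
        b.removeIf(s -> moved.contains(s.name));       // drop stale entries; B stays sorted
        a.sort(Comparator.comparingInt((Student s) -> s.score).reversed());  // O(R log R)

        List<Student> out = new ArrayList<>(a.size() + b.size());
        int x = 0, y = 0;                              // final classic merge of A and B
        while (x < a.size() || y < b.size()) {
            if (y >= b.size() || (x < a.size() && a.get(x).score >= b.get(y).score)) {
                out.add(a.get(x++));
            } else {
                out.add(b.get(y++));
            }
        }
        return out;
    }
}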
In your example, every case where there is a match between elements in the input lists is likely to move the element in the order. If it moves the element, it will move to later in the order (and never earlier). So another idea is to do a three-way merge between the two original lists and a priority queue. When you get a match, you merge the counts and add the result to the priority queue.
The complexity is similar to the previous one, but you avoid the extra pass to merge the lists. Also, the RlogR becomes RlogA, where A is the average size of the priority queue.
Keep in mind that I'm especially interested in the case where R is approximately equal to max(M,N), and also M == N.
(You didn't state that in your question! And, in fact it doesn't make any sense for R to be > min(M,N)!)
In that case, maybe just use the priority queue as an incremental sorter. Throw all merged records and all records that cannot be merged into the queue, and pull out records when they have a key / score that is less than the current heads of the two lists. Assuming that M and N are the list lengths, and A is the average priority queue size, then the complexity is O(max(M,N) * log A). Whether this is an improvement on a simple re-sort will depend on whether the average A is significantly (in Big O terms) less than max(M,N). That will depend on the inputs ... and the merging function.
The number (N) varies, but 256 to 1,000 is typical. Perhaps as much as 10,000.
For lists of that typical size, you are down at a level where the complexity analysis is not going to be helpful. But also, you are down at a level where optimization becomes pointless ... unless you are doing the operation many, many times, or on a tight "time budget".
This is all very approximate, and my maths are "sketchy" at best.
A proper investigation would entail hundreds of hours to research, code, test, benchmark and analyze the various alternatives ... and we'd probably still get the answer that it depends on the input data set size and distribution.
Upvotes: 7
Reputation: 2727
Maintain a map from something unique (the student's name) to the actual Student info.
Map<String, Student> scores = new HashMap<>();
Iterate through all the lists and put them into the scores map
for (Student s : list1) {
    if (scores.containsKey(s.name)) {
        // sum the scores into a fresh Student so the original lists stay unmodified
        scores.put(s.name, new Student(s.name, s.score + scores.get(s.name).score));
    } else {
        scores.put(s.name, s);
    }
}
// ...and do the same for the second list
Sort the entrySet using Java 8 streams
List<Student> result = scores.entrySet()
        .stream()
        .sorted((e1, e2) -> e2.getValue().score - e1.getValue().score)  // descending by score
        .map(Map.Entry::getValue)
        .collect(Collectors.toList());
This is still O(N log N).
You cannot do this with the standard merge algorithm because the lists contain names whose positions are not the same. The standard merge algorithm does not process the same element twice. After finding the duplicate and adding the student's scores, you need to re-sort, so you are breaking the precondition of the merge step that both lists are sorted at all times by their values.
Upvotes: 0