Aseem Goyal

Reputation: 2723

Median of large amount of numbers for each sets of given size

The question is :

find the median of each fixed-size window (k numbers) over a large data set (n numbers)

What I thought is:
Maintain two heaps: a max-heap for the numbers less than the current median and a min-heap for the numbers greater than the current median.
The main idea is to FIND the first element of the previous window in one of the heaps (depending on whether it is < or > the current median) and replace it with the new element we encounter.
Then rebalance so that |size(heap1) - size(heap2)| = 1 or 0, because the median is the average of the two top elements if size1 == size2, and otherwise the top element of the larger heap.

The problem I am facing is that the time complexity blows up: finding the element to remove inside a heap takes O(k) time, so the total is O(n*k), and I am not able to achieve the desired complexity of O(n*log k) (as required by the source of the question).

How can this be reduced without using extra space?

edit: input: 1 4 3 5 6 2, k = 4
medians:
from 1 4 3 5: (4+3)/2
from 4 3 5 6: (4+5)/2
from 3 5 6 2: (3+5)/2
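To pin down the expected output for this example, here is a brute-force reference in Python (the question names no language). It re-sorts every window, so it runs in O(n * k log k) and is only a correctness baseline, not the efficient solution being asked about:

```python
def sliding_medians_bruteforce(a, k):
    """Median of every length-k window, by sorting each window."""
    out = []
    for i in range(len(a) - k + 1):
        w = sorted(a[i:i + k])
        mid = k // 2
        # even k: average the two middle elements; odd k: take the middle one
        out.append((w[mid - 1] + w[mid]) / 2 if k % 2 == 0 else w[mid])
    return out

print(sliding_medians_bruteforce([1, 4, 3, 5, 6, 2], 4))  # [3.5, 4.5, 4.0]
```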

Upvotes: 1

Views: 1168

Answers (3)

Vikram Bhat

Reputation: 6246

This problem is similar to the one in an efficient implementation of Dijkstra's shortest path, where we need to delete (in Dijkstra's case, update) an element that is not at the top of the heap. You can use the same workaround here, at the cost of extra space. First, you cannot use the built-in heap library: build your own heap data structure, and maintain a pointer to each element in the heap, updating these pointers whenever adding or removing an element moves things around. After calculating the median of the first k elements, delete the first element directly from the appropriate heap (min or max, according to whether it is greater or less than the median) using its pointer, then heapify at that position. The heap sizes then change, and you can get the new median using the same size-adjustment logic you are already using.

Heapify takes O(log k), hence the total cost will be O(n*logk), but you will need O(n) more space for the pointers (O(k) if you only keep pointers for the elements currently in the window).
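A minimal sketch of such a pointer-tracking heap in Python (the thread names no language): a min-heap whose `pos` map lets you remove an arbitrary element in O(log k). The class name and structure are illustrative, not from the answer. It assumes distinct items; in the sliding-window setting you would store `(value, index)` pairs to allow duplicates, and get a max-heap by negating values:

```python
class IndexedHeap:
    """Min-heap that tracks each item's position in `pos`,
    so an arbitrary item can be removed in O(log k)."""

    def __init__(self):
        self.a = []    # heap array
        self.pos = {}  # item -> index in self.a

    def _swap(self, i, j):
        self.a[i], self.a[j] = self.a[j], self.a[i]
        self.pos[self.a[i]] = i
        self.pos[self.a[j]] = j

    def _sift_up(self, i):
        while i and self.a[i] < self.a[(i - 1) // 2]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.a)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.a[c] < self.a[smallest]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, x):
        self.a.append(x)
        self.pos[x] = len(self.a) - 1
        self._sift_up(len(self.a) - 1)

    def remove(self, x):
        """Delete x wherever it sits, then restore the heap at that slot."""
        i = self.pos.pop(x)
        last = self.a.pop()
        if i < len(self.a):       # x was not the last slot
            self.a[i] = last
            self.pos[last] = i
            self._sift_up(i)
            self._sift_down(i)
```

With two of these (one negated for the max side), evicting the element that leaves the window becomes a direct `remove` instead of an O(k) search.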

Upvotes: 0

mcdowella

Reputation: 19621

If you can find a balanced tree implementation that gives you efficient access to the central element, you should probably use it. You could also do this with heaps much as you suggest, as long as you keep an extra array of length k which tells you where each element of the window lives in its heap, and which heap it is in. You will have to modify the code that maintains the heap so that it updates this array when it moves things around, but heap code is a lot easier to write and a lot smaller than balanced tree code. Then you don't need to search through the whole heap to remove the item that has just gone off the edge of the window, and the cost comes down to O(n log k).

Upvotes: 1

Fred Foo

Reputation: 363787

You can solve this problem using an order-statistic tree, which is a BST with some additional information that allows finding medians, quantiles and other order statistics in O(log n) time in a tree with n elements.

First, construct an OST with the first k elements. Then, in a loop:

  1. Find and report the median value.
  2. Remove the first element that was inserted into the tree (you can find out which element this was in the array).
  3. Insert the next element from the array.

Each of these steps takes O(log k) if the tree is self-balancing, because we maintain the invariant that the tree never grows beyond size k, which also gives O(k) auxiliary space. The preprocessing takes O(k log k) time while the loop repeats n + 1 - k times for a total time of O(n log k).
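The loop above can be sketched in Python with a plain sorted list standing in for the order-statistic tree. Note the hedge: `bisect` gives O(k) insertion/deletion, not the O(log k) a real OST or self-balancing tree provides, but the structure of the algorithm (report median by rank, remove the outgoing element, insert the incoming one) is the same:

```python
from bisect import insort, bisect_left

def sliding_medians_sorted(a, k):
    """Sliding-window median; sorted list stands in for the OST."""
    window = sorted(a[:k])           # preprocessing: first k elements
    out = []
    for i in range(k, len(a) + 1):
        # 1. report the median via order statistics (ranks k//2 - 1, k//2)
        mid = k // 2
        med = (window[mid - 1] + window[mid]) / 2 if k % 2 == 0 else window[mid]
        out.append(med)
        if i == len(a):
            break
        # 2. remove the element leaving the window
        del window[bisect_left(window, a[i - k])]
        # 3. insert the next element from the array
        insort(window, a[i])
    return out

print(sliding_medians_sorted([1, 4, 3, 5, 6, 2], 4))  # [3.5, 4.5, 4.0]
```

The loop body executes n + 1 - k times, matching the analysis in the answer; swapping the list for an order-statistic tree drops the per-step cost to O(log k).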

Upvotes: 1

Related Questions