Alex

Reputation: 37

Rolling median for a large dataset - python

I have a huge file with 200K lines, and I need to compute a rolling median of the number of distinct words in each line.

I have used numpy to calculate median as below

a = np.array([])
a = np.insert(a, 0, len(unique_word_list_by_line))
median = np.median(a)

I feel that this is not efficient, as numpy creates a new array every time I insert an element. Is there a way to insert an element into a numpy array in place?
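To illustrate what I mean, np.insert always returns a fresh copy rather than modifying its argument:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.insert(a, 0, 99)  # returns a NEW array; `a` is left unchanged
# a is still [1, 2, 3]; b is [99, 1, 2, 3]
```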

Thanks

Upvotes: 1

Views: 5556

Answers (2)

Kirell

Reputation: 9818

It is never a good idea to fill a numpy array dynamically: every insertion involves resizing and copying the whole array.

The rolling median is not as trivial as it seems. This blog article discusses different implementations, such as skip lists.

EDIT: It seems you use pandas. Pandas already has a rolling-median implementation that uses skip lists and skips NaN values. Have a look here.
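With a recent pandas, the rolling median over a fixed window is a one-liner (the counts below are made-up sample data, not from the question):

```python
import pandas as pd

# distinct-word counts per line (made-up sample data)
counts = pd.Series([4, 7, 2, 9, 3, 5])

# median over a sliding window of 3 lines; the first
# window - 1 entries are NaN because the window is not yet full
rolling = counts.rolling(window=3).median()
# e.g. rolling.iloc[2] is median(4, 7, 2) == 4.0
```

For a median over everything seen so far (rather than a fixed window), `counts.expanding().median()` works the same way.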

A recipe for its implementation in pure python can also be found here.

Upvotes: 5

farenorth

Reputation: 10791

I'd recommend doing it like this. Assuming you've loaded a text file into file, you could create the list a as:

a = []
for line in file:
    a.append(num_unique_words(line))

Where I've assumed you have a function num_unique_words that calculates the number of unique words in a string.

Now convert it to an array:

a = np.array(a)

Now call np.median on views into the array (note that the views are created by slicing the array):

median = np.empty(len(a))                 # float array: medians can be fractional
for idx in range(len(a)):
    median[idx] = np.median(a[:idx + 1])  # median of the first idx + 1 values
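A self-contained version of this expanding-median loop, with made-up counts (note the idx + 1 slice, so the first window is never empty, and the float output array, since medians of an even-length window can be fractional):

```python
import numpy as np

a = np.array([4, 7, 2, 9])                # made-up distinct-word counts
medians = np.empty(len(a))                # float array for fractional medians
for idx in range(len(a)):
    medians[idx] = np.median(a[:idx + 1])  # median of everything seen so far
# medians -> [4.0, 5.5, 4.0, 5.5]
```

Each pass over a growing slice makes this O(n^2) overall, which is fine for 200K lines but is where the skip-list approaches from the other answer start to pay off.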

Upvotes: 1

Related Questions