Reputation: 37
I have a huge file with 200K lines, and I need to compute the rolling median of the number of distinct words in each line.
I used numpy to calculate the median as follows:
import numpy as np
a = np.array([])
# np.insert returns a new array, so the result has to be assigned back
a = np.insert(a, 0, len(unique_word_list_by_line))
median = np.median(a)
I feel that this is not efficient, as numpy creates a new array every time I insert an element. Is there a way to insert an element into a numpy array in place?
Thanks
Upvotes: 1
Views: 5556
Reputation: 9818
It is never a good idea to fill a numpy array dynamically, because every insertion involves resizing and copying.
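As a minimal sketch of two ways to avoid the repeated copying: collect the counts in a plain Python list and convert once, or preallocate the array when the number of lines is known. The helper count_unique_words and the sample lines are hypothetical stand-ins, not part of the question.

```python
import numpy as np

def count_unique_words(line):
    # Hypothetical helper: number of distinct whitespace-separated words.
    return len(set(line.split()))

lines = ["a b a", "c d e f", "g"]  # stand-in for the lines of the real file

# Option 1: append to a Python list (cheap), then convert to an array once.
counts = np.array([count_unique_words(line) for line in lines])

# Option 2: preallocate when the length is known and fill by index.
prealloc = np.empty(len(lines), dtype=np.int64)
for i, line in enumerate(lines):
    prealloc[i] = count_unique_words(line)
```

Either way the array is allocated once instead of once per line.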
The rolling median is not as trivial as it seems. This blog article discusses different implementations, such as one based on a skip list.
EDIT: It seems you use pandas. pandas already has an implementation that uses skip lists and skips NaN values. Have a look here.
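For illustration, this is roughly how the pandas route could look with the current Series API. The sample counts and the window size of 3 are assumptions made for the example. expanding() gives the median of everything seen so far, while rolling() uses a fixed-size window.

```python
import pandas as pd

# Illustrative per-line distinct-word counts (made-up data).
counts = pd.Series([2, 4, 1, 3, 5])

# Median of all values seen so far, one result per line.
running = counts.expanding().median()

# Median over a sliding window of the last 3 values; the first
# two entries are NaN because the window is not yet full.
windowed = counts.rolling(window=3).median()
```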
A recipe for implementing it in pure Python can also be found here.
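The linked recipe is not reproduced here. As a simpler stand-in, the sketch below keeps the current window in a sorted list with bisect instead of a skip list. It is not the skip-list approach: each update shifts list elements, so it costs time proportional to the window size. The function name rolling_median is my own.

```python
from bisect import bisect_left, insort
from collections import deque

def rolling_median(values, window):
    """Yield the median of each full window of `window` consecutive values."""
    recent = deque()  # window contents in arrival order
    ordered = []      # the same values, kept sorted
    for v in values:
        recent.append(v)
        insort(ordered, v)
        if len(recent) > window:
            # Drop the oldest value from both containers.
            old = recent.popleft()
            del ordered[bisect_left(ordered, old)]
        if len(recent) == window:
            mid = window // 2
            if window % 2:
                yield ordered[mid]
            else:
                yield (ordered[mid - 1] + ordered[mid]) / 2
```

For example, list(rolling_median([2, 4, 1, 3, 5], 3)) gives [2, 3, 3].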
Upvotes: 5
Reputation: 10791
I'd recommend doing it like this. Assuming you've loaded a text file into file, you could create the list a as:
a = []
for line in file:
    a.append(num_unique_words(line))
Where I've assumed you have a function num_unique_words that calculates the number of unique words in a string.
Now convert it to an array:
a = np.array(a)
Now call np.median on views into the array (note that the views are created by slicing the array):
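# float dtype so that half-integer medians are not truncated
median = np.empty(len(a), dtype=float)
for idx in range(len(a)):
    # a[:idx + 1] is a view of the first idx + 1 elements
    median[idx] = np.median(a[:idx + 1])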
Upvotes: 1
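Note that the loop in this answer recomputes the median from scratch for every prefix, so the total work grows roughly quadratically with the number of lines. One common alternative for a running (expanding) median, offered here as a hedged sketch and not as the answer author's method, is to keep two heaps: one for the smaller half of the values and one for the larger half. The function name running_median is my own.

```python
import heapq

def running_median(values):
    """Return the median of values[:i + 1] for every i, using two heaps."""
    low = []   # max-heap (stored negated) holding the smaller half
    high = []  # min-heap holding the larger half
    medians = []
    for v in values:
        if low and v > -low[0]:
            heapq.heappush(high, v)
        else:
            heapq.heappush(low, -v)
        # Rebalance so that low has the same size as high, or one more.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low):
            heapq.heappush(low, -heapq.heappop(high))
        if len(low) > len(high):
            medians.append(float(-low[0]))
        else:
            medians.append((-low[0] + high[0]) / 2)
    return medians
```

Each new value costs a logarithmic number of heap operations, which matters for a file with 200K lines.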