casper
casper

Reputation: 39

Find N largest lines from a file: how would you make this better?

I was recently rejected from a potential employer after submitting this code. They suggested I wasn't technically capable enough. I'm wondering if someone could shed light on to how to make this better/more efficient.

The question was to find the N longest lines from a file of multiple lines. This ultimately boiled down to a sorting problem, so I built an algorithm to find the N largest numbers from a list of numbers as so:

def selection(numbers, n):

    maximum = []

    for x in range (0, n):

        maximum.append(numbers[x])
        ind = x

        for y in range ( x, len(numbers) ):
            if numbers[y] > maximum[len(maximum)-1]:
                maximum[len(maximum)-1] = numbers[y]
                numbers[ind], numbers[y] = numbers[y], numbers[ind]

    return maximum

This runs in O(n), unless N = n, in which case it runs in O(n^2). I was surprised to hear them doubt my technical abilities, so I thought I'd bring it to you SO. How do I make this better?

EDIT: Thanks for the feedback. To clarify: I populated a list with the line-by-line word-counts from the file, and ran it through this function.

EDIT2: Some people mentioned syntax. I've only been doing Python for about a day or two. My employer suggested I write it in Python (and I mentioned that I didn't know Python), so I assumed small syntax errors and methods wouldn't be such an issue.

EDIT3: Turns out my initial reasoning was flawed with the selection sort. I had it in my head that a min-heap would be nlogn, but I forgot that the average complexity for my code is n^2. Thanks for the help everyone.

Upvotes: 0

Views: 479

Answers (3)

Fred Foo
Fred Foo

Reputation: 363597

from heapq import nlargest

def longest_lines(n, filename):
    with open(filename) as input:
        return nlargest(n, input, key=len)

Alright, addressing the comments below:

def longest_lines(n, filename):
    heap = []
    with open(filename) as input:
        for ln in filename:
            push(heap, ln)
            if len(heap) > n:
                pop(heap)
    return heap

where push and pop are the good old min-heap insert and delete-min algorithms that can be found in any textbook (and that I never get right in one go, so I'm not posting them now), comparing lines by their length. This runs in O(N×lg(n)) time where N is the number of lines in the file, consuming O(n) temporary space.

Note that the resulting list is not sorted by length, but adding that can be done by popping the elements until the heap is empty and reversing the result of that.

Upvotes: 4

erickson
erickson

Reputation: 269687

I would use a heap, but a min-heap, not a max-heap, which may seem counterintuitive.

Create an empty heap.
For each line, 
  if there are less than N lines in the heap, add the current line;
  otherwise,
    if the current line is longer than the minimum element in the heap,
      remove the minimum element from the heap, and
      add the current line to the heap.
Return the contents of the heap.

Upvotes: 2

Martin James
Martin James

Reputation: 24857

Does this boil down to sorting? If you have a list that can hold N lines, you could keep a couple of ints that hold the length of the shortest line so far found and its index in the list. If any line is read that is longer than this line, you only need to iterate the list once to find the new shortest line and also to replace the old shortest line.

Upvotes: 0

Related Questions