Sparsh Saxena
Sparsh Saxena

Reputation: 11

How can I improve the performance of my program that counts the occurrence of numbers in a list?

I am trying to solve the following problem:

Given: A positive integer k≤20k≤20, a positive integer n≤104n≤104, and k arrays of size n containing positive integers not exceeding 10^5.

Return: For each array, output an element of this array occurring strictly more than n/2 times if such element exists, and "-1" otherwise.

My attempt at it is the following:

myFile = open('rosalind_maj.txt', 'r')
outputf = open('output.txt', 'w')

TOTAL_ARRAYS, SIZE = myFile.readline().split() 

TOTAL_ARRAYS = int(TOTAL_ARRAYS)
SIZE = int(SIZE)

lines=myFile.readlines()


for i in range(0, TOTAL_ARRAYS):

    array = map(int, (lines[i].strip().split()))
    occurance = []
    output = []
    considered = []

    for element in array:

        if element not in considered:
            if (array.count(element)) > SIZE/2:
                output.append(element)
            considered.append(element)


    if len(output) == 0:
        print (-1)
        outputf.write( str(-1) + " " )
    else:
        for item in output:
            print item
            outputf.write(str(item) + " " )

A sample of the input text file looks like the following:

4 8
5 5 5 5 5 5 5 5
8 7 7 7 1 7 3 7
7 1 6 5 10 100 1000 1
5 1 6 7 1 1 10 1

and the correct answer to the sample input is:

5 7 -1 -1

My code works but is too slow. It takes 20 secs for an input with 19 arrays, each of size 8000 elements.

Upvotes: 0

Views: 415

Answers (2)

bli
bli

Reputation: 8184

I tried the following:

#!/usr/bin/env python
from itertools import imap
from collections import defaultdict

with open("rosalind_maj.txt", "r") as in_file:
    with open("output.txt", "w") as out_file:
        nb_arrays, ar_size = map(int, in_file.readline().split())
        majority = ar_size / 2.0
        # i is probably useless if the file has the correct number of lines
        for i, line in enumerate(in_file, start=1):
            counts = defaultdict(int)
            found = False
            if i > nb_arrays:
                break
            for element in line.strip().split():
                counts[element] += 1
                if counts[element] > majority:
                    out_file.write("%s " % element)
                    found = True
                    break
            if not found:
                out_file.write("-1 ")

On a dataset of a size about the one you said you tested (19 lines with 8000 elements), and on my old 2007 laptop, this runs in about 0.15 seconds whereas your code runs in about 7 seconds (I suppressed the print from it).

The resulting output files are the same.

I'm not sure the reason, but the slowness could be in part due to the append() and testing the presence of elements in a list in your code.

Upvotes: 0

henry14
henry14

Reputation: 11

You can use Majority Vote Algorithm. It runs in O(n) time and takes O(1) space!

Upvotes: 1

Related Questions