Reputation: 1766
I have a task at my university and still can't get it right.
The input is N <= 1000000 (the number of strings that follow), then the strings themselves, each less than 1000 characters.
I need to print a single number to stdout: the count of unique strings. How do I do that? The major restriction is that I can only use the numpy lib, with a 5-second time limit and 5 MB of RAM (!). It also says the answer is accepted if it differs from the real answer by no more than 5%.
I tried this code:
import numpy as np

N = int(input())
a = np.array([])
for i in range(N):
    x = input()
    if not np.any(a == x):
        a = np.append(a, x)
print(len(a))
but it took 12 MB and 97 ms. Another attempt:

import numpy as np

N = int(input())
results = np.empty(N, dtype=object)
for i in range(N):
    results[i] = input()
print(len(np.unique(results)))

...it took 10 MB.
Any ideas how to pull that off? :)
UPDATED: I don't know what's going on, but I just checked this code:
N = int(input())
a = np.array([])
cc = 0
for i in range(N):
    x = input()
    cc += 1
    if cc < 500:
        if not np.any(a == x):
            a = np.append(a, x)
print(len(a))
and it showed me 81 ms and 8.7 MB. How is that possible if I only filled 500 elements into the array?
TEST 3:
this took 98 ms and 6.36 MB (almost 5!)
N = int(input())
s = set()
for i in range(N):
    x = input()
    s.add(x)
print(len(s))
TEST 4:
this took 98 ms and 5.41 MB.
import hashlib

N = int(input())
s = set()
for i in range(N):
    x = input()
    s.add(hashlib.md5(x.encode()).hexdigest())
print(len(s))
TEST 5:
5.32 MB
import hashlib

N = int(input())
s = set()
s_add = s.add
for i in range(N):
    s_add(hashlib.md5(input().encode()).hexdigest()[:-3])
print(len(s))
TEST 6:
98 ms and 5.63 MB
import hashlib
import itertools

N = int(input())
s = set()
s_add = s.add
for _ in itertools.repeat(None, N):
    s_add(str(abs(hash(input())))[:-3])
print(len(s))
TEST 7:
179 ms and 6.92 MB
import itertools

N = int(input())
s = set()
s_add = s.add
for _ in itertools.repeat(None, N):
    s_add(abs(hash(input())))
print(len(s))
TEST 8:
286 ms and 5.15 MB
N = int(input())
s = set()
s_add = s.add
for i in range(N):
    s_add(abs(hash(input())))
print(len(s))
Upvotes: 2
Views: 198
Reputation: 663
I've been following this with great interest and would suggest (with all credit to @Brad Solomon):
import hashlib
import itertools

N = int(input())
s = set()
s_add = s.add
for _ in itertools.repeat(None, N):
    s_add(hashlib.md5(input().encode()).hexdigest()[:-3])
print(len(s))
(Though this doesn't really solve it by using only the numpy lib.)
EDIT: so something like this:
N = int(input())
s = set()
s_add = s.add
for i in range(N):
    s_add(input()[:-300])
print(len(s))
Upvotes: 0
Reputation: 40888
A couple of speedups:

- Loop with itertools.repeat(); it will avoid manufacturing distinct integer objects in each loop.
- Bind s_add to s.add for faster lookup within the tightly-bound loop: s = set(); s_add = s.add, then call s_add(x) within the loop. (If memory serves me, a recent version of Python 3 did a much better job of minimizing the difference in attribute lookups, so this one is probably of dubious incremental benefit, but give it a try.)

Both are combined in the sketch below.
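For concreteness, a minimal sketch with both tweaks applied (essentially the question's TEST 3 with the two changes; I haven't measured this exact snippet):

import itertools

N = int(input())
s = set()
s_add = s.add                         # bind the method once, outside the loop
for _ in itertools.repeat(None, N):   # no per-iteration integer objects
    s_add(input())
print(len(s))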
Regarding hashing - two common ones to choose from would be MD5 and SHA-1, which produce hashed bytes objects of 16 and 20 bytes, respectively. On my laptop at least, SHA-1 comes out faster, but not by a ton:
>>> import hashlib
>>> import string
>>> import random
>>> b = bytes(
... "".join(random.choices(string.ascii_letters, k=999)),
... 'utf-8'
... )
>>> %timeit hashlib.md5(b).digest()
2.48 µs ± 8.62 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit hashlib.sha1(b).digest()
1.96 µs ± 6.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
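One further memory note: since only the count matters, you could store the raw digest() (16 bytes for MD5) instead of the 32-character hexdigest(), keeping the same information in roughly half the payload per entry. A sketch along the lines of the question's TEST 4; whether it fits under the 5 MB limit is untested:

import hashlib

N = int(input())
s = set()
s_add = s.add
for i in range(N):
    # 16 raw bytes per entry instead of a 32-character hex string
    s_add(hashlib.md5(input().encode()).digest())
print(len(s))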
But you may not need something that heavyweight. (You're not interested in the hash for its cryptographic security; you're interested in it for the sake of saving space in the accumulating set.) There's also the builtin hash()
, and though this isn't guaranteed to be the same for the same input across Python sessions, that wouldn't seem like a requirement because you're processing the strings from a single process:
>>> %timeit hash(b)
84.9 ns ± 0.193 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
As a disclaimer, I'm not too familiar (okay, not familiar at all) with the implementation of hash() and the entropy it produces. (Some reading here.) I would dare say the chance of collision is mathematically higher than with MD5 or SHA-1, though probably still quite, quite low.
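For a rough sense of scale, here's a back-of-the-envelope birthday bound, assuming hash() yields uniform 64-bit values (true for CPython string hashes on 64-bit builds, though I haven't verified the grader's environment):

# Expected number of pairwise hash() collisions among N distinct strings,
# assuming hashes behave like uniform 64-bit values.
N = 1_000_000
expected_collisions = N * (N - 1) / 2 / 2**64
print(expected_collisions)  # ~2.7e-08 -- far below the allowed 5% error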
Upvotes: 2