Artur Kharchenko

Reputation: 13

How to get a set of unique values from many lists efficiently (Python)

I have many (about 6000) text files, each containing a list of IDs (one ID per line). Each file can hold anywhere from 10,000 to 10 million IDs.

How to get a set of unique IDs from all these files?

My current code looks like this:

import csv
import glob

kk = glob.glob('C://Folder_with_all_txt_files/*')
ID_set = set()
for source in kk:
    a = []
    csvReader = csv.reader(open(source, 'rt'))
    for row in csvReader:
        a.append(row)
    # each row comes back as a one-element list; keep just the ID string
    for i in xrange(len(a)):
        a[i] = a[i][0]
    s = set(a)
    ID_set = ID_set.union(s)
    del a, s

My questions about this code:

Is there a more efficient way to do this task?

Also, is it possible to use all CPU cores in this task?
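For the multi-core part, here is only a rough, untested sketch of one possibility (the helper name read_ids is just illustrative): let a multiprocessing.Pool build a set per file and merge the per-file sets in the parent process. Whether this helps at all depends on whether the job is bound by disk I/O rather than by CPU.

import glob
import multiprocessing

def read_ids(filename):
    # build a per-file set; each line in the file holds one ID
    with open(filename) as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    filenames = glob.glob('C://Folder_with_all_txt_files/*')
    id_set = set()
    pool = multiprocessing.Pool()  # one worker per CPU core by default
    try:
        # merge each per-file set into the global set as workers finish
        for file_ids in pool.imap_unordered(read_ids, filenames):
            id_set.update(file_ids)
    finally:
        pool.close()
        pool.join()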

Upvotes: 1

Views: 192

Answers (2)

bgusach

Reputation: 15205

This approach may be a little slower than Raymond's, but it avoids loading whole files into memory at once:

import glob

ids = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        for id_ in f:
            # one ID per line; strip the trailing newline
            ids.add(id_.strip())

Upvotes: 0

Raymond Hettinger

Reputation: 226734

Some thoughts:

  • Skip the creation of set s. Just update the ID_set directly.
  • Depending on what the files look like, you can just use read() and str.split() rather than the CSV reader.

Perhaps something like this will work for your dataset:

import glob

id_set = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        # read the whole file; split() yields one ID per whitespace-separated token
        ids = f.read().split()
        id_set.update(ids)

Upvotes: 1
