Artur Kharchenko

Reputation: 13

How to get a set of unique values from many lists efficiently (Python)

I have many (about 6000) text files, each containing a list of IDs (one ID per line). Each file can hold anywhere from 10,000 to 10 million IDs.

How to get a set of unique IDs from all these files?

My current code looks like this:

import csv
import glob

kk = glob.glob('C://Folder_with_all_txt_files/*')
ID_set = set()
for source in kk:
    a = []
    csvReader = csv.reader(open(source, 'rt'))
    for row in csvReader:
        a.append(row)
    # each row comes back as a one-element list; keep just the ID string
    for i in xrange(len(a)):
        a[i] = a[i][0]
    s = set(a)
    ID_set = ID_set.union(s)
    del a, s

My questions about this code:

Is there a more efficient way to do this task?

Also, is it possible to use all CPU cores in this task?
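For the multi-core part, here is only a rough, untested sketch of one possibility (the helper name read_ids is just illustrative): let a multiprocessing.Pool build a set per file and merge the per-file sets in the parent process. Whether this helps at all depends on whether the job is bound by disk I/O rather than by CPU.

import glob
import multiprocessing

def read_ids(filename):
    # build a per-file set; each line in the file holds one ID
    with open(filename) as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    filenames = glob.glob('C://Folder_with_all_txt_files/*')
    id_set = set()
    pool = multiprocessing.Pool()  # one worker per CPU core by default
    try:
        # merge each per-file set into the global set as workers finish
        for file_ids in pool.imap_unordered(read_ids, filenames):
            id_set.update(file_ids)
    finally:
        pool.close()
        pool.join()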

Upvotes: 1

Views: 192

Answers (2)

bgusach

Reputation: 15205

This approach may be a little slower than Raymond's, but it avoids loading whole files into memory at once:

import glob

ids = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        for id_ in f:
            # one ID per line; strip the trailing newline
            ids.add(id_.strip())

Upvotes: 0

Raymond Hettinger

Reputation: 226734

Some thoughts:

  • Skip the creation of set s. Just update the ID_set directly.
  • Depending on what the files look like, you can just use read() and str.split() rather than the CSV reader.

Perhaps something like this will work for your dataset:

import glob

id_set = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        # read the whole file; split() yields one ID per whitespace-separated token
        ids = f.read().split()
        id_set.update(ids)

Upvotes: 1
