Reputation: 13
I have many (about 6000) text files, each containing a list of IDs, one ID per line. Each file can contain anywhere from 10,000 to 10 million IDs.
How can I get a set of unique IDs from all these files?
My current code looks like this:
import csv
import glob

kk = glob.glob('C://Folder_with_all_txt_files/*')
ID_set = set()
for source in kk:
    a = []
    csvReader = csv.reader(open(source, 'rt'))
    for row in csvReader:
        a.append(row)
    for i in xrange(len(a)):
        a[i] = a[i][0]  # each row is a one-element list; keep only the ID
    s = set(a)
    ID_set = ID_set.union(s)
    del a, s
Problems with the current code: it is slow on this volume of data and uses only one CPU core.
Is there a more efficient way to do this task?
Also, is it possible to use all CPU cores for this task?
Upvotes: 1
Views: 192
Reputation: 15205
This approach may be a little slower than Raymond's, but it avoids loading each file into memory at once:
import glob

ids = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        for id_ in f:
            ids.add(id_.strip())
Upvotes: 0
Reputation: 226734
Some thoughts:
Perhaps something like this will work for your dataset:
import glob

id_set = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        ids = f.read().split()
        id_set.update(ids)
Upvotes: 1
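Neither answer above uses more than one core. Regarding the second part of the question: if the per-file work (rather than disk I/O) turns out to be the bottleneck, it can be spread across a process pool. This is only a minimal sketch, assuming Python 3 and reusing the folder path from the question:

import glob
from multiprocessing import Pool

def read_ids(filename):
    # Read one file and return its IDs as a set (one ID per line, as in the question).
    with open(filename) as f:
        return set(f.read().split())

if __name__ == '__main__':
    filenames = glob.glob('C://Folder_with_all_txt_files/*')
    id_set = set()
    with Pool() as pool:  # defaults to one worker process per CPU core
        for ids in pool.imap_unordered(read_ids, filenames):
            id_set.update(ids)
    print(len(id_set))

Note that reading 6000 files is often limited by the disk rather than the CPU, so it is worth measuring before assuming extra processes will help.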