Reputation: 432
I'm trying to come up with the best solution for the following problem:
I have a list of filenames, and associated with each filename is an ID; these IDs are non-unique, meaning that several filenames might be associated with one ID.
So I could pack my data up as: (ID, [filename1, filename2,...])
The problem is that I would like to work with the IDs as a set since I will need to group and extract differences and intersections with another predefined grouping of these IDs, and I need the operations to be relatively fast since I have about a million IDs.
But I don't know of a way to keep each ID associated with its list of filenames while treating the ID as an element of a set. Is this possible to do with sets, or is there any set extension that enables this?
Upvotes: 0
Views: 477
Reputation: 9379
It sounds like your data looks something like the sample data below. If so, then the code shows how to use a hash table to do what you're asking. The hash table could either be a Python dict (hashed on id as key with a list of file names as associated value) or simply a set of id elements if that's what you really want (though as others have suggested in the comments, a dict is potentially the best solution).
from collections import defaultdict

files = [
    {'filename': 'foo101', 'id': 1},
    {'filename': 'foo102', 'id': 1},
    {'filename': 'foo103', 'id': 1},
    {'filename': 'foo201', 'id': 2},
    {'filename': 'foo202', 'id': 2},
    {'filename': 'foo301', 'id': 3},
    {'filename': 'foo401', 'id': 4},
]

# Group file names by id; a missing key starts out as an empty list.
fileDict = defaultdict(list)
for d in files:
    fileDict[d['id']].append(d['filename'])

for fileId, fileNames in fileDict.items():
    print(fileId, fileNames)

# Iterating a dict yields its keys, so this is the set of ids.
idSet = set(fileDict)
print(idSet)
Sample output:
1 ['foo101', 'foo102', 'foo103']
2 ['foo201', 'foo202']
3 ['foo301']
4 ['foo401']
{1, 2, 3, 4}
The above code uses a defaultdict(list) for convenience, but you could also use a regular dict as follows:
files = [
    {'filename': 'foo101', 'id': 1},
    {'filename': 'foo102', 'id': 1},
    {'filename': 'foo103', 'id': 1},
    {'filename': 'foo201', 'id': 2},
    {'filename': 'foo202', 'id': 2},
    {'filename': 'foo301', 'id': 3},
    {'filename': 'foo401', 'id': 4},
]

fileDict = {}
for d in files:
    # Create the list the first time an id is seen.
    if d['id'] not in fileDict:
        fileDict[d['id']] = []
    fileDict[d['id']].append(d['filename'])

for fileId, fileNames in fileDict.items():
    print(fileId, fileNames)

# Iterating a dict yields its keys, so this is the set of ids.
idSet = set(fileDict)
print(idSet)
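Since the question asks for intersections and differences against another grouping of IDs, here is a minimal sketch of how that could look without building a separate set at all. It relies on the fact that dict.keys() returns a set-like view in Python 3. The name otherIds and its contents are made up for illustration; they stand in for whatever predefined grouping you have.

```python
from collections import defaultdict

# Same shape as the sample data above, shortened to (filename, id) pairs.
files = [
    ('foo101', 1), ('foo102', 1), ('foo103', 1),
    ('foo201', 2), ('foo202', 2),
    ('foo301', 3),
    ('foo401', 4),
]

fileDict = defaultdict(list)
for fileName, fileId in files:
    fileDict[fileId].append(fileName)

# Hypothetical predefined grouping of ids to compare against.
otherIds = {2, 3, 5}

# Key views support set operators directly and return ordinary sets.
common = fileDict.keys() & otherIds      # ids present in both
onlyHere = fileDict.keys() - otherIds    # ids only in fileDict

# set.difference accepts any iterable; iterating a dict yields its keys.
onlyThere = otherIds.difference(fileDict)

# The file names stay reachable through the dict for any resulting id.
commonFiles = {fileId: fileDict[fileId] for fileId in common}
print(commonFiles)
```

Each membership check is a hash lookup, so these operations should stay roughly linear in the number of IDs involved, which seems reasonable for about a million IDs.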
Upvotes: 1