Keynaan

Reputation: 69

Python remove duplicate values of one key in dict

I have a dictionary like this:

Files:
{'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 
'key2': ['f', 'f', 'f', 'f', 'f'], 
'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

I want to delete all the duplicate values in 'key_file' and the corresponding values in the other keys ('key1' and 'key2').

Desired dictionary:

Files:
{'key1': ['path1', 'path2'], 
'key2': ['f', 'f'], 
'key_file': ['file1', 'file2']}

I couldn't figure out a solution that preserves the order and deletes every duplicate item along with its values in the other keys.

Thanks a lot.

EDIT:

'key2': ['f', 'f', 'f', 'f', 'f']

becomes

'key2': ['f', 'f'],

because there are two distinct files.

I don't want to delete every duplicate in every key. 'path1' is related to 'file1' and 'path2' is related to 'file2', as is the 'f' in key2 in both cases. In reality there are several more keys, but this is my minimal example. That is my problem: the solutions I have found delete every duplicate.

EDIT2:

Maybe I was a bit confusing.

Every key has the same length, as they describe a filename (in key_file), the corresponding path (in key1) and some other descriptive strings (in key2, etc.). The same file can be stored in different locations (paths), but I know that it is the same file if the filename is exactly the same.

Basically, what I am looking for is a function that detects the second value of key_file (file1) as a duplicate of the first value (file1) and deletes the second value from every key. The same applies to value number 4 (file1) and value number 5 (file2). The resulting dictionary would then look like the one I mentioned.

I hope this explains it better.

Upvotes: 0

Views: 2726

Answers (5)

user4450405

Reputation:

Here is my implementation:

In [1]: mydict = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 'key2': ['f', 'f', 'f', 'f', 'f'], 'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

In [2]: { k: sorted(list(set(v))) for (k,v) in mydict.iteritems() }
Out[2]: {'key1': ['path1', 'path2'], 'key2': ['f'], 'key_file': ['file1', 'file2']}

Test

In [6]: mydict
Out[6]:
{'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
 'key2': ['f', 'f', 'f', 'f', 'f'],
 'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

In [7]: uniq = { k: sorted(list(set(v))) for (k,v) in mydict.iteritems() }

In [8]: for key in uniq:
   ...:     print 'KEY    :', key
   ...:     print 'VALUE  :', uniq[key]
   ...:     print '-------------------'
   ...: 
KEY    : key2
VALUE  : ['f']
-------------------
KEY    : key1
VALUE  : ['path1', 'path2']
-------------------
KEY    : key_file
VALUE  : ['file1', 'file2']
-------------------

Upvotes: 0

fredtantini

Reputation: 16556

A naive approach: iterate over the values of 'key_file' and, for each value not seen before, append the corresponding item of every list to a new dict:

>>> newFiles={'key1': [], 'key2':[], 'key_file':[]}
>>> for i,j in enumerate(Files['key_file']):
...   if j not in newFiles['key_file']:
...      for key in newFiles.keys():
...         newFiles[key].append(Files[key][i])
...
>>> newFiles
{'key2': ['f', 'f'], 'key1': ['path1', 'path2'], 'key_file': ['file1', 'file2']}
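
The same loop can be wrapped in a reusable function. This is only a minimal sketch, assuming Python 3.7+ and that every list has the same length; the function name `dedupe_by_key` and the `seen` set are illustrative and not part of the original answer. The set is there so that each membership test does not have to scan the growing result list.

```python
def dedupe_by_key(data, ref_key):
    """Keep only the rows whose value under ref_key has not been seen before."""
    seen = set()
    result = {key: [] for key in data}
    for i, value in enumerate(data[ref_key]):
        if value in seen:
            continue  # duplicate in the reference list: skip the whole row
        seen.add(value)
        for key in data:
            result[key].append(data[key][i])
    return result


Files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

newFiles = dedupe_by_key(Files, 'key_file')
print(newFiles)
# {'key1': ['path1', 'path2'], 'key2': ['f', 'f'], 'key_file': ['file1', 'file2']}
```

Because the keys are taken from the input dict, this also works when there are more than three keys.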

with OrderedDict:

>>> from collections import OrderedDict
>>> newFiles={'key1': [], 'key2':[], 'key_file':[]}
>>> for j in OrderedDict.fromkeys(Files['key_file']):
...   i = Files['key_file'].index(j)
...   if j not in newFiles['key_file']:
...     for key in newFiles.keys():
...       newFiles[key].append(Files[key][i])
...
>>> newFiles
{'key2': ['f', 'f'], 'key1': ['path1', 'path2'], 'key_file': ['file1', 'file2']}

Note: if a "file" in key_file always has the same 'key1' and 'key2' values, there are better ways. For instance, using zip:

>>> z=zip(*Files.values())
>>> z
[('f', 'path1', 'file1'), ('f', 'path1', 'file1'), ('f', 'path2', 'file2'), ('f', 'path1', 'file1'), ('f', 'path2', 'file2')]
>>> OrderedDict.fromkeys(z)
OrderedDict([(('f', 'path1', 'file1'), None), (('f', 'path2', 'file2'), None)])
>>> list(OrderedDict.fromkeys(z))
[('f', 'path1', 'file1'), ('f', 'path2', 'file2')]
>>> zip(*OrderedDict.fromkeys(z))
[('f', 'f'), ('path1', 'path2'), ('file1', 'file2')]
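
To get back from the deduplicated rows to a dict of lists, the column order has to be fixed explicitly. The following is a hedged sketch, assuming Python 3.7+ (where plain dicts keep insertion order, so `dict.fromkeys` can stand in for `OrderedDict.fromkeys`); the names `keys`, `rows`, `unique_rows` and `deduped` are illustrative. Like the zip example above, it treats two entries as duplicates only if the whole row is identical.

```python
Files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

keys = list(Files)                        # fix the column order once
rows = zip(*(Files[k] for k in keys))     # one tuple per file entry
unique_rows = list(dict.fromkeys(rows))   # drop repeated rows, keep first-seen order

# Transpose the rows back into one list per key.
deduped = {k: list(col) for k, col in zip(keys, zip(*unique_rows))}
print(deduped)
# {'key1': ['path1', 'path2'], 'key2': ['f', 'f'], 'key_file': ['file1', 'file2']}
```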

Upvotes: 2

tobias_k

Reputation: 82899

As I understand the question, it seems that corresponding values in the different lists in the dictionary belong together, while values within the same list are unrelated to each other. In this case, I'd suggest using a different data structure. Instead of having a dictionary with three lists of items, you can make one list holding triplets.

>>> files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 
             'key2': ['f', 'f', 'f', 'f', 'f'], 
             'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}
>>> files2 = set(zip(files["key1"], files["key2"], files["key_file"]))
>>> print files2
set([('path2', 'f', 'file2'), ('path1', 'f', 'file1')])

Or if you want to make it more dictionary-like, you could do this, afterwards:

>>> files3 = [{"key1": k1, "key2": k2, "key_file": kf} for k1, k2, kf in files2]
>>> print files3
[{'key2': 'f', 'key1': 'path2', 'key_file': 'file2'}, 
 {'key2': 'f', 'key1': 'path1', 'key_file': 'file1'}]

Note that the order of the triplets in the top-level list may be different, but items that belong together are still together in the contained tuples or dictionaries.
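
If the original order matters, the set can be swapped for a plain loop that remembers which filenames it has already seen. This is a sketch of that variation rather than the approach shown above: it assumes Python 3, deduplicates on the filename only (as the question's second edit describes), and the names `seen` and `records` are illustrative.

```python
files = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
         'key2': ['f', 'f', 'f', 'f', 'f'],
         'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

seen = set()
records = []
for path, flag, name in zip(files["key1"], files["key2"], files["key_file"]):
    if name not in seen:  # first time this filename appears
        seen.add(name)
        records.append({"key1": path, "key2": flag, "key_file": name})

print(records)
# [{'key1': 'path1', 'key2': 'f', 'key_file': 'file1'},
#  {'key1': 'path2', 'key2': 'f', 'key_file': 'file2'}]
```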

Upvotes: 0

Kasravnd

Reputation: 107287

You can use collections.OrderedDict to keep your dictionary in order, and set to remove the duplicates:

>>> d={'key1': ['path1', 'path1', 'path2', 'path1', 'path2'], 
... 'key2': ['f', 'f', 'f', 'f', 'f'], 
... 'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}
>>> from collections import OrderedDict
>>> OrderedDict(sorted([(i,list(set(j))) for i,j in d.items()], key=lambda t: t[0]))
OrderedDict([('key1', ['path2', 'path1']), ('key2', ['f']), ('key_file', ['file2', 'file1'])])

You need to apply set to the values to remove duplicates, then sort the items by key, and finally use OrderedDict to keep the dictionary in that sorted order.

Edit: if you want all values to have the same length as the longest value, use the following:

>>> s=sorted([(i,list(set(j))) for i,j in d.items()], key=lambda t: t[0])
>>> M=max(map(len,[i[1] for i in s]))
>>> f_s=[(i,j) if len(j)==M else (i,[j[0] for t in range(M)]) for i,j in s]
>>> f_s
[('key1', ['path2', 'path1']), ('key2', ['f', 'f']), ('key_file', ['file2', 'file1'])]
>>> OrderedDict(f_s)
OrderedDict([('key1', ['path2', 'path1']), ('key2', ['f', 'f']), ('key_file', ['file2', 'file1'])])

But if you just want the first 2 elements of each value, you can use slicing:

>>> OrderedDict(sorted([(i,j[:2]) for i,j in d.items()],key=lambda x: x[0])
... )
OrderedDict([('key1', ['path1', 'path1']), ('key2', ['f', 'f']), ('key_file', ['file1', 'file1'])])

Upvotes: 1

Bhargav Rao

Reputation: 52071

OrderedDict is the best option, as it preserves order.

You can also convert each list to a set and then back to a list:

Example

for i in d:
    d[i] = list(set(d[i]))
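
For comparison, here is a sketch of the order-preserving variant using `OrderedDict.fromkeys` in place of `set`. It assumes Python 3, and the name `deduped` is illustrative. As with the loop above, each list is deduplicated on its own, so 'key2' collapses to a single 'f'.

```python
from collections import OrderedDict

d = {'key1': ['path1', 'path1', 'path2', 'path1', 'path2'],
     'key2': ['f', 'f', 'f', 'f', 'f'],
     'key_file': ['file1', 'file1', 'file2', 'file1', 'file2']}

# OrderedDict.fromkeys keeps the first occurrence of each value, in order.
deduped = {key: list(OrderedDict.fromkeys(values)) for key, values in d.items()}
print(deduped)
# {'key1': ['path1', 'path2'], 'key2': ['f'], 'key_file': ['file1', 'file2']}
```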

Upvotes: 1
