Reputation: 4618
I have a dictionary as follows.
{'data mining': ['data', 'text mining', 'artificial intelligence'],
'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
'data': [ 'text mining', 'artificial intelligence', 'data']}
I want to re-arrange the dictionary in the following way, i.e. remove entries that have identical values, keeping the longest key.
{'data mining': ['data', 'text mining', 'artificial intelligence'],
'neural networks': ['cnn', 'rnn', 'artificial intelligence']}
In other words, both data mining
and data
have identical values. Therefore, I remove one entry and make the longest word the key of the new entry, i.e. 'data mining': ['data', 'text mining', 'artificial intelligence']
.
My current code is as follows.
import collections

compare = lambda x, y: collections.Counter(x) == collections.Counter(y)

myresults = {}
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': ['text mining', 'artificial intelligence', 'data']}

for key1, value1 in mydata.items():
    # Keep the longest key among all entries whose values match value1.
    best = key1
    for key2, value2 in mydata.items():
        if compare(value1, value2) and len(key2) > len(best):
            best = key2
    myresults[best] = value1
print(myresults)
However, my real dictionary dataset has around 4 million entries. So, I am wondering if there is an efficient way of doing this in Python.
I am happy to provide more details if needed :)
Upvotes: 1
Views: 109
Reputation: 11781
Python built-in types to the rescue!
tmp = dict()
for topic, words in data.items():
    # frozenset is hashable, so identical word lists (in any order)
    # map to the same dictionary key.
    ww = frozenset(words)
    # For identical word sets, keep the longest topic as the key.
    tmp[ww] = max(tmp.get(ww, topic), topic, key=len)
result = {topic: list(ww) for ww, topic in tmp.items()}
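For reference, here is the same snippet run end to end on the question's data (with the input bound to `data`, as the answer assumes); note that `list(frozenset(...))` does not preserve the original word order:

```python
data = {'data mining': ['data', 'text mining', 'artificial intelligence'],
        'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
        'data': ['text mining', 'artificial intelligence', 'data']}

tmp = dict()
for topic, words in data.items():
    # frozenset is hashable and order-insensitive, so 'data mining'
    # and 'data' produce the same key here.
    ww = frozenset(words)
    # Keep whichever topic is longest for this set of words.
    tmp[ww] = max(tmp.get(ww, topic), topic, key=len)

result = {topic: list(ww) for ww, topic in tmp.items()}
print(sorted(result))  # ['data mining', 'neural networks']
```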
Upvotes: 1
Reputation: 16623
You can sort the items so that entries with identical (sorted) values end up adjacent, with the longest key first within each group, and then take the first item of each group.
from itertools import groupby

d = {
    "data mining": ["data", "text mining", "artificial intelligence"],
    "neural networks": ["cnn", "rnn", "artificial intelligence"],
    "data": ["text mining", "artificial intelligence", "data"],
}

result = dict(
    # groupby only merges consecutive items, so sort by the value list
    # first; -len(key) puts the longest key at the front of each group.
    g
    for k, (g, *_) in groupby(
        sorted(d.items(), key=lambda x: (sorted(x[1]), -len(x[0]))),
        key=lambda x: sorted(x[1]),
    )
)
It's also only one line, which is always nice! :)
Printing result yields:
{'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
'data mining': ['data', 'text mining', 'artificial intelligence']}
Upvotes: 2
Reputation: 7361
This should be faster than comparing every element to every other element as in your current code.
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': ['text mining', 'artificial intelligence', 'data']}

compared_values = set()
referencekeys = {}
myresults = {}
# A sorted tuple is an order-independent fingerprint of the value list.
# (Joining the sorted strings into one string instead would let
# e.g. ['ab', 'c'] and ['a', 'bc'] collide.)
comparator = lambda x: tuple(sorted(x))

for key, value in mydata.items():
    compvalue = comparator(value)
    if compvalue not in compared_values:
        compared_values.add(compvalue)
        referencekeys[compvalue] = key
        myresults[key] = value
    else:
        if len(key) > len(referencekeys[compvalue]):
            myresults[key] = myresults.pop(referencekeys[compvalue])
            referencekeys[compvalue] = key
print(myresults)
Here I define a comparator that sorts the strings in the list values and uses the resulting tuple as a fingerprint. I am not sure if it is more efficient than yours, which uses Counters.
I loop over the dictionary once and store the fingerprints generated by the comparator in a set()
. On each iteration of the loop I check whether the new fingerprint is already in the set. If not, I add it to the set for future reference and add the key-value pair to the final result dictionary. Otherwise I compare the key lengths and change the key of the dict as shown here if the new key is longer. I also need another dictionary, referencekeys, in which I switch key and compvalue (the compvalues are the keys and the keys are the values) in order to keep track of which key belongs to each compared value.
This should be faster (I did not check the time) because I have a single loop. The equivalent of your second loop is the compvalue not in compared_values
check, and a set
membership test is more efficient than a for
loop for this kind of job.
Give it a try and see if it helps.
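As a sanity check, here is a self-contained sketch comparing a pairwise O(n²) version in the spirit of the question's code with the single-pass approach on the example data (variable names here are illustrative):

```python
import collections

mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': ['text mining', 'artificial intelligence', 'data']}

# Pairwise O(n^2) version: for each entry, keep the longest key
# among all entries with matching values.
compare = lambda x, y: collections.Counter(x) == collections.Counter(y)
quadratic = {}
for key1, value1 in mydata.items():
    best = key1
    for key2, value2 in mydata.items():
        if compare(value1, value2) and len(key2) > len(best):
            best = key2
    quadratic[best] = value1

# Single-pass version: one dictionary lookup per entry instead of
# a full scan over all other entries.
referencekeys, single = {}, {}
for key, value in mydata.items():
    compvalue = tuple(sorted(value))
    if compvalue not in referencekeys:
        referencekeys[compvalue] = key
        single[key] = value
    elif len(key) > len(referencekeys[compvalue]):
        single[key] = single.pop(referencekeys[compvalue])
        referencekeys[compvalue] = key

# The surviving keys agree; value lists may differ only in order.
print(set(quadratic) == set(single))  # True
```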
EDIT
Another similar idea, which does not use a set
, just came to my mind.
referencekeys = {}
myresults = {}
# Sorted tuple as an order-independent fingerprint of the value list.
comparator = lambda x: tuple(sorted(x))

for key, value in mydata.items():
    compvalue = comparator(value)
    try:
        if len(key) > len(referencekeys[compvalue]):
            myresults[key] = myresults.pop(referencekeys[compvalue])
            referencekeys[compvalue] = key
    except KeyError:
        # First time this fingerprint is seen: just record it.
        referencekeys[compvalue] = key
        myresults[key] = value
print(myresults)
Here I just try the if
statement. If referencekeys[compvalue]
throws a KeyError
, it means the code has not yet seen a similar value; otherwise it checks the key lengths.
Again, I did not check the execution times, so I am not sure which version is more efficient. But the result is the same.
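The same check can also be written without try/except, using a single dict.get lookup per entry (a sketch, not part of the original answer):

```python
referencekeys = {}
myresults = {}

mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': ['text mining', 'artificial intelligence', 'data']}

for key, value in mydata.items():
    compvalue = tuple(sorted(value))
    # get() returns None for an unseen fingerprint, replacing the
    # try/except with one lookup.
    previous = referencekeys.get(compvalue)
    if previous is None:
        referencekeys[compvalue] = key
        myresults[key] = value
    elif len(key) > len(previous):
        myresults[key] = myresults.pop(previous)
        referencekeys[compvalue] = key

print(sorted(myresults))  # ['data mining', 'neural networks']
```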
EDIT 2
Following a comment request: to keep empty lists as they are, it is enough to wrap the body of the loop in an if
statement (here I use the first version, but the same idea works for the second).
for key, value in mydata.items():
    if len(value) > 0:
        compvalue = comparator(value)
        if compvalue not in compared_values:
            compared_values.add(compvalue)
            referencekeys[compvalue] = key
            myresults[key] = value
        else:
            if len(key) > len(referencekeys[compvalue]):
                myresults[key] = myresults.pop(referencekeys[compvalue])
                referencekeys[compvalue] = key
    else:
        myresults[key] = value
There is no need to store the key in referencekeys
if len(value) == 0
. If the original data mydata
is a single dictionary, its keys are unique, so it is guaranteed that you are not overwriting anything.
For example, with
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': ['text mining', 'artificial intelligence', 'data'],
          'data bis': [],
          'neural link': []}
you will get:
myresults = {'data mining': ['data', 'text mining', 'artificial intelligence'],
             'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
             'data bis': [],
             'neural link': []}
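Putting EDIT 2 together as a self-contained script with the empty-list example above (using a sorted-tuple fingerprint for the comparator):

```python
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': ['text mining', 'artificial intelligence', 'data'],
          'data bis': [],
          'neural link': []}

compared_values = set()
referencekeys = {}
myresults = {}
comparator = lambda x: tuple(sorted(x))

for key, value in mydata.items():
    if len(value) > 0:
        compvalue = comparator(value)
        if compvalue not in compared_values:
            compared_values.add(compvalue)
            referencekeys[compvalue] = key
            myresults[key] = value
        else:
            if len(key) > len(referencekeys[compvalue]):
                myresults[key] = myresults.pop(referencekeys[compvalue])
                referencekeys[compvalue] = key
    else:
        # Empty lists are kept untouched; dict keys are unique,
        # so nothing is overwritten here.
        myresults[key] = value

print(sorted(myresults))
```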
Upvotes: 1