Reputation: 502
I am trying to replace a number of elements (around 40k) from a much larger list (3 million elements) with a tag. The code below works however it is extremely slow.
def UNKWords(words):
words = Counter(words)
wordsToBeReplaced = []
for key, value in words.items():
if(value == 1):
wordsToBeReplaced.append(key)
return wordsToBeReplaced
wordsToBeReplaced = UNKWords(trainingWords)
replacedWordsList = ["<UNK>" if word in wordsToBeReplaced else word for word in trainingWords]
Is there a more efficient way of replacing elements in such a large list?
Upvotes: 2
Views: 58
Reputation: 92460
You can squeeze out a little more performance by avoiding the Counter and just keeping track of the words you've seen more than once. This effectively removes a loop from your function and lets you collect the words you've seen more than once in a single loop:
def UNKWords(words):
seen = set()
kept = set()
for word in words:
if word in seen:
kept.add(word)
else:
seen.add(word)
return kept
def replaceWords(words):
wordsToKeep = UNKWords(trainingWords)
return ["<UNK>" if word not in wordsToKeep else word for word in trainingWords]
…and then use sets instead of lists to test for membership, as others have mentioned, since these allow constant time membership tests.
Upvotes: 2
Reputation: 12927
In addition to @blhsing's answer I suggest using a frozenset; also doing the replacement in-place (unless, of course, you need to keep the original list for other purposes):
def UNKWords(words):
return frozenset(word for word, count in Counter(words).items() if count == 1)
wordsToBeReplaced = UNKWords(trainingWords)
for i, word in enumerate(trainingWords):
if word in wordsToBeReplaced:
trainingWords[i] = '<UNK>'
Upvotes: 0
Reputation: 106891
You can make wordsToBeReplaced
a set instead so that lookups can be done in constant time on average rather than linear time:
def UNKWords(words):
return {word for word, count in Counter(words).items() if count == 1}
Upvotes: 3