Chops
Chops

Reputation: 502

Replacing elements in long list Python

I am trying to replace a number of elements (around 40k) from a much larger list (3 million elements) with a tag. The code below works however it is extremely slow.

def UNKWords(words):
    words = Counter(words)
    wordsToBeReplaced = []
    for key, value in words.items():
        if(value == 1):
            wordsToBeReplaced.append(key)
    return wordsToBeReplaced

wordsToBeReplaced = UNKWords(trainingWords)

replacedWordsList = ["<UNK>" if word in wordsToBeReplaced else word for word in trainingWords]

Is there a more efficient way of replacing elements in such a large list?

Upvotes: 2

Views: 58

Answers (3)

Mark
Mark

Reputation: 92460

You can squeeze out a little more performance by avoiding the Counter and just keeping track of the words you've seen more than once. This effectively removes a loop from your function and lets you collect the words you've seen more than once in a single loop:

def UNKWords(words):
    seen = set()
    kept = set()
    for word in words:
        if word in seen:
            kept.add(word)
        else:
            seen.add(word)
    return kept

def replaceWords(words):   
    wordsToKeep = UNKWords(trainingWords)
    return ["<UNK>" if word not in wordsToKeep else word for word in trainingWords]

…and then use sets instead of lists to test for membership, as others have mentioned, since these allow constant time membership tests.

Upvotes: 2

Błotosmętek
Błotosmętek

Reputation: 12927

In addition to @blhsing's answer I suggest using a frozenset; also doing the replacement in-place (unless, of course, you need to keep the original list for other purposes):

def UNKWords(words):
    return frozenset(word for word, count in Counter(words).items() if count == 1)

wordsToBeReplaced = UNKWords(trainingWords)
for i, word in enumerate(trainingWords):
    if word in wordsToBeReplaced:
        trainingWords[i] = '<UNK>'

Upvotes: 0

blhsing
blhsing

Reputation: 106891

You can make wordsToBeReplaced a set instead so that lookups can be done in constant time on average rather than linear time:

def UNKWords(words):
    return {word for word, count in Counter(words).items() if count == 1}

Upvotes: 3

Related Questions