Reputation: 3
I'm a beginner at this and I wrote a program that generates a wordlist following a specific algorithm. The problem is that it produces duplicates.
So I'm looking for a way to make the code iterate through the given range, or the given number of words to generate, without duplicating any words.
OR write another program that goes through the wordlist the first program made and deletes any duplicated words in that file, which is going to take time but is worth it.
The generated words should look like this one, X4K7GB9y: 8 characters in length, following the rule [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]. The code is this:
import random
import string

random.seed(0)

NUM_WORDS = 100000000

with open("wordlist.txt", "w", encoding="utf-8") as ofile:
    for _ in range(NUM_WORDS):
        uppc = random.sample(string.ascii_uppercase, k=4)
        lowc = random.sample(string.ascii_lowercase, k=1)
        digi = random.sample(string.digits, k=3)
        word = uppc[0] + digi[0] + uppc[1] + digi[1] + uppc[2] + uppc[3] + digi[2] + lowc[0]
        print(word, file=ofile)
I'd appreciate it if you could modify the code so it doesn't produce duplicates, or write another script that checks the wordlist for duplicates and deletes them. Thank you so much in advance.
Upvotes: -1
Views: 72
Reputation: 118
The program below generates unique values following the given rule and also writes them into a text file.
This first block of code creates the unique values:
import random

n = 100
l = []
for i in range(n):
    # [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]
    word = (chr(random.randint(65, 90)) + str(random.randint(0, 9))
            + chr(random.randint(65, 90)) + str(random.randint(0, 9))
            + chr(random.randint(65, 90)) + chr(random.randint(65, 90))
            + str(random.randint(0, 9)) + chr(random.randint(65, 90)).lower())
    l.append(word)
finallist = list(set(l))  # drop duplicates (may leave fewer than n words)
And the code below writes the result into a file.
with open("Uniquewords.txt", "w") as f:
for i in finallist:
f.write(i)
f.write("\n")
f.close()
Upvotes: 0
Reputation: 5372
Here is a possible solution using a set() to deduplicate the words as they are generated:
import random
import string

random.seed(0)

words_count = 100_000_000
words = set()

while len(words) < words_count:
    u = random.sample(string.ascii_uppercase, k=4)
    l = random.sample(string.ascii_lowercase, k=1)
    d = random.sample(string.digits, k=3)
    words.add(f'{u[0]}{d[0]}{u[1]}{d[1]}{u[2]}{u[3]}{d[2]}{l[0]}')

with open('wordlist.txt', 'w', encoding='utf-8') as f:
    print(*words, file=f, sep='\n')
Bear in mind that it will take lots of memory and a long time to generate a hundred million random words.
Upvotes: 0
Reputation: 51683
You can prevent duplicate words from the get-go by remembering what you have already created and not writing it again.
This needs a bit of memory to hold 100,000,000 8-character words; you can lessen that by only remembering the hashes of the words. You will lose a few candidate words to hash collisions, but with about 26**5 * 10**3 = 11,881,376,000
possible combinations you should be fine.
import random
import string

random.seed(0)

NUM_WORDS = 100  # reduced for testing purposes

found = 0
words = set()

with open("wordlist.txt", "w", encoding="utf-8") as ofile:
    while found < NUM_WORDS:
        # get 5 upper case letters, use the 5th as .lower()
        l = random.sample(string.ascii_uppercase, k=5)
        d = random.sample(string.digits, k=3)
        word = l[0] + d[0] + l[1] + d[1] + l[2] + l[3] + d[2] + l[4].lower()
        if hash(word) in words:
            continue
        print(word, file=ofile)
        words.add(hash(word))
        found += 1
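As a rough sanity check (not part of the solution), here is a back-of-the-envelope estimate of how many duplicates to expect in the first place. It uses the 26**5 * 10**3 figure from above and the standard birthday approximation n**2 / (2 * N), and it assumes every combination is equally likely, ignoring that random.sample draws without replacement:
N = 26**5 * 10**3    # possible words under the pattern, about 11.9 billion
n = 100_000_000      # number of words the question asks for

# Birthday approximation: expected number of colliding pairs when drawing
# n values uniformly from N possibilities (valid while n is much smaller than N).
expected_duplicates = n * n / (2 * N)
print(f"{expected_duplicates:,.0f}")  # roughly 420,000
So with a hundred million words, a few hundred thousand duplicates are expected, which is why the membership check above is needed.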
Upvotes: 0
Reputation: 11
Given that your algorithm creates a list of words (unique or not), you can use a set to retain only the unique words; look at the example below.
word_list = ["word1", "word2", "word3", "word1"]
unique_words = set(word_list)
unique_words is then a set that contains only "word1", "word2" and "word3"; wrap it in list() if you need a list again.
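If you want to apply the same idea to the file that the first program already produced (the second option in the question), a minimal sketch could look like this. It assumes the input is wordlist.txt with one word per line; the output file name wordlist_unique.txt is just a placeholder:
# Read the generated wordlist, keep only the unique words, and write them
# to a new file. The set membership test also preserves the original order.
seen = set()

with open("wordlist.txt", "r", encoding="utf-8") as infile, \
        open("wordlist_unique.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            outfile.write(word + "\n")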
Upvotes: 1