Reputation: 3
I'm a beginner at this and I wrote a program that generates a wordlist following a specific algorithm. The problem is that it produces duplicates.
So I'm looking for a way to make the code iterate through the given range, or the given number of words to generate, without duplicating any words.
OR write another program that goes through the wordlist the first program made and deletes any duplicated words in that file, which is going to take time but is worth it.
The generated words should look like this one, X4K7GB9y: 8 characters in length, following the rule [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]. The code is this:
import random
import string

random.seed(0)

NUM_WORDS = 100000000

with open("wordlist.txt", "w", encoding="utf-8") as ofile:
    for _ in range(NUM_WORDS):
        uppc = random.sample(string.ascii_uppercase, k=4)
        lowc = random.sample(string.ascii_lowercase, k=1)
        digi = random.sample(string.digits, k=3)
        word = uppc[0] + digi[0] + uppc[1] + digi[1] + uppc[2] + uppc[3] + digi[2] + lowc[0]
        print(word, file=ofile)
I'd appreciate it if you could modify the code so it doesn't produce duplicates, or write another script that checks the wordlist for duplicates and deletes them. Thank you so much in advance.
Upvotes: -1
Views: 72
Reputation: 118
The program below generates unique values following the given rule and also writes them into a text file.
This first block of code creates the unique values:
import random

n = 100
l = []
for i in range(n):
    # [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]
    word = (chr(random.randint(65, 90)) + str(random.randint(0, 9))
            + chr(random.randint(65, 90)) + str(random.randint(0, 9))
            + chr(random.randint(65, 90)) + chr(random.randint(65, 90))
            + str(random.randint(0, 9)) + chr(random.randint(65, 90)).lower())
    l.append(word)
finallist = list(set(l))  # drop duplicates (may leave fewer than n words)
And the code below writes the result into a file.
with open("Uniquewords.txt", "w") as f:
for i in finallist:
f.write(i)
f.write("\n")
f.close()
Upvotes: 0
Reputation: 5372
Here is a possible solution using a set() to deduplicate the words as they are generated:
import random
import string

random.seed(0)

words_count = 100_000_000
words = set()

while len(words) < words_count:
    u = random.sample(string.ascii_uppercase, k=4)
    l = random.sample(string.ascii_lowercase, k=1)
    d = random.sample(string.digits, k=3)
    words.add(f'{u[0]}{d[0]}{u[1]}{d[1]}{u[2]}{u[3]}{d[2]}{l[0]}')

with open('wordlist.txt', 'w', encoding='utf-8') as f:
    print(*words, file=f, sep='\n')
Bear in mind that it will take lots of memory and a long time to generate a hundred million random words.
Upvotes: 0
Reputation: 51683
You can prevent duplicate words from the get-go by remembering what you have already created and not writing it again.
This needs a bit of memory to hold 100,000,000 8-character words; you can lessen that by only remembering the hashes of the words. You will lose a few candidate words to hash collisions, but with about 26**5 * 10**3 = 11,881,376,000
possible combinations you should be fine.
import random
import string

random.seed(0)

NUM_WORDS = 100  # reduced for testing purposes

found = 0
words = set()

with open("wordlist.txt", "w", encoding="utf-8") as ofile:
    while found < NUM_WORDS:
        # get 5 upper case letters, use the 5th as .lower()
        l = random.sample(string.ascii_uppercase, k=5)
        d = random.sample(string.digits, k=3)
        word = l[0] + d[0] + l[1] + d[1] + l[2] + l[3] + d[2] + l[4].lower()
        if hash(word) in words:
            continue
        print(word, file=ofile)
        words.add(hash(word))
        found += 1
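As a rough sanity check (not part of the solution), here is a back-of-the-envelope estimate of how many duplicates to expect in the first place. It uses the 26**5 * 10**3 figure from above and the standard birthday approximation n**2 / (2 * N), and it assumes every combination is equally likely, ignoring that random.sample draws without replacement:
N = 26**5 * 10**3    # possible words under the pattern, about 11.9 billion
n = 100_000_000      # number of words the question asks for

# Birthday approximation: expected number of colliding pairs when drawing
# n values uniformly from N possibilities (valid while n is much smaller than N).
expected_duplicates = n * n / (2 * N)
print(f"{expected_duplicates:,.0f}")  # roughly 420,000
So with a hundred million words, a few hundred thousand duplicates are expected, which is why the membership check above is needed.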
Upvotes: 0
Reputation: 11
Given that your algorithm creates a list of words (unique or not), you can use a set to retain only the unique words; look at the example below.
word_list = ["word1", "word2", "word3", "word1"]
unique_words = set(word_list)
unique_words is then a set that contains only "word1", "word2" and "word3"; wrap it in list() if you need a list again.
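If you want to apply the same idea to the file that the first program already produced (the second option in the question), a minimal sketch could look like this. It assumes the input is wordlist.txt with one word per line; the output file name wordlist_unique.txt is just a placeholder:
# Read the generated wordlist, keep only the unique words, and write them
# to a new file. The set membership test also preserves the original order.
seen = set()

with open("wordlist.txt", "r", encoding="utf-8") as infile, \
        open("wordlist_unique.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            outfile.write(word + "\n")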
Upvotes: 1