Kulpas

Reputation: 394

Python Faker fake.word(): Filter out profanity

I noticed that when I use the fake.word() function with the locale set to 'pl_PL', it sometimes generates a swear word, which is not ideal for me. Is there an easy way to stop Faker from outputting swear words, preferably without having to list all of the swear words myself?

Upvotes: 0

Views: 718

Answers (2)

BeRT2me

Reputation: 13242

Faker's original word list comes from Wiktionary, and the Wiktionary page has been updated since the list was made.

We can fetch the current list ourselves, though, using the Wiktionary REST API:

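import requests
from bs4 import BeautifulSoup

page = 'Indeks%3APolski_-_Najpopularniejsze_s%C5%82owa_1-2000'
method = 'html'
url = f"https://pl.wiktionary.org/api/rest_v1/page/{method}/{page}"

r = requests.get(url)
r.raise_for_status()  # stop early if the request failed
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(r.content, 'html.parser')

# Each word is a link; its 'title' attribute holds the word itself.
# title=True skips any anchor that has no title attribute.
words = [x['title'] for x in soup.find_all('a', title=True)]

print(words[:50])
print(len(words))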

Output (first 50 words, then the total count):

['w', 'z', 'być', 'na', 'i', 'do', 'nie', 'który', 'lub', 'to', 'się', 'o', 'mieć', 'coś', 'ten', 'dotyczyć', 'on', 'od', 'co', 'język', 'po', 'że', 'ktoś', 'przez', 'osoba', 'miasto', 'jeden', 'jak', 'za', 'ja', 'rok', 'a', 'bardzo', 'swój', 'dla', 'taki', 'człowiek', 'cecha', 'kobieta', 'mój', 'część', 'związany', 'móc', 'dwa', 'ona', 'związać', 'ze', 'mały', 'jakiś', 'miejsce']
2000

Then, we could replace the word list like so:

from faker.providers.lorem.pl_PL import Provider as PLProvider

PLProvider.word_list = tuple(words)

Upvotes: 2

jkr

Reputation: 19300

Unfortunately, I do not know of a way to remove swear words without typing them out.

One option is to remove the swear words from the word list of the pl_PL lorem Provider class.

from faker.providers.lorem.pl_PL import Provider as PLProvider

bad_words = ["kurwa"]
PLProvider.word_list = tuple(word for word in PLProvider.word_list if word not in bad_words)

(I use tuple here because that is the original type of word_list.)
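
The same filtering can be wrapped in a small helper. This is only a sketch: the name without_blocked is hypothetical and not part of Faker, and comparing in lowercase is an assumption about how matches should work. It runs on a stand-in word list rather than Faker's real data.

```python
def without_blocked(word_list, blocked):
    """Return word_list as a tuple with every blocked word removed.

    Comparison is case-insensitive; the original order is kept.
    """
    blocked_set = {word.lower() for word in blocked}
    return tuple(word for word in word_list if word.lower() not in blocked_set)


# Example with a stand-in word list (not Faker's real data).
sample = ("dom", "Kurwa", "kot", "kurwa")
cleaned = without_blocked(sample, ["kurwa"])
print(cleaned)
```

With Faker, the result could be assigned back the same way as above, for example PLProvider.word_list = without_blocked(PLProvider.word_list, bad_words).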

Here is a more complete code example, including an assertion that the bad words are not in the list of possible words.

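from faker import Faker
from faker.providers.lorem.pl_PL import Provider as P

# Remove unwanted words from the provider's class-level word list.
bad_words = ["kurwa"]
P.word_list = tuple(word for word in P.word_list if word not in bad_words)
del P

Faker.seed(0)
fake = Faker(locale="pl_PL")

# Draw 1999 unique words and check that the removed word never appears.
assert "kurwa" not in fake.words(1999, unique=True)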
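
A check that does not depend on sampling is to inspect the word tuple itself. A minimal sketch, assuming a hypothetical helper find_blocked (not part of Faker) and a stand-in tuple in place of the provider's word_list:

```python
def find_blocked(word_list, blocked):
    """Return the sorted blocked words that still occur in word_list."""
    return sorted(set(word_list) & set(blocked))


# Stand-in data; with Faker you would pass the provider's word_list instead.
word_list = ("dom", "kot", "rok")
leftovers = find_blocked(word_list, ["kurwa"])
assert leftovers == [], f"still present: {leftovers}"
```

Returning the offending words rather than a plain boolean makes a failed assertion easier to read.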

Upvotes: 2
