draxous

Reputation: 59

Parsing whole terms in Python/json profanity filter

I have a JSON file containing the terms that a profanity filter should check against.

["bad", "word", "plug"]

I am using this (found in another article) to parse the JSON and search any data object for the listed words.

def word_filter(self, *field_names):

    import json
    from pprint import pprint

    with open('/var/www/groupclique/website/swearWords.json') as data_file:    
        data = json.load(data_file)

    for field_name in field_names:
        for term in data:
            if term in field_name:
                self.add_validation_error(
                    field_name,
                    "%s has profanity" % field_name)


class JobListing(BaseProtectedModel):
    id = db.Column(db.Integer, primary_key=True)
    category = db.Column(db.String(255))
    job_title = db.Column(db.String(255))

    @before_flush
    def clean(self):
        self.word_filter('job_title')  

The issue is that the string "plumber" fails the check because of the word "plug" in the JSON file, since "plu" is in both terms. Is there any way to force the entire word in the JSON file to be matched instead of a partial match? The output when run:

({ "validation_errors": { "job_title": " job_title has profanity" } })

HTTP PAYLOAD:
{
    "job_title":"plumber",    
}

Upvotes: 0

Views: 532

Answers (1)

Matt Yaple

Reputation: 21

You can use str.split() to isolate the whole words in field_name. split() returns a list of the parts of the string, separated at the specified delimiter. With that list, you can check whether the profane term is one of the words:

import json

with open('terms.json') as data_file:    
    data = json.load(data_file)

for field_name in field_names:
    for term in data:
        if term in field_name.split(" "):
            self.add_validation_error(
                field_name,
                "%s has profanity" % field_name)

Where this gets dicey is punctuation and capitalization. For example, the sentence "Here comes the sun." will not match the bad word "sun", because the split yields "sun." with the trailing period, nor will it match "here", because the split yields "Here" with a capital H. To solve the capitalization problem, convert the entire input to lowercase:

if term in field_name.lower().split(" "):
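As a minimal, standalone sketch of the split-and-lowercase idea: the helper name find_profane_words is hypothetical, and it assumes you pass in the text to check (the field's value) rather than the field's name. It is not tied to the model class in the question.

```python
def find_profane_words(text, terms):
    """Return the banned terms that appear as whole, whitespace-separated words."""
    # Lowercase both sides so "Plug" and "plug" compare equal.
    banned = {term.lower() for term in terms}
    # split() with no argument splits on any run of whitespace.
    words = text.lower().split()
    return [word for word in words if word in banned]


terms = ["bad", "word", "plug"]
print(find_profane_words("plumber", terms))         # []
print(find_profane_words("Plug installer", terms))  # ['plug']
```

Because the comparison is against whole words, "plumber" no longer trips the filter. A word with attached punctuation, such as "plug.", is still missed by this version.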

Removing punctuation is a bit more involved, but this should help you implement that.
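One possible way to handle punctuation, sketched here with the standard re module, is to extract the words with a regular expression instead of splitting on spaces. The helper name find_profane_words_regex is hypothetical, and the pattern assumes that a "word" is any run of ASCII letters, digits, or apostrophes; adjust it if your input needs other characters.

```python
import re


def find_profane_words_regex(text, terms):
    """Return banned terms found as whole words, ignoring case and punctuation."""
    banned = {term.lower() for term in terms}
    # Any run of letters, digits or apostrophes counts as a word;
    # everything else (spaces, periods, commas, ...) acts as a separator.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word in banned]


print(find_profane_words_regex("Here comes the sun.", ["sun", "here"]))  # ['here', 'sun']
print(find_profane_words_regex("plumber", ["bad", "word", "plug"]))      # []
```

With this version, "sun." and "Here" both match their lowercase, punctuation-free terms, while "plumber" still passes.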

There may well be more edge cases you'll need to consider, so just a heads up on two quick ones I thought of.

Upvotes: 2
