Brad

Reputation: 12262

Commas in preg_match string of words

I store a list of bad words that I want to filter posts against before saving them to the database.

I store the bad words in an array that I implode with a pipe delimiter to do the check once.

$bad_words_regex = "/\b" . implode('|', config_item('bad_words')) . "\b/";

if( preg_match(strtolower($bad_words_regex), strtolower(trim($message))) == FALSE ) {
    // save to database
}

I noticed messages with commas did not get saved to the database. I imagine there are other characters I should check for (-, _, @, #).

I need to modify the first line so that it doesn't report a match just because a message contains a comma, or any other character likely to cause the same problem.

UPDATED with an example post that does not save and the array of some of the bad words:

Example message that does not save to db (it contains a white space character at the end of the sentence):

This is your last chance to decide between The Car, The Personality and the Lion 

Bad words array (not a complete list)

//bad words array
$config['bad_words'] = array(
    '2g1c',
    '2 girls 1 cup',
    'acrotomophilia',
    'anal',
    'anilingus',
    'Split',
    'anus',
    'arsehole',
    'ass',
    'asshole',
    'assmunch',
    'auto erotic',
    'autoerotic',
    'babeland',
    'baby batter',
    'ball gag',
    'ball gravy',
    'ball kicking'
);

UPDATE: I found two false matches: pis (inside episode) and trio (inside patriot). I need help modifying the regex so that it matches whole words only, not fragments of words.

Upvotes: 0

Views: 791

Answers (4)

Zac

Reputation: 1009

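As @ridgerunner mentioned in the comments to your question, the regex "or" operator (|) requires parentheses surrounding the list of words. Alternation has the lowest precedence, so without a group the leading \b applies only to the first word and the trailing \b only to the last; every word in between matches anywhere, even inside a longer word.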

For example, your current regex looks like:

/\bword1|word2|word3\b/

It should be:

/\b(word1|word2|word3)\b/

To make that work with your PHP code, do something like this:

$bad_words_regex = "/\b(" . implode('|', config_item('bad_words')) . ")\b/";
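
To see the difference, here is a minimal sketch rather than a drop-in fix. The three-word $bad_words array is a hypothetical stand-in for config_item('bad_words'), and the preg_quote() call, the non-capturing group (?:...) and the i modifier are my additions, not part of the original code. The i modifier replaces the strtolower() calls; lowercasing the pattern itself is risky in general, since it would turn an escape such as \B into \b.

```php
<?php
// Hypothetical stand-in for config_item('bad_words').
$bad_words = array('pis', 'trio', 'ass');

// Escape each word so any regex metacharacters in it are treated literally.
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $bad_words);

// Without a group: \b binds only to the first and last alternatives.
$ungrouped = '/\b' . implode('|', $escaped) . '\b/i';      // /\bpis|trio|ass\b/i

// With a group: \b applies to every alternative.
$grouped = '/\b(?:' . implode('|', $escaped) . ')\b/i';    // /\b(?:pis|trio|ass)\b/i

$message = 'A patriot watched the episode';

var_dump(preg_match($ungrouped, $message)); // int(1): "trio" matches inside "patriot"
var_dump(preg_match($grouped, $message));   // int(0): no bad word stands alone
```

With the grouped pattern, a comma after a word does not block or trigger a match on its own, because \b sits between any word character and any non-word character.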

Upvotes: 1

Quixrick

Reputation: 3200

Since your words are in an array, you can use PHP's built-in in_array function. Combined with a basic regex to strip punctuation, I think that can get you what you want.

// SET THE DEFAULTS
$sentence = 'The foxes, birds, and leopard-owls live in the forest.';
$bad_words = array('forest', 'lake', 'meadow');
$bad_word_found = false;


// REMOVE PUNCTUATION & LOWERCASE
// "the foxes birds and leopard-owls live in the forest"
$sentence_scrub = trim(strtolower(preg_replace('/[^A-Z0-9 -]/i', '', $sentence)));


// SPLIT THE SENTENCE INTO CHUNKS
$sentence_bits = explode(' ', $sentence_scrub);


// LOOP THROUGH THE ARRAY AND CHECK TO SEE IF ANY OF THE 
// - WORDS APPEAR IN THE BAD WORD ARRAY
foreach ($sentence_bits AS $potential_bad_word) {

    if (in_array($potential_bad_word, $bad_words)) {
        $bad_word_found = true;
    }

}


if ($bad_word_found) {
    // DO SOMETHING HERE
}
else {
    // GO AHEAD AND WRITE TO THE DB
}
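
One caveat with splitting on spaces, sketched below under a few assumptions: has_bad_token() is a hypothetical helper name, the two-entry list is a sample taken from the question's array, and array_intersect() is swapped in for the foreach/in_array loop. Whole-word checks work, but a multi-word entry such as 'ball gag' can never equal a single token, so phrases need separate handling.

```php
<?php
// Sample entries from the question's list (not the full array).
$bad_words = array('anal', 'ball gag');

// Hypothetical helper: true if any whole token of $message is in $bad_words.
function has_bad_token($message, $bad_words) {
    // Strip punctuation (keeping spaces and hyphens), lowercase, trim.
    $scrubbed = trim(strtolower(preg_replace('/[^A-Z0-9 -]/i', '', $message)));
    $tokens = explode(' ', $scrubbed);
    return count(array_intersect($tokens, $bad_words)) > 0;
}

var_dump(has_bad_token('An analysis, please.', $bad_words)); // bool(false): "analysis" is not "anal"
var_dump(has_bad_token('Anal, no.', $bad_words));            // bool(true): comma stripped, token matches
var_dump(has_bad_token('a ball gag here', $bad_words));      // bool(false): the phrase spans two tokens
```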

Upvotes: 0

Alvin S. Lee

Reputation: 5182

Using your code, it worked for me. That is, your example message does get saved to the db.

Here's what I have:

// Set up array of bad words in $config['bad_words']
// $config['bad_words'] = array(
//   ...
// );

$imploded = implode('|', $config['bad_words']);
print "IMPLODED ARRAY: $imploded\n\n";

$bad_words_regex = "/\b$imploded\b/";
print "REGULAR EXPRESSION: $bad_words_regex\n\n";

$message = 'This is your last chance to decide between The Car, The Personality and the Lion ';
if (preg_match(strtolower($bad_words_regex), strtolower(trim($message))) == FALSE ) {
  print "SAVE\n";
}
else {
  print "DO NOT SAVE\n";
}

I'm calling $config['bad_words'] directly when imploding, not calling config_item.

Not sure if the modified code above, with all those print statements, might point you in the right direction.

Upvotes: 0

Vasili Syrakis

Reputation: 9601

I notice that you have included the word-boundary assertion \b in your code. I presume that you wrap these tokens around your bad_words...

The problem might be that the \b tokens are not matching where you expect. In badwordz, for example, there is no word boundary after badword, because the next character (z) is also a word character; that position is a non-word boundary (\B).

You may have to experiment with different word boundaries, such as whitespace, if that is appropriate.
I would need a better look at the content you are applying your regex to, in order to craft a better expression.
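
As a rough illustration of where \b does and does not match, here is a small sketch. The word forest and the sample strings are made up for the demonstration. In PCRE, word characters are letters, digits and the underscore, so a comma, hyphen, @ or # next to a word still counts as a boundary, while an underscore does not.

```php
<?php
$pattern = '/\bforest\b/';

$plural     = preg_match($pattern, 'forests');      // 0: "s" is a word character, no boundary
$comma      = preg_match($pattern, 'forest, lake'); // 1: boundary between "t" and ","
$hyphen     = preg_match($pattern, 'forest-fire');  // 1: "-" is not a word character
$underscore = preg_match($pattern, 'forest_fire');  // 0: "_" counts as a word character
$at_sign    = preg_match($pattern, 'forest@home');  // 1: "@" is not a word character

var_dump($plural, $comma, $hyphen, $underscore, $at_sign);
```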

Upvotes: 0
