Reputation: 12262
I store a list of bad words that I want to filter posts against before saving them to the database.
I store the bad words in an array that I implode with a pipe delimiter to do the check once.
$bad_words_regex = "/\b" . implode('|', config_item('bad_words')) . "\b/";
if( preg_match(strtolower($bad_words_regex), strtolower(trim($message))) == FALSE ) {
// save to database
}
I noticed messages with commas did not get saved to the database. I imagine there are other characters I should check for (-, _, @, #).
I need to modify the first line so that it doesn't match just because a message contains a comma, or any other character you think will cause the same problem.
UPDATED with an example post that does not save and the array of some of the bad words:
Example message that does not save to db (it contains a white space character at the end of the sentence):
This is your last chance to decide between The Car, The Personality and the Lion
Bad words array (not a complete list)
//bad words array
$config['bad_words'] = array(
'2g1c',
'2 girls 1 cup',
'acrotomophilia',
'anal',
'anilingus',
'Split',
'anus',
'arsehole',
'ass',
'asshole',
'assmunch',
'auto erotic',
'autoerotic',
'babeland',
'baby batter',
'ball gag',
'ball gravy',
'ball kicking'
);
UPDATE: I found two false positives: pis (matched inside episode) and trio (matched inside patriot). I need help modifying the regex so that it matches whole words only, not fragments of words.
Upvotes: 0
Views: 791
Reputation: 1009
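As @ridgerunner mentioned in the comments to your question, the regex "or" operator (|) has the lowest precedence, so the list of words needs to be wrapped in parentheses. Without them, the leading \b applies only to the first word and the trailing \b only to the last word.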
For example, your current regex looks like:
/\bword1|word2|word3\b/
It should be:
/\b(word1|word2|word3)\b/
To make that work with your PHP code, do something like this:
$bad_words_regex = "/\b(" . implode('|', config_item('bad_words')) . ")\b/";
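One further caveat, offered as a sketch rather than a definitive fix: if any entry in the list contains a regex metacharacter, it should be escaped with preg_quote before the words are joined. The helper name build_bad_words_regex and the short word list below are assumptions made for illustration; the non-capturing group (?:...) and the i flag are optional choices, not something the original code requires.

```php
<?php
// Hypothetical helper, not part of the original code.
function build_bad_words_regex(array $bad_words): string
{
    // Escape regex metacharacters in each word; '/' is the delimiter used below.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $bad_words);

    // Group the alternatives so both \b anchors apply to every word.
    // The i flag makes the match case-insensitive.
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

// Sample list for the demo; pis and trio come from the question's update.
$regex = build_bad_words_regex(array('lake', 'pis', 'trio', 'meadow'));

var_dump(preg_match($regex, 'Watch the next episode, patriot')); // int(0)
var_dump(preg_match($regex, 'What a trio!'));                    // int(1)
```

With the i flag in place, the strtolower calls around the pattern and the message would no longer be needed.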
Upvotes: 1
Reputation: 3200
Since your words are in an array, you can use PHP's built-in in_array function. Combined with some basic regex, I think that can get you what you want.
// SET THE DEFAULTS
$sentence = 'The foxes, birds, and leopard-owls live in the forest.';
$bad_words = array('forest', 'lake', 'meadow');
$bad_word_found = false;
// REMOVE PUNCTUATION & LOWERCASE
// "the foxes birds and leopard-owls live in the forest"
$sentence_scrub = trim(strtolower(preg_replace('/[^A-Z0-9 -]/i', '', $sentence)));
// SPLIT THE SENTENCE INTO CHUNKS
$sentence_bits = explode(' ', $sentence_scrub);
// LOOP THROUGH THE ARRAY AND CHECK TO SEE IF ANY OF THE
// - WORDS APPEAR IN THE BAD WORD ARRAY
foreach ($sentence_bits as $potential_bad_word) {
    if (in_array($potential_bad_word, $bad_words)) {
        $bad_word_found = true;
        break; // NO NEED TO KEEP CHECKING AFTER THE FIRST HIT
    }
}
if ($bad_word_found) {
    // DO SOMETHING HERE
}
else {
    // GO AHEAD AND WRITE TO THE DB
}
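The same check can also be written with array_intersect instead of an explicit loop. This is only a sketch: the function name contains_bad_word is hypothetical, and the scrubbing step is copied from the code above. One limitation of any word-by-word comparison is that list entries containing a space can never be caught, because no single chunk will ever equal a multi-word phrase.

```php
<?php
// Hypothetical helper name; not from the original answer.
function contains_bad_word(string $sentence, array $bad_words): bool
{
    // Keep letters, digits, spaces and hyphens, then trim and lowercase.
    $scrubbed = strtolower(trim(preg_replace('/[^A-Z0-9 -]/i', '', $sentence)));

    // Split on runs of whitespace so double spaces do not produce empty chunks.
    $chunks = preg_split('/\s+/', $scrubbed, -1, PREG_SPLIT_NO_EMPTY);

    // Any overlap between the chunks and the lowercased list counts as a hit.
    $hits = array_intersect($chunks, array_map('strtolower', $bad_words));

    return count($hits) > 0;
}

$bad_words = array('forest', 'lake', 'meadow');

var_dump(contains_bad_word('The foxes, birds, and leopard-owls live in the forest.', $bad_words)); // bool(true)
var_dump(contains_bad_word('The foxes live in the forestry office.', $bad_words));                 // bool(false)
```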
Upvotes: 0
Reputation: 5182
Using your code, it worked for me. That is, your example message does get saved to the db.
Here's what I have:
// Set up array of bad words in $config['bad_words']
// $config['bad_words'] = array(
// ...
// );
$imploded = implode('|', $config['bad_words']);
print "IMPLODED ARRAY: $imploded\n\n";
$bad_words_regex = "/\b$imploded\b/";
print "REGULAR EXPRESSION: $bad_words_regex\n\n";
$message = 'This is your last chance to decide between The Car, The Personality and the Lion ';
if (preg_match(strtolower($bad_words_regex), strtolower(trim($message))) == FALSE ) {
    print "SAVE\n";
}
else {
    print "DO NOT SAVE\n";
}
I'm calling $config['bad_words'] directly when imploding, not calling config_item.
I'm not sure, but the modified code above, with all those print statements, might point you in the right direction.
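As one more debugging aid, here is a minimal sketch that prints which text actually triggered the match, by passing a third argument to preg_match. The shortened word list is an assumption chosen for illustration (lake and meadow are placeholders; pis and trio come from the question's update), and the message is made up for the demo.

```php
<?php
// Illustrative list only; the real array comes from $config['bad_words'].
$bad_words = array('lake', 'pis', 'trio', 'meadow');

// Same ungrouped pattern as in the question: /\blake|pis|trio|meadow\b/
$bad_words_regex = "/\b" . implode('|', $bad_words) . "\b/";

$message = 'Watch the next episode ';

// $matches[0] receives the text matched by the whole pattern.
if (preg_match($bad_words_regex, strtolower(trim($message)), $matches)) {
    print "DO NOT SAVE, matched: {$matches[0]}\n";
} else {
    print "SAVE\n";
}
// prints "DO NOT SAVE, matched: pis"
```

Seeing the matched fragment makes it clear whether a rejection came from a whole word or from a piece of a longer word.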
Upvotes: 0
Reputation: 9601
I notice that you have included the word-boundary assertion \b in your code. I presume that you wrap these tokens around your bad_words...
The problem here might be that the \b tokens are not matching. Taking badwordz as an example, there is no word boundary at the end of badword, because the z that follows is also a word character; that position is a non-word boundary (\B).
You may have to experiment with different word boundaries, such as whitespace, if that is appropriate.
I would need a better look at the content you are applying your regex to, in order to craft a better expression.
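As a rough illustration of what experimenting with a whitespace boundary could look like, the sketch below compares \b with the lookarounds (?<!\S) and (?!\S), which require whitespace or the start or end of the string on each side of the word. The word lion and the sample strings are assumptions chosen for the demo, not taken from the bad-word list.

```php
<?php
$word = preg_quote('lion', '/');

// \b: a boundary between a word character (letter, digit, underscore) and anything else.
$word_bounded = '/\b' . $word . '\b/i';

// Lookarounds: no non-whitespace character directly before or after the word.
$space_bounded = '/(?<!\S)' . $word . '(?!\S)/i';

$samples = array('the Lion sleeps', 'a sea-lion, maybe', 'a stallion');

foreach ($samples as $sample) {
    printf(
        "%-20s word: %d  space: %d\n",
        $sample,
        preg_match($word_bounded, $sample),
        preg_match($space_bounded, $sample)
    );
}
```

The two patterns agree on the first and last samples and disagree on sea-lion: the hyphen and the comma count as boundaries for \b but not for the whitespace version, so the stricter pattern would also miss a word that is directly followed by punctuation.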
Upvotes: 0