Reputation: 125
I am currently working on a Discord bot that filters messages. My problem occurs when a filtered word is contained in another filtered word, which triggers duplicate messages.
This is my filter.txt:
sad
sadness
sadnesses
Since "sad" can be found in "sadness" as well, I get a false-positive for "sad" whenever "sadness" is written.
Is it possible to detect only the exact word in a message? For example, for the message "I want to be happy, because sadness is bad", only "sadness" should be detected.
I hope you understand what I mean.
Code:
public void onGuildMessageReceived(GuildMessageReceivedEvent e) {
    File file = new File("src/filter.txt");
    try {
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            if(!line.startsWith("#")) {
                if(e.getMessage().getContentRaw().contains(line)) {
                    User user = e.getJDA().getUserById(e.getAuthor().getIdLong());
                    e.getMessage().delete().queue();
                    user.openPrivateChannel().queue(privateChannel -> {
                        privateChannel.sendMessage("Bitte achte auf deine Sprache!").queue();
                    });
                }
            }
        }
    } catch (IOException e1) {}
}
Upvotes: 5
Views: 9281
Reputation: 842
As Cardinal - Reinstate Monica and Hades already said, you should take a look at regex.
'Regex' stands for 'Regular expression' and describes search patterns for strings.
There is a lot you can do using regex, so if you want to know more about it, check out a tutorial.
(It's the first one I found when googling; you can of course use any tutorial you like.)
For your use case I would suggest the following:
First off, don't use String.contains(), as it only checks for a literal substring and does not accept a regex.
Use String.matches() instead with the following regex:
"(?is).*\\bSTRING\\b.*"
The doubled backslashes are only there because of Java string escaping; without it, the regex looks like this:
(?is).*\bSTRING\b.*
I will explain how it works.
\b
\b matches a word boundary. Word characters are a-z, A-Z, 0-9 and _. Any combination of these characters is considered a word.
This has the advantage that you can match the word sad even when it is followed by punctuation: a . at the end of the sentence doesn't influence the detection.
When using sadness, it won't match sad, as the word continues afterwards.
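To make the word-boundary behaviour concrete, here is a minimal sketch. The class name WordBoundaryDemo and the method containsWord are made up for illustration. It uses Matcher.find() together with Pattern.quote(), which is a slightly different approach from the String.matches() call in this answer, so no surrounding .* is needed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class WordBoundaryDemo {

    // Returns true if `word` occurs in `message` as a whole word.
    static boolean containsWord(String message, String word) {
        // Pattern.quote makes the word match literally, so characters
        // such as '.' or '(' inside it are not treated as regex syntax.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("I am sad.", "sad"));          // true
        System.out.println(containsWord("sadness is bad", "sad"));     // false
        System.out.println(containsWord("sadness is bad", "sadness")); // true
        System.out.println(containsWord("so_sad", "sad"));             // false, '_' is a word character
    }
}
```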
.*
. matches any character except some line breaks. ((?s) helps me out here.)
* basically says that the part in front of it occurs zero or more times.
By using a .* before and after the string, the regex is fine with any character or combination of characters (including no characters) surrounding the string.
That's important, because String.matches() only succeeds if the regex matches the whole message. This way the word can appear anywhere in a sentence and will still match.
(?is)
?i and ?s enable certain modes.
?i makes the regex case insensitive. This means it doesn't matter if it's sadness, SADNESS or sAdNeSs; all three will match.
?s enables the 'single line mode', which just means that . matches line breaks as well.
?i and ?s can be combined to (?is) and then placed in front of the regex.
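The effect of the two modes can be checked in isolation. The following is a small sketch with a made-up class FlagDemo and a fixed word (sadness); the flag prefix is passed in so that each mode can be switched on separately.

```java
// Hypothetical demo class, not part of the original answer.
final class FlagDemo {

    // Prepends the given inline flags (for example "(?is)") to the regex
    // from this answer and tests it against the whole message.
    static boolean matchesWith(String flags, String message) {
        return message.matches(flags + ".*\\bsadness\\b.*");
    }

    public static void main(String[] args) {
        // Case differs: needs ?i
        System.out.println(matchesWith("(?s)", "So much SADNESS here"));   // false
        System.out.println(matchesWith("(?is)", "So much SADNESS here"));  // true
        // Message spans two lines: needs ?s so that . can cross the line break
        System.out.println(matchesWith("(?i)", "first line\nsadness"));    // false
        System.out.println(matchesWith("(?is)", "first line\nsAdNeSs"));   // true
    }
}
```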
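Instead of STRING you just have to insert your word like this:
"(?is).*\\b" + line + "\\b.*"
If a filter word may contain regex metacharacters such as . or (, wrap it in Pattern.quote(line) so that it is matched literally.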
Your code would look like this in the end:
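public void onGuildMessageReceived(GuildMessageReceivedEvent e) {
    File file = new File("src/filter.txt");
    String content = e.getMessage().getContentRaw();
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = br.readLine()) != null) {
            String word = line.trim();
            // skip blank lines and comment lines starting with '#'
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            // Pattern.quote (java.util.regex.Pattern) makes the filter word match literally
            if (content.matches("(?is).*\\b" + Pattern.quote(word) + "\\b.*")) {
                e.getMessage().delete().queue();
                // the author is already available on the event, no lookup by ID needed
                e.getAuthor().openPrivateChannel().queue(privateChannel -> {
                    // German for "Please watch your language!"
                    privateChannel.sendMessage("Bitte achte auf deine Sprache!").queue();
                });
            }
        }
    } catch (IOException e1) {
        // don't swallow the error silently
        e1.printStackTrace();
    }
}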
If you want it to send only one warning per message (thus stopping after the first match), you could just insert a return; inside the if block, after the message to the user has been queued.
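Stopping at the first match also combines well with loading the word list once instead of reading the file for every message. The following is only a sketch under assumptions: the class PrecompiledFilter and its method isBlocked are hypothetical names, the lines are assumed to come from something like Files.readAllLines, and it uses Matcher.find() with the CASE_INSENSITIVE flag instead of the (?is).* form shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class PrecompiledFilter {
    private final List<Pattern> patterns = new ArrayList<>();

    // `lines` are the raw lines of the filter file.
    PrecompiledFilter(List<String> lines) {
        for (String line : lines) {
            String word = line.trim();
            // Ignore blank lines and '#' comments.
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            // Compile each whole-word pattern once, up front.
            patterns.add(Pattern.compile(
                    "\\b" + Pattern.quote(word) + "\\b",
                    Pattern.CASE_INSENSITIVE));
        }
    }

    // Returns true as soon as one filter word is found, so a caller
    // sends at most one warning per message.
    boolean isBlocked(String message) {
        for (Pattern p : patterns) {
            if (p.matcher(message).find()) {
                return true;
            }
        }
        return false;
    }
}
```

In the listener, the filter would be created once (for example in the constructor) and isBlocked would be called with e.getMessage().getContentRaw().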
Upvotes: 2
Reputation: 6134
You could also try a string-searching algorithm such as Aho-Corasick, but that would require implementing a proper signature table. An algorithm like this scales much better to a larger list of words.
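As a rough idea of what the Aho-Corasick approach could look like, here is a compact sketch built as a trie with failure links. It is not a tuned implementation and does not follow any particular table layout; the class name AhoCorasickSketch, the method findWholeWords and the whole-word check are my own additions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch, not part of the original answer.
final class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> words) {
        // 1. Build the trie from all (lower-cased) words.
        for (String w : words) {
            String lower = w.toLowerCase(Locale.ROOT);
            Node n = root;
            for (char c : lower.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.outputs.add(lower);
        }
        // 2. Breadth-first pass to set the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Words ending at the failure node also end here.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every filter word that occurs as a whole word.
    List<String> findWholeWords(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            while (n != root && !n.next.containsKey(c)) {
                n = n.fail;
            }
            Node nx = n.next.get(c);
            n = (nx != null) ? nx : root;
            for (String w : n.outputs) {
                int start = i - w.length() + 1;
                boolean leftOk = start == 0 || !isWordChar(s.charAt(start - 1));
                boolean rightOk = i + 1 == s.length() || !isWordChar(s.charAt(i + 1));
                if (leftOk && rightOk) {
                    hits.add(w);
                }
            }
        }
        return hits;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }
}
```

The text is scanned only once regardless of how many words are in the list, which is where the advantage for large lists comes from.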
Note that such algorithms are easily circumvented. Simply adding whitespace or using 1337 character replacement would outsmart a naive word filter.
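A quick sketch of how easily that happens, using a whole-word regex for sad as a stand-in for any naive filter (the class CircumventionDemo is a made-up name):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class CircumventionDemo {
    private static final Pattern SAD =
            Pattern.compile("\\bsad\\b", Pattern.CASE_INSENSITIVE);

    // Returns true if the naive whole-word filter would flag the message.
    static boolean flagged(String message) {
        return SAD.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(flagged("I am sad"));   // true
        System.out.println(flagged("I am s a d")); // false, extra whitespace
        System.out.println(flagged("I am s4d"));   // false, character replacement
    }
}
```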
Upvotes: 0