Connor Spencer Harries

Reputation: 878

Running multiple regex patterns on String

Assuming I have a List<String> and an empty List<Pattern>, is this the best way to turn the words in the list into Pattern objects?

for(String word : stringList) {
    patterns.add(Pattern.compile("\\b(" + word + ")\\b"));
}

And then to run these against a string later:

for(Pattern pattern : patterns) {
    Matcher matcher = pattern.matcher(myString);
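    // find() looks for the word anywhere in the string; matches() would require the whole string to match
    if(matcher.find()) {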
         myString = matcher.replaceAll("String[$1]");
    }
}

The replaceAll bit is just an example, but $1 would be used most of the time when I use this.

Is there a more efficient way? This feels somewhat clunky. The list currently holds 80 Strings, but the Strings are configurable, so there won't always be that many.

This is designed to be somewhat of a swearing filter, so I'll let you assume the words in the List.

An example of input would be "You're a <curse>", and the output would be "You're a *****" for this word. This may not always be the case, though: at some point I may be reading from a HashMap<String, String> where the key is the capture group and the value is the replacement.

Example:

if(hashMap.get(matcher.group(1)) == null) {
    // No escaping needed: only \ and $ are special in a replacement string.
    myString = matcher.replaceAll("****");
} else {
    myString = matcher.replaceAll(hashMap.get(matcher.group(1)));
}

Upvotes: 4

Views: 8425

Answers (3)

SkateScout

Reputation: 870

The idea from Boann is already good. In my case, log filtering, I have a large list of filters: each log line is matched against their regexes, and I need to know which filter matched. To do that I also encode the other filter fields (module, code, level, etc.) as regex, wrap each filter in its own capturing group, and after a match check which group matched.

1) So each line is only checked once.

2) Since all the regexes are built into one matcher, each char is only checked once.

This is an extreme improvement, from N (the number of conditions) to nearly 1 (roughly constant for nearly any number of filters).

public static void main(final String[] argc) throws Throwable {
    Config c;
    try(InputStream s = new FileInputStream("webapp/WEB-INF/logScanConfig.xml")) { c = (Config) JAXBContext.newInstance(Config.class).createUnmarshaller().unmarshal(s); }
    final LineContext[] a = c.rules.toArray(new LineContext[c.rules.size()]);
    final StringBuilder regex = new StringBuilder();
    for(int i=0;i<a.length;i++) {
        final LineContext e = a[i];
        final String p ="(^"+
                (e.modul == null?".*":e.modul)+" ; "+
                (e.code  == null?".*":e.code )+" ; "+
                (e.mesg  == null?".*":e.mesg )+" ; "+
                (e.level == null?".*":e.level)+" ; "+
                (e.regex == null?".*":e.regex)+"$)";
        if(regex.length()>0) regex.append("|");
        regex.append(p);
    }

    final Pattern pattern = Pattern.compile(regex.toString(), Pattern.DOTALL);
    final Matcher m = pattern.matcher("ISS ; 0025 ; 0008 ; I ; State Manager started");
    if(!m.matches()) {
        System.out.println("Not Found");
    } else {
        System.out.println("GroupCount: "+m.groupCount()+" A["+a.length+"]");
        for(int i=1;i<=m.groupCount();i++) {
            if(null != m.group(i)) {
                System.out.println("GROUP["+(i-1)+"]: "+m.group(i));
                System.out.println(a[i-1]);
            }
        }
    }
}

Here is an example logScanConfig.xml:

<logScanConfig user="private.1" pass="private.2">
 <logUrls>
  <e>http://private.3:80/fetch/log</e>
  <e>http://private.4:80/fetch/log</e>
  <e>http://private.5:80/fetch/log</e>
 </logUrls>
 <rules>
  <e backlogTime='600' minCount='0' maxCount='0' modul='ART' code='0114' mesg='1007' level='E'><regex>.*ORA-27101: shared memory realm does not exist.*</regex></e>
  <e backlogTime='600' minCount='0' maxCount='0' modul='ISS' code='0098'             level='E'><regex>Insufficient memory .*</regex></e>
 </rules>
</logScanConfig>
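The same idea can be shown without JAXB or the config file. The following is only a minimal sketch: the class name `WhichRuleMatched`, the helper names, and the two sample rules are made up for illustration, and it assumes the rules contain no capturing groups of their own (otherwise the group numbers would no longer line up with the rule indexes).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhichRuleMatched {
    // Wraps every rule in its own capturing group and joins them with "|".
    // Assumption: the rules themselves contain no capturing groups.
    static Pattern combine(String[] rules) {
        StringBuilder regex = new StringBuilder();
        for (String rule : rules) {
            if (regex.length() > 0) regex.append("|");
            regex.append("(^").append(rule).append("$)");
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Returns the index of the rule whose group took part in the match, or -1.
    static int matchingRule(Pattern combined, String line) {
        Matcher m = combined.matcher(line);
        if (!m.matches()) return -1;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) return i - 1; // group i belongs to rule i-1
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] rules = {                 // hypothetical rules
            "ART ; 0114 ; .*",
            "ISS ; .* ; .* ; I ; .*"
        };
        Pattern combined = combine(rules); // compile once, reuse for every line
        System.out.println(matchingRule(combined,
                "ISS ; 0025 ; 0008 ; I ; State Manager started")); // prints 1
    }
}
```

Compiling the combined pattern once and reusing it for every log line is what avoids looping over the individual filters.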

Upvotes: 2

Boann

Reputation: 50041

You can join these patterns together using alternation with |:

Pattern pattern = Pattern.compile("\\b(" + String.join("|",stringList) + ")\\b");
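As a rough sketch of how the combined pattern would be applied, the class below builds it once and reuses the `String[$1]` replacement from the question. The class name `CombinedFilter`, the method names, and the sample words are made up for illustration, and the words are assumed to contain no regex metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CombinedFilter {
    // One pattern that matches any of the listed words as a whole word.
    // Assumption: the words contain no regex metacharacters.
    static Pattern build(List<String> words) {
        return Pattern.compile("\\b(" + String.join("|", words) + ")\\b");
    }

    // $1 refers to whichever alternative actually matched.
    static String tag(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("String[$1]");
    }

    public static void main(String[] args) {
        Pattern pattern = build(Arrays.asList("foo", "bar"));
        System.out.println(tag(pattern, "a foo and a bar, but not foobar"));
        // prints: a String[foo] and a String[bar], but not foobar
    }
}
```

A single pass with replaceAll handles every word, so no per-word loop or prior match check is needed.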

If you cannot use Java 8 and so do not have the String.join method, or if you need to escape the words so that characters in them are not interpreted as regex metacharacters, you will need to build this regex with a manual loop:

StringBuilder regex = new StringBuilder("\\b(");
for (String word : stringList) {
    regex.append(Pattern.quote(word));
    regex.append("|");
}
regex.setLength(regex.length() - 1); // delete last added "|"
regex.append(")\\b");
Pattern pattern = Pattern.compile(regex.toString());

To use different replacements for the different words, you can apply the pattern with this loop:

Matcher m = pattern.matcher(myString);
StringBuilder out = new StringBuilder();
int pos = 0;
while (m.find()) {
    out.append(myString, pos, m.start());
    String matchedWord = m.group(1);
    String replacement = matchedWord.replaceAll(".", "*");
    out.append(replacement);
    pos = m.end();
}
out.append(myString, pos, myString.length());
myString = out.toString();

You can look up the replacement for the matched word any way you like. The example generates a replacement string of asterisks of the same length as the matched word.
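For the HashMap case mentioned in the question, the lookup could go roughly like this. It is only a sketch under assumptions: the class name `MapReplacer`, the method names, the sample words, and the fixed `****` fallback for words without a map entry are illustrative choices, not something the question specifies in detail.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MapReplacer {
    // Quote each word, then join them into \b(word1|word2|...)\b
    static Pattern build(List<String> words) {
        return Pattern.compile(words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "\\b(", ")\\b")));
    }

    static String replace(Pattern pattern, Map<String, String> replacements, String input) {
        Matcher m = pattern.matcher(input);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start());
            // Use the configured replacement, or a fixed mask if there is none
            out.append(replacements.getOrDefault(m.group(1), "****"));
            pos = m.end();
        }
        out.append(input, pos, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern pattern = build(Arrays.asList("darn", "heck")); // sample words
        Map<String, String> replacements = new HashMap<>();
        replacements.put("heck", "h***");
        System.out.println(replace(pattern, replacements, "Well darn, what the heck"));
        // prints: Well ****, what the h***
    }
}
```

Because the replacement is appended directly to the StringBuilder, characters such as `$` or `\` in a configured value are inserted literally.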

Upvotes: 5

Sergey Kalinichenko

Reputation: 726809

If you do the same thing no matter what word is matched, you could compose a big "OR" expression from your words, and use a single pattern, like this:

\\b(<word1>|<word2>|...|<wordN>)\\b

where <wordK> should be replaced with your words in a loop:

StringBuilder res = new StringBuilder("\\b(");
boolean first = true;
for(String word : stringList) {
    if (!first) {
        res.append("|");
    } else {
        first = false;
    }
    res.append(word);
}
res.append(")\\b");
Pattern p = Pattern.compile(res.toString());

Note: This solution assumes that words are free of regex metacharacters.
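To see why that assumption matters, here is a small sketch comparing a word inserted as-is with the same word passed through Pattern.quote. The class name `QuoteDemo`, the `mask` helper, and the sample word `a.c` are made up for illustration.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Masks occurrences of word in input, optionally quoting the word first.
    static String mask(String word, String input, boolean quote) {
        String body = quote ? Pattern.quote(word) : word;
        return Pattern.compile("\\b(" + body + ")\\b")
                .matcher(input)
                .replaceAll("***");
    }

    public static void main(String[] args) {
        // Unquoted, the '.' matches any character, so "abc" is masked too
        System.out.println(mask("a.c", "abc a.c", false)); // prints: *** ***
        // Quoted, only the literal text "a.c" is masked
        System.out.println(mask("a.c", "abc a.c", true));  // prints: abc ***
    }
}
```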

Upvotes: 1
