integ specialist
integ specialist

Reputation: 105

Fastest way to search several strings in a string

Below is my code to find the occurrences of all the substrings in a given single string

public static void main(String... args) {
    String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
    String[] severalStringArray = { "one", "two", "three", "four" };
    Map<String, Integer> countMap = countWords(fullString, severalStringArray);
}

public static Map<String, Integer> countWords(String fullString, String[] severalStringArray) {
    Map<String, Integer> countMap = new HashMap<>();

    for (String searchString : severalStringArray) {
        if (countMap.containsKey(searchString)) {
            int searchCount = countMatchesInString(fullString, searchString);
            countMap.put(searchString, countMap.get(searchString) + searchCount);
        } else
            countMap.put(searchString, countMatchesInString(fullString, searchString));
    }

    return countMap;
}

private static int countMatchesInString(String fullString, String subString) {
    int count = 0;
    int pos = fullString.indexOf(subString);
    while (pos > -1) {
        count++;
        pos = fullString.indexOf(subString, pos + 1);
    }
    return count;
}

Assume the full string might be a full file read as a string. Is the above is the efficient way of search or any other better way or fastest way to do it?

Thanks

Upvotes: 4

Views: 878

Answers (4)

user4910279
user4910279

Reputation:

An implementation of the Trie tree that someone commented on. I don't know if it's fast or not.

static class Trie {

    static final long INC_NODE_NO = 1L << Integer.SIZE;

    private long nextNodeNo = 0;
    private Node root = new Node();
    private final Map<Long, Node> nodes = new HashMap<>();

    public void put(String word) {
        Node node = root;
        for (int i = 0, len = word.length(); i < len; ++i)
            node = node.put(word.charAt(i));
        node.data = word;
    }

    public List<String> findPrefix(String text, int start) {
        List<String> result = new ArrayList<>();
        Node node = root;
        for (int i = start, length = text.length(); i < length; ++i) {
            if ((node = node.get(text.charAt(i))) == null)
                break;
            String v = node.data;
            if (v != null)
                result.add(v);
        }
        return result;
    }

    public Map<String, Integer> find(String text) {
        Map<String, Integer> result = new HashMap<>();
        for (int i = 0, length = text.length(); i < length; ++i)
            for (String w : findPrefix(text, i))
                result.compute(w, (k, v) -> v == null ? 1 : v + 1);
        return result;
    }

    class Node {
        final long no;
        String data;

        Node() {
            this.no = nextNodeNo;
            nextNodeNo += INC_NODE_NO;
        }

        Node get(int key) {
            return nodes.get(no | key);
        }

        Node put(int key) {
            return nodes.computeIfAbsent(no | key, k -> new Node());
        }
    }
}

public static void main(String args[]) throws IOException {
    String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
    String[] severalStringArray = { "one", "two", "three", "four" };
    Trie trie = new Trie();
    for (String word : severalStringArray)
        trie.put(word);
    Map<String, Integer> count = trie.find(fullString);
    System.out.println(count);
}

output:

{four=3, one=2, three=2, two=1}

Upvotes: 0

Oleg Cherednik
Oleg Cherednik

Reputation: 18245

Actually, the fastest way is to scan the string first and count all existed words and save it into Map. Then select required words only.

Just be simple! The regular expression is too complicated and not efficient for this simple task. Let's solve it with a hummer!

public static void main(String... args) {
    String str = "one is a good one. two is ok. three is three. four is four. five is not four";
    Set<String> words = Set.of("one", "two", "three", "four");
    Map<String, Integer> map = countWords(str, words);
}

public static Map<String, Integer> countWords(String str, Set<String> words) {
    Map<String, Integer> map = new HashMap<>();

    for (int i = 0, j = 0; j <= str.length(); j++) {
        char ch = j == str.length() ? '\0' : str.charAt(j);

        if (j == str.length() || !isWordSymbol(ch)) {
            String word = str.substring(i, j);

            if (!word.isEmpty() && words.contains(word))
                map.put(word, map.getOrDefault(word, 0) + 1);

            i = j + 1;
        }
    }

    return map;
}

private static boolean isWordSymbol(char ch) {
    return Character.isLetter(ch) || ch == '-' || ch == '_';
}

Upvotes: 0

Marte Valerio Falcone
Marte Valerio Falcone

Reputation: 455

Regular expressions

Your problem can be solved using regex (regular expressions). Regular expressions are a tool that help you matching patterns in strings. This pattern can be a word or can be a set of chars.

Regular expressions in Java

In Java there are two Objects helping you with regular expressions: Pattern and Matcher.

Below you can see an example for searching if the word stackoverflow exists in the string stackoverflowXstackoverflowXXXstackoverflowXX in Java.

String pattern = "stackoverflow";
String stringToExamine = "stackoverflowXstackoverflowXXXstackoverflowXX";

Pattern patternObj = Pattern.compile(pattern);
Matcher matcherObj = patternObj.matcher(stringToExamine);

Counting how many occurrencies of a word in a given string

As written here you have different solution based on your Java version:

Java 9+

long matches = matcherObj.results().count();

Older Java versions

int count = 0;
while (matcherObj.find())
    count++;

Regular expressions in your problem

You use a method for calculating how many times a word is occurring in a text (a string), and you can modify it like this:

Java 9+

public static int matchesInString(String fullString, String pattern)
{
    Pattern patternObj = Pattern.compile(pattern);
    Matcher matcherObj = patternObj.matcher(fullString);
    
    return matcherObj.results().count();
}

Older Java versions

public static int matchesInString(String fullString, String pattern)
{
    int count = 0;

    Pattern patternObj = Pattern.compile(pattern);
    Matcher matcherObj = patternObj.matcher(fullString);
    
    while (matcherObj.find())
        count++;
        
    return count;
}

Upvotes: 1

Tim Biegeleisen
Tim Biegeleisen

Reputation: 520878

You could just form a regex alternation of words to search, and then do a single search against that regex:

public static int matchesInString(String fullString, String regex) {
    int count = 0;

    Pattern r = Pattern.compile(regex);
    Matcher m = r.matcher(fullString);

    while (m.find())
        ++count;

    return count;
}

String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
String[] severalStringArray = { "one", "two", "three", "four" };
String regex = "\\b(?:" + String.join("|", severalStringArray) + ")\\b";

int count = matchesInString(fullString, regex);
System.out.println("There were " + count + " matches in the input");

This prints:

There were 8 matches in the input

Note that the regex pattern used in the above example was:

\b(?:one|two|three|four)\b

Upvotes: 3

Related Questions