Reputation: 105
Below is my code to find the occurrences of all the substrings in a given single string
public static void main(String... args) {
String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
String[] severalStringArray = { "one", "two", "three", "four" };
Map<String, Integer> countMap = countWords(fullString, severalStringArray);
}
public static Map<String, Integer> countWords(String fullString, String[] severalStringArray) {
Map<String, Integer> countMap = new HashMap<>();
for (String searchString : severalStringArray) {
if (countMap.containsKey(searchString)) {
int searchCount = countMatchesInString(fullString, searchString);
countMap.put(searchString, countMap.get(searchString) + searchCount);
} else
countMap.put(searchString, countMatchesInString(fullString, searchString));
}
return countMap;
}
private static int countMatchesInString(String fullString, String subString) {
int count = 0;
int pos = fullString.indexOf(subString);
while (pos > -1) {
count++;
pos = fullString.indexOf(subString, pos + 1);
}
return count;
}
Assume the full string might be a full file read as a string. Is the above is the efficient way of search or any other better way or fastest way to do it?
Thanks
Upvotes: 4
Views: 878
Reputation:
An implementation of the Trie tree that someone commented on. I don't know if it's fast or not.
static class Trie {
static final long INC_NODE_NO = 1L << Integer.SIZE;
private long nextNodeNo = 0;
private Node root = new Node();
private final Map<Long, Node> nodes = new HashMap<>();
public void put(String word) {
Node node = root;
for (int i = 0, len = word.length(); i < len; ++i)
node = node.put(word.charAt(i));
node.data = word;
}
public List<String> findPrefix(String text, int start) {
List<String> result = new ArrayList<>();
Node node = root;
for (int i = start, length = text.length(); i < length; ++i) {
if ((node = node.get(text.charAt(i))) == null)
break;
String v = node.data;
if (v != null)
result.add(v);
}
return result;
}
public Map<String, Integer> find(String text) {
Map<String, Integer> result = new HashMap<>();
for (int i = 0, length = text.length(); i < length; ++i)
for (String w : findPrefix(text, i))
result.compute(w, (k, v) -> v == null ? 1 : v + 1);
return result;
}
class Node {
final long no;
String data;
Node() {
this.no = nextNodeNo;
nextNodeNo += INC_NODE_NO;
}
Node get(int key) {
return nodes.get(no | key);
}
Node put(int key) {
return nodes.computeIfAbsent(no | key, k -> new Node());
}
}
}
public static void main(String args[]) throws IOException {
String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
String[] severalStringArray = { "one", "two", "three", "four" };
Trie trie = new Trie();
for (String word : severalStringArray)
trie.put(word);
Map<String, Integer> count = trie.find(fullString);
System.out.println(count);
}
output:
{four=3, one=2, three=2, two=1}
Upvotes: 0
Reputation: 18245
Actually, the fastest way is to scan the string first and count all existed words and save it into Map
. Then select required words only.
Just be simple! The regular expression
is too complicated and not efficient for this simple task. Let's solve it with a hummer!
public static void main(String... args) {
String str = "one is a good one. two is ok. three is three. four is four. five is not four";
Set<String> words = Set.of("one", "two", "three", "four");
Map<String, Integer> map = countWords(str, words);
}
public static Map<String, Integer> countWords(String str, Set<String> words) {
Map<String, Integer> map = new HashMap<>();
for (int i = 0, j = 0; j <= str.length(); j++) {
char ch = j == str.length() ? '\0' : str.charAt(j);
if (j == str.length() || !isWordSymbol(ch)) {
String word = str.substring(i, j);
if (!word.isEmpty() && words.contains(word))
map.put(word, map.getOrDefault(word, 0) + 1);
i = j + 1;
}
}
return map;
}
private static boolean isWordSymbol(char ch) {
return Character.isLetter(ch) || ch == '-' || ch == '_';
}
Upvotes: 0
Reputation: 455
Your problem can be solved using regex (regular expressions). Regular expressions are a tool that help you matching patterns in strings. This pattern can be a word or can be a set of chars.
In Java there are two Objects helping you with regular expressions: Pattern
and Matcher
.
Below you can see an example for searching if the word stackoverflow exists in the string stackoverflowXstackoverflowXXXstackoverflowXX in Java.
String pattern = "stackoverflow";
String stringToExamine = "stackoverflowXstackoverflowXXXstackoverflowXX";
Pattern patternObj = Pattern.compile(pattern);
Matcher matcherObj = patternObj.matcher(stringToExamine);
As written here you have different solution based on your Java version:
Java 9+
long matches = matcherObj.results().count();
Older Java versions
int count = 0; while (matcherObj.find()) count++;
You use a method for calculating how many times a word is occurring in a text (a string), and you can modify it like this:
public static int matchesInString(String fullString, String pattern)
{
Pattern patternObj = Pattern.compile(pattern);
Matcher matcherObj = patternObj.matcher(fullString);
return matcherObj.results().count();
}
public static int matchesInString(String fullString, String pattern)
{
int count = 0;
Pattern patternObj = Pattern.compile(pattern);
Matcher matcherObj = patternObj.matcher(fullString);
while (matcherObj.find())
count++;
return count;
}
Upvotes: 1
Reputation: 520878
You could just form a regex alternation of words to search, and then do a single search against that regex:
public static int matchesInString(String fullString, String regex) {
int count = 0;
Pattern r = Pattern.compile(regex);
Matcher m = r.matcher(fullString);
while (m.find())
++count;
return count;
}
String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
String[] severalStringArray = { "one", "two", "three", "four" };
String regex = "\\b(?:" + String.join("|", severalStringArray) + ")\\b";
int count = matchesInString(fullString, regex);
System.out.println("There were " + count + " matches in the input");
This prints:
There were 8 matches in the input
Note that the regex pattern used in the above example was:
\b(?:one|two|three|four)\b
Upvotes: 3