Kris B

Reputation: 3578

Find whole words without regex

I need to find whole words in a sentence without using regular expressions. For example, to find the word "the" in the sentence "The quick brown fox jumps over the lazy dog", I'm currently using:

 String text = "the, quick brown fox jumps over the lazy dog";
 String keyword = "the";

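 // Pattern.quote keeps regex metacharacters in the keyword from being interpreted
 Matcher matcher = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b").matcher(text);
 boolean contains = matcher.find();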

but if I used:

boolean contains = text.contains(keyword);

and padded the keyword with spaces, it wouldn't find the first "the" in the sentence, both because it isn't surrounded by whitespace and because of the punctuation after it.

To be clear, I'm building an Android app and I'm getting memory leaks. This might be because I'm using a regular expression in a ListView, so a regular-expression match is performed once for every item in the ListView.

Upvotes: 0

Views: 2332

Answers (7)

CodingNinja

Reputation: 81

I have a project that requires whole-word matching, but I can't use regular expressions (because regular expressions would misinterpret some of my keywords). I wrote my own code to simulate the regex \bxxx\b without regular expressions. I only know C#, and there it works fine.

using System;
using System.Text.RegularExpressions;

public static class Finder
{
    public static bool Find(string? input, string? pattern, bool isMatchCase = false, bool isMatchWholeWord = false, bool isMatchRegex = false)
    {
        if (String.IsNullOrWhiteSpace(input) || String.IsNullOrWhiteSpace(pattern))
        {
            return false;
        }

        if (!isMatchCase && !isMatchRegex)
        {
            input = input.ToLower();
            pattern = pattern.ToLower();
        }

        if (isMatchWholeWord && !isMatchRegex)
        {
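            int len = pattern.Length;
            int searchFrom = 0;

            while (true)
            {
                int start = input.IndexOf(pattern, searchFrom);

                if (start == -1)
                {
                    return false;
                }

                int end = start + len - 1;

                int prefix = start - 1;
                int suffix = end + 1;

                // Resume one character past this match start so that
                // overlapping candidates (e.g. "a a" in "ba a a") are not skipped.
                searchFrom = start + 1;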

                bool isPrefixMatched, isSuffixMatched;

                if (start == 0)
                {
                    isPrefixMatched = true;
                }
                else
                {
                    isPrefixMatched = IsWord(input[prefix]) != IsWord(input[start]);
                }

                if (end == input.Length - 1)
                {
                    isSuffixMatched = true;
                }
                else
                {
                    isSuffixMatched = IsWord(input[suffix]) != IsWord(input[end]);
                }

                if (isPrefixMatched && isSuffixMatched)
                {
                    return true;
                }
            }
        }

        if (isMatchRegex)
        {
            if (isMatchWholeWord)
            {
                if (!pattern.StartsWith(@"\b"))
                {
                    pattern = $@"\b{pattern}";
                }

                if (!pattern.EndsWith(@"\b"))
                {
                    pattern = $@"{pattern}\b";
                }
            }

            return Regex.IsMatch(input, pattern, isMatchCase ? RegexOptions.None : RegexOptions.IgnoreCase);
        }

        return input.Contains(pattern);
    }

    private static bool IsWord(char ch)
    {
        return Char.IsLetterOrDigit(ch) || ch == '_';
    }
}

Upvotes: 0

Michael

Reputation: 11

From the class comments of StringTokenizer:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

The following example illustrates how the String.split method can be used to break up a string into its basic tokens:

String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
    System.out.println(result[x]);

prints the following output:

this
is
a
test

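Iterate through the resulting string array, test each token for equality with your keyword, and keep a count.

int count = 0;
for (String s : result)
{
    // keyword is the word being searched for, as in the question
    if (s.equals(keyword)) count++;
}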

If this is a homework assignment, tell your lecturer to read up on Java; times have changed. I remember having the exact same stupid questions during school, and they do nothing to prepare you for the real world.

Upvotes: 0

RHT

Reputation: 5054

Simply iterate over the characters and keep storing them in a char buffer. Every time you see a whitespace, empty the buffer into a list of words and go on till you reach the end.
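A minimal sketch of that buffer approach in Java, in case it helps. The class and method names (BufferTokenizer, words, flush) are made up for illustration. Note that it only splits on whitespace, so punctuation stays attached to a word ("the," is not the same token as "the").

```java
import java.util.ArrayList;
import java.util.List;

final class BufferTokenizer {
    // Collects runs of non-whitespace characters into a list of words.
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(buffer, words);   // whitespace ends the current word
            } else {
                buffer.append(c);
            }
        }
        flush(buffer, words);           // the last word has no trailing whitespace
        return words;
    }

    // Moves the buffered characters into the list, if there are any.
    private static void flush(StringBuilder buffer, List<String> words) {
        if (buffer.length() > 0) {
            words.add(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

The keyword check is then a plain words(text).contains(keyword).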

Upvotes: 0

Stephen C

Reputation: 718788

What you do is search for "the". Then for each match you test to see if the surrounding characters are white space (or punctuation), or if the match is at the beginning / end of the string respectively.
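Here is one way this could look in Java, as a sketch rather than a definitive implementation. The names WholeWordSearch and containsWholeWord are hypothetical, and I am assuming that "not a letter or digit" is an acceptable definition of a word boundary (unlike the regex \b, this treats an underscore as a boundary too). The search is case-sensitive.

```java
final class WholeWordSearch {
    // Returns true if keyword occurs in text with no letter or digit
    // directly before or after it.
    static boolean containsWholeWord(String text, String keyword) {
        if (keyword.isEmpty()) {
            return false;
        }
        int from = 0;
        while (true) {
            int start = text.indexOf(keyword, from);
            if (start < 0) {
                return false;            // no more candidates
            }
            int end = start + keyword.length(); // index just past the match
            boolean boundaryBefore =
                    start == 0 || !Character.isLetterOrDigit(text.charAt(start - 1));
            boolean boundaryAfter =
                    end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (boundaryBefore && boundaryAfter) {
                return true;
            }
            from = start + 1;            // try the next candidate
        }
    }
}
```

With the text from the question, containsWholeWord("the, quick brown fox jumps over the lazy dog", "the") finds the first "the" even though a comma follows it, while "other" would not match. No Pattern or Matcher objects are created, which is the point for the ListView case.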

Upvotes: 1

Jason McCreary

Reputation: 72981

If you needed to check for multiple words and do it without regular expressions you could use StringTokenizer with a space as the delimiter.

You could then build a custom search method. Otherwise, the other solutions using String.contains() or String.indexOf() qualify.
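A rough sketch of such a custom search method. TokenizerSearch and containsWord are names invented here, and the delimiter set is passed in because a space alone leaves punctuation attached to the tokens; which punctuation characters to include is an assumption you would adjust.

```java
import java.util.StringTokenizer;

final class TokenizerSearch {
    // Returns true if any token of text, split on the given delimiter
    // characters, equals keyword exactly.
    static boolean containsWord(String text, String keyword, String delimiters) {
        StringTokenizer tokens = new StringTokenizer(text, delimiters);
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().equals(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, containsWord(text, "the", " ") splits on spaces only, whereas containsWord(text, "the", " ,.;:!?") also treats those punctuation characters as separators.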

Upvotes: 1

Jack Edmonds

Reputation: 33171

public int findWholeWord(final String text, final String searchString) {
    return (" " + text + " ").indexOf(" " + searchString + " ");
}

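This will give you the index of the first occurrence of the search word (here "the") as a space-delimited word, or -1 if there is no such occurrence. Note that only spaces count as word boundaries, so a word followed by punctuation, such as "the,", is not matched.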

Upvotes: 1

Jacob Mattison

Reputation: 51052

Split the string on space, and then see if the resulting array contains your word.
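Something along these lines, as a small sketch (SplitSearch and containsWord are illustrative names). Be aware that String.split takes a regular expression as its argument, even when that argument is a single space, and that punctuation stays attached to the pieces.

```java
import java.util.Arrays;

final class SplitSearch {
    // Splits text on single spaces and checks whether any piece equals keyword.
    static boolean containsWord(String text, String keyword) {
        return Arrays.asList(text.split(" ")).contains(keyword);
    }
}
```

So containsWord("jumps over the lazy dog", "the") succeeds, but with "the, quick brown fox" the first piece is "the," and does not equal "the".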

Upvotes: 0
