Fxguy1

Reputation: 85

Java Searching String Contents for partial match

I'm working on a project where I need to search a paragraph of text for a particular string. However, I don't need an exact match, more of a percentage match.

For example, here is the paragraph of text I'm searching:

Fluticasone Propionate Nasal Spray, USP 50 mcg per spray is a 
corticosteroid indicated for the management of the nasal symptoms of 
perennial nonallergic rhinitis in adult and pediatric patients aged 4 years 
and older.

And then I'm searching to see if any words in the following lines match the paragraph:

1)Unspecified acute lower respiratory infection
2)Vasomotor rhinitis
3)Allergic rhinitis due to pollen
4)Other seasonal allergic rhinitis
5)Allergic rhinitis due to food
6)Allergic rhinitis due to animal (cat) (dog) hair and dander
7)Other allergic rhinitis
8)"Allergic rhinitis, unspecified"
9)Chronic rhinitis
10)Chronic nasopharyngitis

My initial approach was to use a boolean with contains:

    boolean found = med[x].toLowerCase().contains(condition[y].toLowerCase());

However, the result is false on every pass through the loop.

The results I expect would be:

1) False
2) True
3) True
4) True
5) True
6) True
7) True
8) True
9) True
10) False

Very new to Java and its methods. Basically if any word in A matches any word in B then flag it as true. How do I do that?

Thanks!

Upvotes: 3

Views: 2559

Answers (3)

BretC

Reputation: 4199

This will give you a 'crude' match percentage.

Here's how it works:

  1. Split the text to search and the search term into a set of words. This is done by splitting using a regular expression. Each word is converted to upper case and added to a set.

  2. Count how many words in the search term appear in the text.

  3. Calculate the percentage of words in the search term that appear in the text.

You might want to enhance this by stripping out common words like 'a', 'the' etc.

    import java.util.Arrays;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class CrudeTextMatchThingy {

        public static void main(String[] args) {
            String searchText = "Fluticasone Propionate Nasal Spray, USP 50 mcg per spray is a \n" +
                    "corticosteroid indicated for the management of the nasal symptoms of \n" +
                    "perennial nonallergic rhinitis in adult and pediatric patients aged 4 years \n" +
                    "and older.";

            String[] searchTerms = {
                "Unspecified acute lower respiratory infection",
                "Vasomotor rhinitis",
                "Allergic rhinitis due to pollen",
                "Other seasonal allergic rhinitis",
                "Allergic rhinitis due to food",
                "Allergic rhinitis due to animal (cat) (dog) hair and dander",
                "Other allergic rhinitis",
                "Allergic rhinitis, unspecified",
                "Chronic rhinitis",
                "Chronic nasopharyngitis"
            };

            Arrays.stream(searchTerms).forEach(searchTerm -> {
                double matchPercent = findMatch(searchText, searchTerm);
                System.out.println(matchPercent + "% - " + searchTerm);
            });
        }

        private static double findMatch(String searchText, String searchTerm) {
            Set<String> wordsInSearchText = getWords(searchText);
            Set<String> wordsInSearchTerm = getWords(searchTerm);

            double wordsInSearchTermThatAreFound = wordsInSearchTerm.stream()
                    .filter(s -> wordsInSearchText.contains(s))
                    .count();

            return (wordsInSearchTermThatAreFound / wordsInSearchTerm.size()) * 100.0;
        }

        private static Set<String> getWords(String term) {
            return Arrays.stream(term.split("\\b"))
                    .map(String::trim)
                    .map(String::toUpperCase)
                    .filter(s -> s.matches("[A-Z0-9]+"))
                    .collect(Collectors.toSet());
        }
    }

Output:

    0.0% - Unspecified acute lower respiratory infection
    50.0% - Vasomotor rhinitis
    20.0% - Allergic rhinitis due to pollen
    25.0% - Other seasonal allergic rhinitis
    20.0% - Allergic rhinitis due to food
    20.0% - Allergic rhinitis due to animal (cat) (dog) hair and dander
    33.33333333333333% - Other allergic rhinitis
    33.33333333333333% - Allergic rhinitis, unspecified
    50.0% - Chronic rhinitis
    0.0% - Chronic nasopharyngitis

If you want true or false rather than a percentage, you can just do:

    boolean matches = findMatch(searchText, searchTerm) > 0.0;

Hope this helps.

Upvotes: 1

jbx

Reputation: 22128

You have to first tokenize one of the strings. What you are doing now is trying to match the whole line.

Something like this should work:

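    // Requires: import java.util.Arrays;
    String text = med[x].toLowerCase();
    boolean found =
        Arrays.stream(condition[y].split(" "))
            .map(String::toLowerCase)
            .map(s -> s.replaceAll("\\W", ""))   // strip punctuation
            .filter(s -> !s.isEmpty())            // drop blank tokens
            .anyMatch(text::contains);            // substring test against the paragraph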

I've added the removal of punctuation characters and blank strings so that we don't get false matches on those. (The \\W actually removes every character that is not in [A-Za-z_0-9], but you can change it to whatever you like.)
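One thing to watch with the `text::contains` check: `String.contains` is a substring test, so a short token can match inside a longer word. A minimal sketch of the difference, using a fragment of the question's paragraph (the class and method names are made up for illustration):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

class SubstringVsWordMatch {

    // Substring check: true if the token occurs anywhere inside the text.
    static boolean substringMatch(String text, String token) {
        return text.toLowerCase().contains(token.toLowerCase());
    }

    // Whole-word check: true only if the token equals one of the text's words.
    static boolean wordMatch(String text, String token) {
        Set<String> words = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toSet());
        return words.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        String text = "indicated for the management of the nasal symptoms of "
                + "perennial nonallergic rhinitis";
        System.out.println(substringMatch(text, "to"));       // true, found inside "symptoms"
        System.out.println(wordMatch(text, "to"));            // false
        System.out.println(substringMatch(text, "allergic")); // true, found inside "nonallergic"
        System.out.println(wordMatch(text, "allergic"));      // false
    }
}
```

Whether the substring behaviour is a bug or a feature depends on what you want "allergic" versus "nonallergic" to mean for your data.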

If you need this to be efficient, because you have a lot of text, you might want to turn it around and use a Set, which has faster lookup. Note that a Set lookup matches whole words rather than substrings.

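    // Requires: java.util.Arrays, java.util.Set,
    //           java.util.stream.Collectors, java.util.stream.Stream
    private Stream<String> tokenize(String text) {
        return Arrays.stream(text.split(" "))
                .map(String::toLowerCase)
                .map(word -> word.replaceAll("\\W", ""))   // strip punctuation
                .filter(word -> !word.isEmpty());          // drop blank tokens
    }

    // Build the set of paragraph words once, then look each condition word up in it.
    Set<String> words = tokenize(med[x]).collect(Collectors.toSet());

    boolean found = tokenize(condition[y]).anyMatch(words::contains);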
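For anyone who wants to try the Set version end to end, here is a rough self-contained sketch run against the paragraph and the ten lines from the question. The class name and the helper names `wordsOf` and `sharesAnyWord` are just for illustration, and it splits on any whitespace (`\\s+`) rather than a single space.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WordOverlapDemo {

    // Lowercase, split on whitespace, strip non-word characters, drop blanks.
    static Stream<String> tokenize(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(String::toLowerCase)
                .map(word -> word.replaceAll("\\W", ""))
                .filter(word -> !word.isEmpty());
    }

    // Collect the distinct words of a text into a set for fast lookup.
    static Set<String> wordsOf(String text) {
        return tokenize(text).collect(Collectors.toSet());
    }

    // True if at least one word of the phrase is in the given word set.
    static boolean sharesAnyWord(Set<String> words, String phrase) {
        return tokenize(phrase).anyMatch(words::contains);
    }

    public static void main(String[] args) {
        String paragraph = "Fluticasone Propionate Nasal Spray, USP 50 mcg per spray is a "
                + "corticosteroid indicated for the management of the nasal symptoms of "
                + "perennial nonallergic rhinitis in adult and pediatric patients aged 4 years "
                + "and older.";
        String[] conditions = {
            "Unspecified acute lower respiratory infection",
            "Vasomotor rhinitis",
            "Allergic rhinitis due to pollen",
            "Other seasonal allergic rhinitis",
            "Allergic rhinitis due to food",
            "Allergic rhinitis due to animal (cat) (dog) hair and dander",
            "Other allergic rhinitis",
            "\"Allergic rhinitis, unspecified\"",
            "Chronic rhinitis",
            "Chronic nasopharyngitis"
        };

        Set<String> words = wordsOf(paragraph);   // built once, reused for every line
        for (int i = 0; i < conditions.length; i++) {
            System.out.println((i + 1) + ") " + sharesAnyWord(words, conditions[i]));
        }
    }
}
```

With this input, lines 1 and 10 print false and lines 2 to 9 print true, which is the result the question asks for.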

You might also want to filter out stop words, such as "to" and "and". You could take a stop-word list and add an extra filter after the one that checks for blank strings, to check that the string is not a stop word.
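As a rough sketch of that extra filter: the stop-word list below is a tiny hand-picked one for illustration only (a real list would be much longer), and the class and method names are made up. The phrase "Hair and dander" is an invented test input, not one of the lines from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StopWordFilterDemo {

    // Illustrative stop words only; swap in a proper list for real use.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "for", "in", "is", "of", "the", "to"));

    // Same tokenizer as before, plus one more filter that drops stop words.
    static Stream<String> tokenize(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(String::toLowerCase)
                .map(word -> word.replaceAll("\\W", ""))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word));
    }

    // True if the two texts share at least one non-stop word.
    static boolean sharesAnyWord(String paragraph, String phrase) {
        Set<String> words = tokenize(paragraph).collect(Collectors.toSet());
        return tokenize(phrase).anyMatch(words::contains);
    }

    public static void main(String[] args) {
        String paragraph = "perennial nonallergic rhinitis in adult and pediatric patients";
        // Only "and" is shared, and it is a stop word, so this is no longer a match.
        System.out.println(sharesAnyWord(paragraph, "Hair and dander"));    // false
        // "rhinitis" is shared and is not a stop word.
        System.out.println(sharesAnyWord(paragraph, "Vasomotor rhinitis")); // true
    }
}
```

Without the stop-word filter, the first phrase would be flagged as a match purely because of "and".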

Upvotes: 2

M. Goodman

Reputation: 151

If you construct a list with the searchable words this would be a lot easier. Supposing your paragraph is stored as a String:

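    // Requires: import java.util.ArrayList;
    // Assumes the paragraph text is already held in a String named paragraph.
    String text = paragraph.toLowerCase();   // contains() is case-sensitive
    ArrayList<String> dictionary = new ArrayList<>();
    dictionary.add("acute lower respiratory infection");
    dictionary.add("rhinitis");
    for (int i = 0; i < dictionary.size(); i++) {
        if (text.contains(dictionary.get(i))) {
            System.out.println(i + " True");
        } else {
            System.out.println(i + " False");
        }
    }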

Upvotes: 0
