MatBanik

Reputation: 26870

Removing strings from another string in java

Let's say I have this list of words:

 String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};

Then I have some text:

 String text = "I would like to do a nice novel about nature AND people";

Is there a method out there that matches the stop words and removes them while ignoring case, something like this?

 String noStopWordsText = remove(text, stopWords);

Result:

 " would like do nice novel nature people"

If you know of a regex that would work, great, but I would really prefer something like a Commons solution that is a bit more performance oriented.

BTW, right now I'm using this Commons method, which lacks proper case-insensitive handling:

 private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
 private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};

 noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);     

Upvotes: 7

Views: 15432

Answers (4)

Theo

Reputation: 132972

This is a solution that does not use regular expressions. I think it's inferior to my other answer because it is much longer and less clear, but if performance is really, really important then this is O(n) where n is the length of the text.

import java.util.HashSet;
import java.util.Set;

// stop words are stored in lower case so the lookup can ignore case
Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...

String sampleText = "I would like to do a nice novel about nature AND people";
StringBuilder clean = new StringBuilder();
int index = 0;

while (index < sampleText.length()) {
  // the only word delimiter supported is space, if you want other
  // delimiters you have to do a series of indexOf calls and see which
  // one gives the smallest index, or use regex
  int nextIndex = sampleText.indexOf(" ", index);
  if (nextIndex == -1) {
    // no more spaces: the last word runs to the end of the text
    nextIndex = sampleText.length();
  }
  String word = sampleText.substring(index, nextIndex);
  if (!stopWords.contains(word.toLowerCase())) {
    clean.append(word);
    if (nextIndex < sampleText.length()) {
      // this adds the word delimiter, e.g. the following space
      clean.append(sampleText.substring(nextIndex, nextIndex + 1));
    }
  }
  index = nextIndex + 1;
}

System.out.println("Stop words removed: " + clean.toString());

Upvotes: 4

fastcodejava

Reputation: 41127

Split the text on whitespace. Then loop through the resulting array and append each word to a StringBuilder only if it is not one of the stop words.
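A minimal sketch of that idea, under a few assumptions of mine: the class and method names (SplitStopWords, removeStopWords) are made up for illustration, the stop words are kept lower-cased in a HashSet so each lookup is cheap, and runs of whitespace in the input collapse to a single space in the output.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SplitStopWords {
    // Hypothetical helper: stopWords is assumed to contain lower-case entries only.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringBuilder kept = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Skip the empty token produced by blank input, and any stop word (case ignored).
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(word);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        String text = "I would like to do a nice novel about nature AND people";
        // prints: would like do nice novel nature people
        System.out.println(removeStopWords(text, stopWords));
    }
}
```

Note that this only treats whitespace as a delimiter, so a stop word followed by punctuation (for example "and,") would not be removed.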

Upvotes: 1

Jigar Joshi

Reputation: 240996

You can build a regular expression that matches all the stop words (for example "a ", note the trailing space) and then call:

str.replaceAll(regexpression,"");

OR

 String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
 String text = " I would like to do a nice novel about nature AND people ";

 for (String stopword : stopWords) {
     text = text.replaceAll("(?i)" + stopword, " ");
 }
 System.out.println(text);

output:

 would like do nice novel nature people 

There might be a better way.

Upvotes: 5

Theo

Reputation: 132972

Create a regular expression with your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string:

import java.util.regex.*;

Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");

The ... in the pattern is just me being lazy; continue the list of stop words there.

Another method is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll compiles a new regular expression on each call, so it's not very efficient to use in loops. Also, String's replaceAll has no parameter for flags such as Pattern.CASE_INSENSITIVE, so the only way to ignore case there is to put an inline (?i) in the pattern itself.

Edit: I added \b around the pattern to make it match whole words only. I also added \s* so that it swallows any spaces after the word; that may not be necessary.
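If the stop words already live in an array, the pattern can be assembled from it instead of being typed out by hand. The following is only a sketch: the class and method names (StopWordRegex, compile) are hypothetical, the array is assumed to be non-empty, and Pattern.quote is used so that a stop word containing regex metacharacters is matched literally.

```java
import java.util.regex.Pattern;

class StopWordRegex {
    // Hypothetical helper: builds \b(?:w1|w2|...)\b\s* from the given words.
    // Assumes stopWords is non-empty; an empty alternation would match at every word boundary.
    static Pattern compile(String[] stopWords) {
        StringBuilder alternation = new StringBuilder();
        for (String word : stopWords) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(word));
        }
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        // Compile once, then reuse the Pattern for every text that needs cleaning.
        Pattern stopWords = compile(new String[]{"i", "a", "and", "about", "to"});
        String text = "I would like to do a nice novel about nature AND people";
        // prints: would like do nice novel nature people
        System.out.println(stopWords.matcher(text).replaceAll(""));
    }
}
```

Compiling the Pattern once and keeping it in a static final field avoids the repeated compilation that String's replaceAll would incur inside a loop.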

Upvotes: 17
