Reputation: 44978
How can I check that all the words from String #2 exist in String #1? The check should be case-insensitive, and I want to exclude all punctuation and special characters when comparing words.
Any help?
Thanks.
Upvotes: 0
Views: 1537
Reputation: 80176
While the algorithm for this is simple, the implementation is more involved if you want to support multiple locales. Below is sample code that does so. I've verified it with English as well as Chinese (but I am not sure whether it passes the Turkey Test ;-)). The code needs some refactoring, but it will get you started.
NOTE: Even if you don't want to support languages other than English, I would still use the code below, because word boundaries, punctuation, grammar, etc. are locale- and language-dependent, which might not be well addressed by StringTokenizer, String.split(...) and other basic APIs.
import java.text.BreakIterator;
import java.text.Collator;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import org.apache.commons.lang.StringEscapeUtils;
public class UnicodeWordCount
{
    public static void main(final String[] args)
    {
        testEnglish();
        testChinese();
    }

    public static void testEnglish()
    {
        BreakIterator wordIterator = BreakIterator.getWordInstance(Locale.ENGLISH);
        String str = "This is the source string";
        String match = "source string is this";
        String doesntMatch = "from Pangea";
        Set<String> uniqueWords = extractWords(str, wordIterator, Locale.ENGLISH);
        printWords(uniqueWords);
        System.out.println("Should print true: " + contains(match, wordIterator, uniqueWords));
        System.out.println("Should print false: " + contains(doesntMatch, wordIterator, uniqueWords));
    }

    public static void testChinese()
    {
        BreakIterator wordIterator = BreakIterator.getWordInstance(Locale.CHINESE);
        String str = "\u4E0D\u70BA\u6307\u800C\u8B02\u4E4B\u6307\uFF0C\u662F[\u7121\u90E8]\u70BA\u6307\u3002\u201D\u5176\u539F\u6587\u70BA";
        String match = "\u5176\u539F\u6587\u70BA\uFF0C\u70BA\u6307";
        String doesntMatch = "\u4E0D\u70BA\u6307\u800C\u8B02\u4E4B\u6307\uFF0C\u662F[\u517C\u4E0D]\u70BA\u6307\u3002";
        Set<String> uniqueWords = extractWords(str, wordIterator, Locale.CHINESE);
        printWords(uniqueWords);
        System.out.println("Should print true: " + contains(match, wordIterator, uniqueWords));
        System.out.println("Should print false: " + contains(doesntMatch, wordIterator, uniqueWords));
    }

    // Collects the distinct words of the input into a set ordered by a locale-aware collator.
    public static Set<String> extractWords(final String input, final BreakIterator wordIterator, final Locale desiredLocale)
    {
        Collator collator = Collator.getInstance(desiredLocale);
        // PRIMARY strength ignores case differences when comparing words.
        collator.setStrength(Collator.PRIMARY);
        // The TreeSet uses the collator as its comparator, so add() and contains() follow the same rules.
        Set<String> uniqueWords = new TreeSet<String>(collator);
        wordIterator.setText(input);
        int start = wordIterator.first();
        int end = wordIterator.next();
        while (end != BreakIterator.DONE)
        {
            String word = input.substring(start, end);
            // Skip segments that are whitespace or punctuation rather than words.
            if (Character.isLetterOrDigit(word.charAt(0)))
            {
                uniqueWords.add(word);
            }
            start = end;
            end = wordIterator.next();
        }
        return uniqueWords;
    }

    // Returns true only if every word of target is present in uniqueWords.
    public static boolean contains(final String target, final BreakIterator wordIterator, final Set<String> uniqueWords)
    {
        wordIterator.setText(target);
        int start = wordIterator.first();
        int end = wordIterator.next();
        while (end != BreakIterator.DONE)
        {
            String word = target.substring(start, end);
            if (Character.isLetterOrDigit(word.charAt(0)))
            {
                if (!uniqueWords.contains(word))
                {
                    return false;
                }
            }
            start = end;
            end = wordIterator.next();
        }
        return true;
    }

    // Prints each word with non-ASCII characters escaped, so the output is readable on any console.
    private static void printWords(final Set<String> uniqueWords)
    {
        for (String word : uniqueWords)
        {
            System.out.println(StringEscapeUtils.escapeJava(word));
        }
    }
}
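On the Turkey Test mentioned above: the usual pitfall is that case conversion depends on the locale. The following is only a small illustration of that effect, not a check of the code above; the class and method names are made up for this sketch.

```java
import java.util.Locale;

// Hypothetical demo class: shows that lowercasing depends on the locale.
final class TurkeyTestDemo {
    // Lowercases with an explicit locale so the result does not depend on the JVM default.
    static String lower(String text, Locale locale) {
        return text.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String rootResult = lower("TITLE", Locale.ROOT);
        String turkishResult = lower("TITLE", turkish);
        // In the Turkish locale, 'I' lowercases to U+0131 (dotless i), not to 'i',
        // so the two results differ and a naive equals() comparison fails.
        System.out.println(rootResult.equals(turkishResult));
    }
}
```

This is why comparing words through a Collator, or lowercasing with a fixed locale such as Locale.ROOT, tends to be safer than relying on the default locale.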
Upvotes: 0
Reputation:
isContainsAll(s1, s2)
1. Split s2 on spaces: s2.split(" ")
2. Check that s1 contains every element of s2
public static boolean isContainsAll(String s1, String s2) {
    String[] split = s2.split(" ");
    for (int i = 0; i < split.length; i++) {
        if (!s1.contains(split[i])) {
            return false;
        }
    }
    return true;
}

public static void main(String... args) {
    System.out.println(isContainsAll("asd dsasda das asd; asds asd;/ ", "asd;/"));
}
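One thing to watch with String.contains is that it matches substrings, so "his" is found inside "This", and it is case-sensitive. A minimal sketch of a variant that lowercases, replaces punctuation with spaces, and only accepts whole words might look like this. The class name, method names, and the regular expression are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical variant of the substring approach that only matches whole words.
final class WholeWordCheck {
    // Lowercases the text and turns every run of non-letter, non-digit characters into one space.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    static boolean containsAllWords(String s1, String s2) {
        // Pad with spaces so that every word in s1 is surrounded by delimiters.
        String haystack = " " + normalize(s1) + " ";
        for (String word : normalize(s2).split(" ")) {
            if (word.isEmpty()) {
                continue; // s2 had no words at all
            }
            if (!haystack.contains(" " + word + " ")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsAllWords("This is the source, string!", "Source STRING"));
        System.out.println(containsAllWords("This is", "his"));
    }
}
```

The padding trick keeps the code close to the original idea of a plain contains call while avoiding partial-word matches.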
Upvotes: 0
Reputation: 5650
You can try String's built-in split method.
Its signature is
public String[] split(String regex)
and it returns an array of Strings based on the regular expression you use. There are examples in the link above.
You can easily generate two arrays this way (one for String #1 and one for String #2).
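Sort the arrays and then check whether they are equal (same size, same order). Note that equality only tells you that both strings contain exactly the same words; to check that every word of String #2 occurs in String #1, look up each word of the second array in the first instead.
You can simplify array sorting if you use java.util.Arrays.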
Arrays in Java have a lot of library methods and you should learn about them because they are incredibly useful sometimes: http://leepoint.net/notes-java/data/arrays/arrays-library.html
This is slightly less efficient than building a dictionary/hash table/ADT with your selected delimiters (like in MattDiPasquale's answer), but it might be easier to understand if you are not very familiar with hash functions or dictionaries (as a datatype).
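A rough sketch of the split-and-sort idea, using only java.util.Arrays, could look like the following. The class name, method names, and the splitting regex are assumptions chosen for this example; binarySearch is used for the "all words of String #2 are in String #1" check, and Arrays.equals for the stricter "same words" check.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper illustrating the split + sort approach.
final class SortedWordCheck {
    // Lowercases, splits on runs of non-letter/non-digit characters, and sorts the result.
    static String[] sortedWords(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Arrays.sort(words);
        return words;
    }

    // True if every word of s2 also occurs in s1.
    static boolean containsAllWords(String s1, String s2) {
        String[] haystack = sortedWords(s1);
        for (String word : sortedWords(s2)) {
            if (word.isEmpty()) {
                continue; // split can yield an empty token for leading punctuation
            }
            if (Arrays.binarySearch(haystack, word) < 0) {
                return false;
            }
        }
        return true;
    }

    // True if both strings yield exactly the same sorted word arrays (duplicates included).
    static boolean sameWords(String s1, String s2) {
        return Arrays.equals(sortedWords(s1), sortedWords(s2));
    }
}
```

Sorting costs O(n log n), which is where the slight inefficiency compared with a hash-based lookup comes from.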
Upvotes: 0
Reputation: 103135
To find the words in a String while ignoring punctuation etc., you can use the StringTokenizer class.
StringTokenizer st = new StringTokenizer("Your sentence;with whatever. punctuations? might exist", " ;:?.,-+=[]");
This breaks up the String into tokens using the delimiters provided in the second argument. You can then use the hasMoreTokens() and nextToken() methods to iterate over the tokens.
Then you can use the algorithm suggested by @MattDiPasquale.
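As a minimal sketch of the iteration described above, the tokens can be collected into a list and lowercased along the way so that later comparisons are case-insensitive. The class and method names are invented for this example, and the delimiter string is only a sample; it has to list every punctuation character you expect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper showing the hasMoreTokens()/nextToken() loop.
final class TokenizerDemo {
    // Splits the sentence on any of the given delimiter characters and lowercases each token.
    static List<String> tokenize(String sentence, String delimiters) {
        StringTokenizer st = new StringTokenizer(sentence, delimiters);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Your sentence;with whatever. Punctuation?", " ;:?.,-+=[]"));
    }
}
```

Any character missing from the delimiter string stays attached to its word, which is the main drawback of this approach compared with a regex or BreakIterator.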
Upvotes: 0
Reputation: 126327
Running time: O(n)
I'll let someone else implement this in Java.
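For what it's worth, one possible Java take on an O(n) check is sketched below. It assumes the hash-based idea referred to elsewhere in this thread: put the words of String #1 into a hash set, then look up each word of String #2. The class name, method names, and the word pattern are assumptions for this example, not part of the answer above.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hash-set based implementation.
final class HashSetWordCheck {
    // A "word" is assumed to be a run of Unicode letters and digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Extracts the distinct lowercased words of the text.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        Matcher matcher = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    // True if every word of s2 also occurs in s1; expected linear time thanks to hashing.
    static boolean containsAllWords(String s1, String s2) {
        return words(s1).containsAll(words(s2));
    }

    public static void main(String[] args) {
        System.out.println(containsAllWords("This is the source string.", "Source, STRING is this!"));
        System.out.println(containsAllWords("This is the source string.", "from Pangea"));
    }
}
```

Lowercasing with Locale.ROOT is a simplification; for locale-aware comparison, the Collator-based answer in this thread is the more careful route.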
Upvotes: 7