Reputation: 34360
I have an array of strings that I load throughout my application, and it contains different words. I have a simple if statement to check whether a string contains letters or numbers, but it cannot tell whether the string is a real word.
I only want words like AB2CD5X, and I want to remove everything else, such as "Hello 3", "3 word", or any other token that is a word in English. Is it possible to keep only the alphanumeric words and exclude those that contain a real dictionary word?
I know how to check whether a string contains alphanumeric characters:
Pattern p = Pattern.compile("[\\p{Alnum},.']*");
I also know this check:
if (string.contains("[a-zA-Z]+") || string.contains("[0-9]+"))
Upvotes: 4
Views: 7936
Reputation: 544
You could try this: first tokenize the string using StringTokenizer with the default delimiters. For each token, if it contains only digits or only letters, discard it. The remaining tokens are the words that combine digits and letters. You can use regular expressions to identify the digits-only and letters-only tokens.
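The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name MixedTokenFilter and the method keepMixedTokens are made up for the example, and "letters" is assumed to mean ASCII a-z and A-Z.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MixedTokenFilter {

    // Hypothetical helper: drops tokens made only of digits or only of ASCII letters.
    static List<String> keepMixedTokens(String text) {
        List<String> kept = new ArrayList<>();
        // The default StringTokenizer delimiters are whitespace characters.
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            boolean onlyDigits = token.matches("[0-9]+");
            boolean onlyLetters = token.matches("[a-zA-Z]+");
            if (!onlyDigits && !onlyLetters) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepMixedTokens("Hello 3 AB2CD5X word 42")); // prints [AB2CD5X]
    }
}
```

Note that this rule also keeps tokens that carry punctuation, such as "word,", because they are neither digits-only nor letters-only. Stripping punctuation first would be a separate step.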
Upvotes: 0
Reputation: 1016
if (string.contains("[a-zA-Z]+") || string.contains("[0-9]+"))
I think this is a good starting point, but since you're looking for strings that contain both letters and numbers, you want && rather than ||. Also note that String.contains() looks for a literal substring and does not interpret regular expressions, so the check needs String.matches() instead:
if (string.matches(".*[a-zA-Z]+.*") && string.matches(".*[0-9]+.*"))
You might also want to check for spaces, because a space could indicate that there are separate words or a sequence like "3 word". So in the end you could use:
if (string.matches(".*[a-zA-Z]+.*") && string.matches(".*[0-9]+.*") && !string.contains(" "))
Hope this helps
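As a variation on the same idea, the three conditions could be folded into one precompiled pattern using lookaheads. This is a sketch under the assumption that only ASCII letters and digits are allowed in a valid token; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

public class AlphaNumericCheck {

    // At least one letter, at least one digit, and nothing except ASCII letters and digits.
    private static final Pattern MIXED =
            Pattern.compile("(?=.*[a-zA-Z])(?=.*[0-9])[a-zA-Z0-9]+");

    // Hypothetical helper: matches() requires the whole string to fit the pattern,
    // so any space or punctuation makes the check fail.
    static boolean isMixedAlphaNumeric(String s) {
        return MIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AB2CD5X", "Hello 3", "3 word", "Hello", "12345"}) {
            System.out.println(s + " -> " + isMixedAlphaNumeric(s));
        }
    }
}
```

Only AB2CD5X passes here. The other inputs fail either because they contain a space or because they lack a letter or a digit.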
Upvotes: 0
Reputation: 30995
You can use the Cambridge Dictionaries API to check whether a token is a real word. If a token turns out to be a valid dictionary word, you can skip it.
As the documentation says, to use the library, you need to initialize a request handler and an API object:
DefaultHttpClient httpClient = new DefaultHttpClient(new ThreadSafeClientConnManager());
SkPublishAPI api = new SkPublishAPI(baseUrl + "/api/v1", accessKey, httpClient);
api.setRequestHandler(new SkPublishAPI.RequestHandler() {
    public void prepareGetRequest(HttpGet request) {
        System.out.println(request.getURI());
        request.setHeader("Accept", "application/json");
    }
});
To use the "api" object:
try {
    System.out.println("*** Dictionaries");
    JSONArray dictionaries = new JSONArray(api.getDictionaries());
    System.out.println(dictionaries);

    JSONObject dict = dictionaries.getJSONObject(0);
    System.out.println(dict);
    String dictCode = dict.getString("dictionaryCode");

    System.out.println("*** Search");
    System.out.println("*** Result list");
    JSONObject results = new JSONObject(api.search(dictCode, "ca", 1, 1));
    System.out.println(results);

    System.out.println("*** Spell checking");
    JSONObject spellResults = new JSONObject(api.didYouMean(dictCode, "dorg", 3));
    System.out.println(spellResults);

    System.out.println("*** Best matching");
    JSONObject bestMatch = new JSONObject(api.searchFirst(dictCode, "ca", "html"));
    System.out.println(bestMatch);

    System.out.println("*** Nearby Entries");
    JSONObject nearbyEntries = new JSONObject(api.getNearbyEntries(dictCode,
            bestMatch.getString("entryId"), 3));
    System.out.println(nearbyEntries);
} catch (Exception e) {
    e.printStackTrace();
}
Upvotes: 1
Reputation: 596
What you need is a dictionary of English words. Then you scan your input and check whether each token exists in that dictionary. You can find text files of dictionary entries online, such as those in the Jazzy spellchecker. You might also check Dictionary text file.
Here is sample code that assumes your dictionary is a simple UTF-8 text file with exactly one (lower case) word per line:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class DictionaryFilter {

    public static void main(String[] args) throws IOException {
        final Set<String> dictionary = loadDictionary();
        final String text = loadInput();
        final List<String> output = new ArrayList<>();
        // Scanner splits on whitespace by default
        final Scanner scanner = new Scanner(text);
        while (scanner.hasNext()) {
            final String token = scanner.next().toLowerCase();
            // keep only the tokens that are not dictionary words
            if (!dictionary.contains(token)) output.add(token);
        }
        System.out.println(output);
    }

    private static String loadInput() {
        return "This is a 5gse5qs sample f5qzd fbswx test";
    }

    private static Set<String> loadDictionary() throws IOException {
        final File dicFile = new File("path_to_your_flat_dic_file");
        final Set<String> dictionaryWords = new HashSet<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(dicFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) dictionaryWords.add(line);
        }
        return dictionaryWords;
    }
}
If you need more accurate results, you need to extract the stems of your words, so that inflected forms such as plurals still match their dictionary entry. See Apache's Lucene and EnglishStemmer.
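To illustrate why stems matter, here is a minimal sketch that falls back to a stemmed form when the exact token is not in the dictionary. The crudeStem method is a deliberately naive, hypothetical stand-in written for this example; it is not the Lucene or Snowball EnglishStemmer, and a real stemmer handles far more cases. The tiny in-memory dictionary is also just an assumption for the demo.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StemLookupSketch {

    // Hypothetical, very crude suffix stripper (NOT a real English stemmer).
    static String crudeStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            // Require at least three characters to remain after stripping.
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // A token counts as a known word if it, or its crude stem, is in the dictionary.
    static boolean isKnownWord(String token, Set<String> dictionary) {
        String lower = token.toLowerCase(Locale.ROOT);
        return dictionary.contains(lower) || dictionary.contains(crudeStem(lower));
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; a real one would be loaded from a file.
        Set<String> dictionary = new HashSet<>(Arrays.asList("test", "sample", "word"));
        for (String token : new String[] {"tests", "Words", "tested", "f5qzd"}) {
            System.out.println(token + " -> " + isKnownWord(token, dictionary));
        }
    }
}
```

With this toy dictionary, "tests", "Words" and "tested" are recognized through their stems, while "f5qzd" is not and would be kept as a non-dictionary token.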
Upvotes: 5
Reputation: 303
ANTLR might help you. ANTLR stands for ANother Tool for Language Recognition.
Hibernate uses ANTLR to parse its query language, HQL (keywords such as SELECT and FROM).
Upvotes: 0