Reputation: 13
I need a TokenCleaner method for the WordCount project that I am working on. A token is a sequence of characters surrounded by whitespace, usually a word, that needs to be "cleaned" of any punctuation and capitalization. I have a template for it, but I'm not sure how to implement it or where to start.
public class TokenCleaner
{
    public static void main(String[] args)
    {
        String[] tokens = {"That's","empty-handed?","42","...idk...","\"quote\""};
        for(int i = 0; i < tokens.length; i++)
        {
            System.out.println("Original:\t"+tokens[i]);
            System.out.println("Cleaned:\t"+cleanToken(tokens[i]));
        }
    }
    private static String cleanToken(String token)
    {
        /** remove leading special characters and numbers **/
        // while the token's length is greater than zero AND the first character isn't a letter
        // remove the first character from the token
        /** remove trailing special characters and numbers **/
        // while the token's length is greater than zero AND the last character isn't a letter
        // remove the last character from the token
        // return a lowercase version of the token
        /** Note: It is possible for the cleaned token to be an empty String if the given token
        consisted of only non-letter characters */
        return null; // placeholder return statement
    }
}
Can someone please help?
Thank you
Upvotes: 0
Views: 233
Reputation: 881
I suggest going through the token character by character: if a character is one you want to delete, skip it; otherwise lowercase it and keep it. For instance:
// needs "import java.util.ArrayList;" at the top of the file
private static String cleanToken(String token) {
    // list that collects the characters of the cleaned token
    ArrayList<String> newToken = new ArrayList<String>();
    // list of the characters you want to delete
    ArrayList<Character> toDelete = new ArrayList<Character>();
    toDelete.add('@'); // add every character you want to delete
    // walk through the token one character at a time
    for (int i = 0; i < token.length(); i++) {
        char c = token.charAt(i);
        if (toDelete.contains(c)) {
            // skip it: nothing is added, so the character is dropped
        }
        else {
            // lowercase it and keep it
            newToken.add(String.valueOf(c).toLowerCase());
        }
    }
    // now merge all elements of the list into one String
    String result = "";
    for (String t : newToken) {
        result = result + t;
    }
    return result;
}
Upvotes: 0
Reputation: 719307
I am not sure how to do or start it.
You can implement this by pattern matching. Start by reading the javadocs for Pattern (which implements Java regexes) and the String.replaceAll method.
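As a rough sketch of the pattern-matching route, a regex anchored to both ends of the token might look like the following. The class name RegexTokenCleaner and the constant EDGE_NON_LETTERS are made up for illustration, and the sketch assumes "letter" means any Unicode letter, which the question does not spell out.

```java
import java.util.regex.Pattern;

class RegexTokenCleaner {
    // One or more non-letters at the very start, or one or more at the very end.
    // \P{L} matches any character that is not a Unicode letter (an assumption
    // about what the assignment counts as a letter).
    private static final Pattern EDGE_NON_LETTERS = Pattern.compile("^\\P{L}+|\\P{L}+$");

    static String cleanToken(String token) {
        // Strip the matched runs at both ends, then lowercase what is left.
        return EDGE_NON_LETTERS.matcher(token).replaceAll("").toLowerCase();
    }

    public static void main(String[] args) {
        String[] tokens = {"That's", "empty-handed?", "42", "...idk...", "\"quote\""};
        for (String token : tokens) {
            System.out.println(token + "\t-> [" + cleanToken(token) + "]");
        }
    }
}
```

Because the pattern is anchored with ^ and $, punctuation inside the token (the apostrophe in "That's", the hyphen in "empty-handed?") is left alone, and a token such as "42" comes back as an empty String.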
Alternatively, you can create a new (empty) StringBuilder, then loop over the characters in the original string, copying the characters that you want to keep into the StringBuilder. When you are finished, create a String from the StringBuilder.
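One possible shape for the StringBuilder route is sketched below. It assumes the characters "you want to keep" are everything between the first and the last letter of the token, which is how the template in the question describes it; the class name BuilderTokenCleaner is hypothetical.

```java
class BuilderTokenCleaner {
    static String cleanToken(String token) {
        // Skip leading non-letters: start ends up at the first letter (or at length).
        int start = 0;
        while (start < token.length() && !Character.isLetter(token.charAt(start))) {
            start++;
        }
        // Skip trailing non-letters: end ends up just past the last letter.
        int end = token.length();
        while (end > start && !Character.isLetter(token.charAt(end - 1))) {
            end--;
        }
        // Copy the characters to keep into a new, initially empty StringBuilder,
        // lowercasing each one on the way.
        StringBuilder cleaned = new StringBuilder();
        for (int i = start; i < end; i++) {
            cleaned.append(Character.toLowerCase(token.charAt(i)));
        }
        // Create the final String from the StringBuilder.
        return cleaned.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"That's", "empty-handed?", "42", "...idk...", "\"quote\""};
        for (String token : tokens) {
            System.out.println(token + "\t-> [" + cleanToken(token) + "]");
        }
    }
}
```

If the token contains no letters at all, start reaches the end of the string, the copy loop never runs, and the method returns an empty String.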
I am not going to give you links to the relevant javadocs. Finding them, searching them, and reading / understanding them are skills that you need to learn.
Upvotes: 1
Reputation: 161
I'm not sure whether this meets the requirement above, but you could write the method like this:
private static String cleanToken(String token)
{
return token.replaceAll("\\P{L}", "").toLowerCase();
}
Note that this removes digits and special characters everywhere in the token, not only at the start and end.
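To make that difference concrete, here is a small demo that runs the same one-liner over the sample tokens from the question. The wrapper class ReplaceAllDemo is only there for illustration.

```java
class ReplaceAllDemo {
    static String cleanToken(String token) {
        // \P{L} matches every character that is not a letter, wherever it occurs.
        return token.replaceAll("\\P{L}", "").toLowerCase();
    }

    public static void main(String[] args) {
        String[] tokens = {"That's", "empty-handed?", "42", "...idk...", "\"quote\""};
        for (String token : tokens) {
            System.out.println(token + " -> " + cleanToken(token));
        }
        // Prints:
        // That's -> thats
        // empty-handed? -> emptyhanded
        // 42 ->
        // ...idk... -> idk
        // "quote" -> quote
    }
}
```

The apostrophe in "That's" and the hyphen in "empty-handed?" are dropped as well, which may or may not be acceptable for the word count.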
Do let me know if this helps.
Upvotes: 1