Reputation: 2535
I am building a natural language processing application in Java, using data from IMDB and Amazon.
I came across a dataset that contains words like partyyyyy. Such words hurt my classification algorithm, so I want to replace partyyyyy with party.
How can I do that?
Upvotes: 0
Views: 9304
Reputation: 124295
You can use a regex to find any letter that is followed by the same letter at least two more times. Requiring three in a row means correct double letters, like the m in comma, are left alone.
String data="stoooooop partyyyyyy";
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
// ([a-zA-Z]) : group 1, a single letter
// \\1{2,}    : the match from group 1, repeated twice or more
// "$1"       : replace with the match from group 1
Output:
stop party
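If the same cleanup has to run over many reviews, the pattern can be compiled once and reused. The following is only a minimal sketch of that idea; the class name RepeatedLetterCleaner and the method clean are hypothetical names chosen for illustration, not part of the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex above; the names are illustrative only.
class RepeatedLetterCleaner {
    // A letter followed by two or more copies of itself (three or more in a row).
    private static final Pattern TRIPLED_LETTER = Pattern.compile("([a-zA-Z])\\1{2,}");

    // Replaces each run of three or more identical letters with a single letter.
    static String clean(String text) {
        return TRIPLED_LETTER.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("stoooooop partyyyyyy")); // stop party
        System.out.println(clean("comma stoop"));          // comma stoop (double letters untouched)
    }
}
```

The match is case-sensitive, so a run such as "Ooo" is treated as two different letters; lowercasing the text first is one way around that, if it suits the classifier.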
Upvotes: 10
Reputation: 5231
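You can use this snippet; it is quite a fast implementation. Note that it reduces every run of identical characters to a single character, so legitimate double letters are collapsed too.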
public static String removeConsecutiveChars(String str) {
    if (str == null) {
        return null;
    }
    int strLen = str.length();
    if (strLen <= 1) {
        return str;
    }
    char[] strChar = str.toCharArray();
    char temp = strChar[0]; // character of the run currently being scanned
    StringBuilder stringBuilder = new StringBuilder(strLen);
    for (int i = 1; i < strLen; i++) {
        char val = strChar[i];
        if (val != temp) {
            // The run has ended: emit its character once and start a new run.
            stringBuilder.append(temp);
            temp = val;
        }
    }
    stringBuilder.append(temp); // emit the final run
    return stringBuilder.toString();
}
Upvotes: 0
Reputation: 252
You may wish to use \p{L}\p{M}* instead of [a-zA-Z] so that non-English Unicode letters, including letters followed by combining marks, are handled as well. It would look like this: replaceAll("(\\p{L}\\p{M}*)(\\1{" + maxAllowedRepetition + ",})", "$1");
or, equivalently, without the extra capturing group: replaceAll("(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}", "$1");
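As a rough, self-contained sketch of the Unicode-aware variant: the class UnicodeRepeatCollapser and the method collapse are hypothetical names, and maxAllowedRepetition is assumed to be a positive integer. With a value of 2, a letter followed by two or more copies of itself (a run of three or more) is reduced to one. The non-ASCII text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the \p{L}\p{M}* variant; names are illustrative only.
class UnicodeRepeatCollapser {
    // Collapses a letter (plus any combining marks) that is followed by at least
    // maxAllowedRepetition copies of itself down to a single occurrence.
    static String collapse(String text, int maxAllowedRepetition) {
        Pattern pattern =
                Pattern.compile("(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}");
        return pattern.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Plain ASCII behaves like the [a-zA-Z] version.
        System.out.println(collapse("stoooop", 2)); // stop

        // 's', then u-umlaut four times, then sharp s: the umlaut run is reduced to one.
        String elongated = "s\u00fc\u00fc\u00fc\u00fc\u00df";
        System.out.println(collapse(elongated, 2).equals("s\u00fc\u00df")); // true

        // 'e' + combining acute accent, three times: treated as one repeated unit.
        String combining = "cafe\u0301e\u0301e\u0301";
        System.out.println(collapse(combining, 2).equals("cafe\u0301")); // true
    }
}
```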
Upvotes: 0
Reputation: 1608
There are no English words that I know of that have more than two consecutive identical letters.
This approach would not catch:
"partyy"
"stoop" (which is also ambiguous: is that "stop" with an extra "o", or simply "stoop"?)
Upvotes: 2
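One possible way to handle cases like "partyy" is to check candidates against a word list. This is a swapped-in idea, not something proposed in the answers here, and it is only a sketch: the class ElongationNormalizer, the method normalize, and the tiny VOCABULARY set are all hypothetical, and a real application would load a proper dictionary. It assumes lowercase input and Java 9 or later (for Set.of).

```java
import java.util.Set;

// Hypothetical dictionary-backed normalizer; all names and the word list are illustrative.
class ElongationNormalizer {
    // Stand-in vocabulary; assumed to be replaced by a real word list.
    private static final Set<String> VOCABULARY = Set.of("party", "stop", "stoop", "comma");

    static String normalize(String word) {
        if (VOCABULARY.contains(word)) {
            return word; // already a known word, e.g. "stoop"
        }
        // First try capping every run of three or more letters at two.
        String capped = word.replaceAll("([a-zA-Z])\\1{2,}", "$1$1");
        if (VOCABULARY.contains(capped)) {
            return capped;
        }
        // Then try collapsing every run of repeated letters to one.
        String single = word.replaceAll("([a-zA-Z])\\1+", "$1");
        if (VOCABULARY.contains(single)) {
            return single;
        }
        // Unknown either way: fall back to the capped form.
        return capped;
    }

    public static void main(String[] args) {
        System.out.println(normalize("partyy"));    // party
        System.out.println(normalize("partyyyyy")); // party
        System.out.println(normalize("commmma"));   // comma
        System.out.println(normalize("stoooop"));   // stoop
    }
}
```

The last call shows that the ambiguity does not go away: "stoooop" resolves to "stoop" here only because the capped form is tried first, and a word list alone cannot tell whether "stop" was intended.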
Reputation: 21981
Try using a loop:
String word = "Stoooppppd";
StringBuilder res = new StringBuilder();
char first = word.charAt(0);
res.append(first);
for (int i = 1; i < word.length(); i++) {
    char ch = word.charAt(i);
    if (ch != first) {
        res.append(ch);
    }
    first = ch;
}
System.out.println(res);
Upvotes: 1