Mr_Hmp

Reputation: 2535

Replace multiple consecutive occurrences of a character with a single occurrence

I am building a natural language processing application in Java, using data from IMDB and Amazon.

I came across a dataset that contains elongated words like partyyyyy. These words hurt my classification algorithm, so I want to replace them with their normal form, e.g. party instead of partyyyyy.

How can I do that?

Upvotes: 0

Views: 9304

Answers (5)

Pshemo

Reputation: 124295

You can use a regex to find any letter that is followed by the same letter at least two more times (we require at least two repeats so that legitimate double letters, like the mm in comma, are left alone):

String data="stoooooop partyyyyyy";
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
//                                       |      |         |
//                                   group 1   match    replace with 
//                                             from     match from group 1
//                                             group 1
//                                             repeated 
//                                           twice or more

Output:

stop party
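If the replacement runs over a whole dataset, the same regex can be compiled once and reused. The following is only a sketch under that assumption; the class and method names (ElongationNormalizer, normalize) are made up for illustration and are not part of the answer above.

```java
import java.util.regex.Pattern;

class ElongationNormalizer {
    // Same regex as above: one ASCII letter followed by at least two more copies of itself.
    private static final Pattern ELONGATED = Pattern.compile("([a-zA-Z])\\1{2,}");

    // Hypothetical helper: collapses every run of three or more identical letters to one letter.
    static String normalize(String text) {
        return ELONGATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("stoooooop partyyyyyy")); // stop party
        System.out.println(normalize("comma"));                // comma (double m is kept)
        System.out.println(normalize("coool"));                // col (not cool)
    }
}
```

The last call shows the limit of a pure regex approach: it cannot know that the intended word was cool, which is where a dictionary check would help.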

Upvotes: 10

Ertuğrul Çetin

Reputation: 5231

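You can use this snippet; it is a quite fast implementation. Note that it collapses every run of repeated characters to a single character, so legitimate double letters (such as the mm in comma) are reduced as well.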

public static String removeConsecutiveChars(String str) {
    if (str == null) {
        return null;
    }

    int strLen = str.length();
    if (strLen <= 1) {
        return str;
    }

    char[] strChar = str.toCharArray();
    char temp = strChar[0]; // character of the run currently being scanned

    StringBuilder stringBuilder = new StringBuilder(strLen);
    for (int i = 1; i < strLen; i++) {
        char val = strChar[i];
        if (val != temp) {
            // the run ended: emit one copy of it and start the next run
            stringBuilder.append(temp);
            temp = val;
        }
    }
    stringBuilder.append(temp); // emit the final run

    return stringBuilder.toString();
}
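A loop like this can also be given a threshold, so that only runs of a minimum length are collapsed and shorter runs are copied as they are. This is a rough sketch of that variant, not the code from this answer; the names RunCollapser, collapseLongRuns and minRun are assumptions chosen for the example.

```java
class RunCollapser {
    // Collapses each run of at least minRun identical characters to one character.
    // Shorter runs are copied unchanged. Note: this treats every char alike,
    // so runs of punctuation or spaces are collapsed too.
    static String collapseLongRuns(String str, int minRun) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        StringBuilder out = new StringBuilder(str.length());
        int i = 0;
        while (i < str.length()) {
            char c = str.charAt(i);
            int j = i;
            // advance j to the first index after the current run
            while (j < str.length() && str.charAt(j) == c) {
                j++;
            }
            if (j - i >= minRun) {
                out.append(c);         // long run: keep a single copy
            } else {
                out.append(str, i, j); // short run: keep it as is
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseLongRuns("stoooooop partyyyyyy comma", 3)); // stop party comma
    }
}
```

With minRun set to 3 the double m in comma survives, which matches the behaviour of the regex-based answer for letters.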

Upvotes: 0

Ahmet Noyan Kızıltan

Reputation: 252

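You may wish to use \p{L}\p{M}* instead of [a-zA-Z] so that non-English Unicode letters (together with any combining marks that follow them) are matched as well. The call then looks like this:

replaceAll("(\\p{L}\\p{M}*)(\\1{" + maxAllowedRepetition + ",})", "$1");

or, without the redundant second capturing group:

replaceAll("(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}", "$1");

Here maxAllowedRepetition is the minimum number of extra repeats that triggers a replacement, so a value of 2 collapses runs of three or more.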
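As a small illustration of the Unicode-aware pattern, the sketch below applies it to a word containing ü. The class name, method name and the parameter name minExtraRepeats are invented for this example; the string literals use \u00fc escapes for ü so the source stays ASCII-only.

```java
import java.util.regex.Pattern;

class UnicodeElongationNormalizer {
    // minExtraRepeats: how many additional copies must follow the first letter
    // before the run is collapsed (2 means "three or more in a row").
    static String normalize(String text, int minExtraRepeats) {
        Pattern p = Pattern.compile("(\\p{L}\\p{M}*)\\1{" + minExtraRepeats + ",}");
        return p.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Input is "s" + four times u-umlaut + "per", followed by "partyyyy".
        String cleaned = normalize("s\u00fc\u00fc\u00fc\u00fcper partyyyy", 2);
        System.out.println(cleaned.equals("s\u00fcper party")); // true
    }
}
```

The backreference is still case-sensitive, so a run that mixes upper and lower case is not treated as one run.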

Upvotes: 0

supergra

Reputation: 1608

There are no English words that I know of that have more than two consecutive identical letters.

  1. Iterate over all words
  2. If the word has more than two consecutive identical letters, then:
    • Remove all but two of the duplicate letters, and see if a valid word is formed.
    • Otherwise, remove all but one duplicate letter, and see if a valid word is formed.
    • Otherwise, fail.

This approach would not catch:

  • partyy (only two consecutive identical letters, so it is never flagged)

  • stoop (which is also ambiguous: is that "stop" with an extra "o", or simply "stoop"?)
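The steps above could be sketched roughly as follows. This assumes a word list is already available as a Set<String>; the tiny dictionary in main, the class name ElongatedWordFixer and the method name normalize are placeholders for illustration only.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Pattern;

class ElongatedWordFixer {
    // A letter followed by at least two more copies of itself (three or more in a row).
    private static final Pattern RUN = Pattern.compile("([a-zA-Z])\\1{2,}");

    // Returns the repaired word, or Optional.empty() if neither candidate is in the dictionary.
    static Optional<String> normalize(String word, Set<String> dictionary) {
        if (!RUN.matcher(word).find()) {
            return Optional.of(word); // no run of three or more: leave the word alone
        }
        // First try: keep two of the repeated letters.
        String twoLeft = RUN.matcher(word).replaceAll("$1$1");
        if (dictionary.contains(twoLeft)) {
            return Optional.of(twoLeft);
        }
        // Second try: keep only one of the repeated letters.
        String oneLeft = RUN.matcher(word).replaceAll("$1");
        if (dictionary.contains(oneLeft)) {
            return Optional.of(oneLeft);
        }
        return Optional.empty(); // fail
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("party", "stop", "stoop", "good"); // placeholder word list
        System.out.println(normalize("partyyyyy", dictionary)); // Optional[party]
        System.out.println(normalize("gooood", dictionary));    // Optional[good]
        System.out.println(normalize("stooop", dictionary));    // Optional[stoop]
        System.out.println(normalize("xyzzzz", dictionary));    // Optional.empty
    }
}
```

The sketch treats all long runs in a word the same way, so a word with two elongated runs that need different treatment would fail. The stooop case also shows the ambiguity: because the two-letter form is tried first, it resolves to stoop rather than stop.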

Upvotes: 2

Masudul

Reputation: 21981

Try using a loop:

String word = "Stoooppppd";
StringBuilder res = new StringBuilder();
char first = word.charAt(0); // holds the previously seen character
res.append(first);
for (int i = 1; i < word.length(); i++) {
    char ch = word.charAt(i);
    if (ch != first) {
        res.append(ch);
    }
    first = ch;
}
System.out.println(res); // prints Stopd

Upvotes: 1
