Dibakar

Reputation: 159

Replacing consecutive repeated characters in java

I am working on Twitter data normalization. Twitter users frequently write things like "I looooooove it" in order to emphasize the word love. I want to reduce such elongated words to a proper English word by removing repeated characters until I get a meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).

My strategy would be

  1. Identify such repeated strings. I would look for more than two identical consecutive characters, as there is probably no English word with more than two repeated characters in a row.

    String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
    
    String regex = "([a-z])\\1{2,}";
    Pattern pattern = Pattern.compile(regex);
    
    for (String string : strings) {
         Matcher matcher = pattern.matcher(string);
         if (matcher.find()) {
             System.out.println(string+" TRUE ");
         }
    }
    
  2. Search for such words in a lexicon like WordNet.

  3. Replace all but two of the repeated characters and check the lexicon again.
  4. If the word is not in the lexicon, remove one more repeated character; if it is still not found, treat it as a misspelling.
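Steps 2 to 4 could be sketched roughly as follows. The `LEXICON` set is a hypothetical stand-in for a real dictionary lookup such as WordNet, and the class and method names are made up for illustration. The sketch shortens every repeated run at once, which is a simplifying assumption rather than part of the strategy above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ElongationNormalizer {
    // Hypothetical stand-in for a real lexicon lookup (e.g. WordNet).
    static final Set<String> LEXICON =
            new HashSet<>(Arrays.asList("stop", "love", "good", "clap", "boolean"));

    // Returns a lexicon word if one is reached, otherwise null (treat as a misspelling).
    static String normalize(String word) {
        if (LEXICON.contains(word)) {
            return word; // step 2: already a known word
        }
        // Step 3: keep at most two of each repeated character.
        String twoLeft = word.replaceAll("([a-z])\\1{2,}", "$1$1");
        if (LEXICON.contains(twoLeft)) {
            return twoLeft;
        }
        // Step 4: remove one more character from every doubled run.
        String oneLeft = twoLeft.replaceAll("([a-z])\\1+", "$1");
        return LEXICON.contains(oneLeft) ? oneLeft : null;
    }

    public static void main(String[] args) {
        String[] words = { "stoooooopppppppppppppppppp", "looooooove", "good", "claaap", "mee" };
        for (String w : words) {
            System.out.println(w + " -> " + normalize(w));
        }
    }
}
```

Because all runs are shortened together, a word that needs one run doubled and another single (for example "gooodddd" for "good") is not recovered by this sketch; trying each run separately would be needed for that.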

Due to my limited Java knowledge I am unable to manage steps 3 and 4. The problem is that I cannot replace all but two of the repeated consecutive characters. The following snippet replaces all but one of the repeated characters: System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));

I need help with: A. how to replace all but two of the consecutive repeated characters; B. how to remove one more consecutive character from the output of A. [I think B can be managed by the following snippet.]

System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));

Edit: The solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in Python, which uses re.sub.

Upvotes: 3

Views: 8184

Answers (2)

Wiktor Stribiżew

Reputation: 627607

Your regex ([a-z])\\1{2,} captures a lowercase ASCII letter into Group 1 and then matches two or more further occurrences of that same letter. So all you need is to replace each match with a backreference, $1, that holds the captured value. If you use a single $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.

String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");

See the Java demo.
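A small self-contained program along these lines shows both replacements on the sample strings from the question. The class and method names are illustrative only.

```java
public class RepeatDemo {
    // Same pattern as in the question: a letter followed by 2+ copies of itself.
    static final String REGEX = "([a-z])\\1{2,}";

    // Runs of three or more become exactly two characters.
    static String keepTwo(String s) {
        return s.replaceAll(REGEX, "$1$1");
    }

    // Runs of three or more become a single character.
    static String keepOne(String s) {
        return s.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        String[] strings = { "stoooooopppppppppppppppppp", "looooooove", "good",
                             "OK", "boolean", "mee", "claaap" };
        for (String s : strings) {
            System.out.println(s + " -> " + keepTwo(s) + " / " + keepOne(s));
        }
    }
}
```

Since the pattern only matches runs of three or more, words such as good, boolean and mee pass through both calls unchanged.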

If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
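The three variants can be compared with a short sketch like the one below (names are illustrative). One difference worth knowing: with (?i) the backreference is also matched case-insensitively, so mixed-case runs are collapsed, whereas \p{Alpha} and \p{L} without (?i) only collapse runs of the exact same character.

```java
public class CaseVariants {
    // Case-insensitive: the backreference ignores case as well.
    static String ci(String s) {
        return s.replaceAll("(?i)([a-z])\\1{2,}", "$1$1");
    }

    // ASCII letters of either case; the run must repeat the exact same character.
    static String alpha(String s) {
        return s.replaceAll("(\\p{Alpha})\\1{2,}", "$1$1");
    }

    // Any Unicode letter; the run must repeat the exact same character.
    static String unicode(String s) {
        return s.replaceAll("(\\p{L})\\1{2,}", "$1$1");
    }

    public static void main(String[] args) {
        System.out.println(ci("LOOOVE"));     // uppercase run
        System.out.println(ci("LoOoOve"));    // mixed-case run
        System.out.println(alpha("LoOoOve")); // mixed case is not a run here
        System.out.println(unicode("s\u00fc\u00fc\u00fc\u00df")); // non-ASCII letters
    }
}
```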

BONUS: In the general case, to collapse any run of repeated consecutive characters to a single character, use

text = text.replaceAll("(?s)(.)\\1+", "$1");   // any chars
text = text.replaceAll("(.)\\1+", "$1");       // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1");  // any letters
text = text.replaceAll("(\\w)\\1+", "$1");     // any ASCII alnum + _ chars
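To see how the four patterns differ, they can be run against one input that mixes letters, digits, underscores, punctuation and line breaks. This is only an illustrative sketch; the class and method names are not from the answer.

```java
public class CollapseVariants {
    static String anyChar(String s) {
        return s.replaceAll("(?s)(.)\\1+", "$1");   // includes line breaks
    }

    static String anyButLineBreaks(String s) {
        return s.replaceAll("(.)\\1+", "$1");       // '.' skips line terminators
    }

    static String letters(String s) {
        return s.replaceAll("(\\p{L})\\1+", "$1");  // letters only
    }

    static String wordChars(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");     // ASCII letters, digits, underscore
    }

    public static void main(String[] args) {
        String text = "aa11__!!\n\nbb";
        // Escape the line breaks so each result prints on a single line.
        System.out.println(anyChar(text).replace("\n", "\\n"));
        System.out.println(anyButLineBreaks(text).replace("\n", "\\n"));
        System.out.println(letters(text).replace("\n", "\\n"));
        System.out.println(wordChars(text).replace("\n", "\\n"));
    }
}
```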

Upvotes: 3

Muchu Chinna

Reputation: 1

/* Prints each character that occurs at least 3 times in a row.
   To require 4 in a row, set RUN_LENGTH to 4;
   to require 2 in a row, set RUN_LENGTH to 2. */
public class Test1 {
    private static final int RUN_LENGTH = 3;

    public static void main(String[] args) {
        String str = "aabbbbccc";
        int run = 1; // length of the current run of identical characters
        for (int i = 1; i < str.length(); i++) {
            if (str.charAt(i) == str.charAt(i - 1)) {
                run++;
                // report once, exactly when the run reaches the threshold
                if (run == RUN_LENGTH) {
                    System.out.println(str.charAt(i));
                }
            } else {
                run = 1; // a different character starts a new run
            }
        }
    }
}
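The same loop idea can be turned from detection into replacement. The following is a minimal sketch, not part of the answer above, of a hypothetical `squeeze` helper that keeps at most `maxRun` consecutive copies of any character without using a regex.

```java
public class RunSqueezer {
    // Copies s, keeping at most maxRun consecutive copies of any character.
    static String squeeze(String s, int maxRun) {
        StringBuilder out = new StringBuilder(s.length());
        int run = 0; // length of the current run, including the current character
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            run = (i > 0 && c == s.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRun) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(squeeze("looooooove", 2));
        System.out.println(squeeze("aabbbbccc", 2));
        System.out.println(squeeze("aabbbbccc", 1));
    }
}
```

Unlike the regex with {2,}, calling this with maxRun set to 1 also collapses runs of exactly two characters, so good would become god.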

Upvotes: 0
