Reputation: 63

Delete repeated characters in string column in pandas?

For example, in the row A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R, “,a” and “,u” are repeats.

The wanted output is A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R

I am finding it hard to remove the repeated values in the 'info' column. I need to delete the comma and the character that follows it.

This is the dataframe:

import pandas as pd

df = pd.DataFrame([['A','B','C','A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R','F','G'],
                   ['h','k','J','AB=k;AC=5,5;AD=E;AF=W;AG=y,y;AH=Q','L','M'],
                   ['O','P','Q','AC=k;AD=e;AE=E;AF=W;AG=y;AH=Q;AK=R','S','T'],
                   ['U','V','W','AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R','Y','Z'],
                   ['U','V','W','AC=a;AD=b;AE=r,r;AF=y;AG=Q;AH=R','Y','Z']],
                  columns=['Col1','Col2','Col3','info','col4','col5'])

The resulting dataframe is shown in the diagram.

For example, in the 'info' column, "AC=a,a" has a repeated "a". We need to delete the second "a", and therefore the comma too. In the same column there is "AF=u,u": the "u" character is also repeated, so we need to remove "u" and its comma. In the next row we see "AC=5,5;AD=E;AF=W;AG=y,y", where two more characters, 5 and y, are repeated along with their commas.

This diagram shows the wanted result.

So how can I get the final result?

Upvotes: 2

Answers (3)

Andrei Odegov

Reputation: 3429

A very simple solution based on a list comprehension and the split and join string methods.

df['info'] = [';'.join(e.split(',')[0] for e in d.split(';')) for d in df['info']]
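
To see why this works, here is a small walk-through of the same expression applied to one string from the question. The helper name keep_first_value is made up for illustration, and the sketch assumes every field should collapse to whatever comes before its first comma.

```python
def keep_first_value(info):
    # Split into fields on ';', keep only the text before the first ','
    # in each field, then glue the fields back together with ';'.
    return ';'.join(field.split(',')[0] for field in info.split(';'))

row = 'A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R'
print(row.split(';'))         # ['A', 'AC=a,a', 'AD=E', 'AE=W', 'AF=u,u', 'AG=Q', 'AH=R']
print(keep_first_value(row))  # A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R
```

Note that this drops everything after the first comma in a field, so a field such as AC=a,b would become AC=a even though nothing is repeated.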

Upvotes: 0

Quang Hoang

Reputation: 150735

You can try a regex backreference:

# \1 refers to the previously captured group
pattern = r'([^=,]+),(\1)'
# if you have more than two instances, e.g. a,a,a, use
# pattern = r'([^=,]+)(,\1)+'

# regex=True is required in pandas 2.0+, where str.replace matches literally by default
df['info'] = df['info'].str.replace(pattern, r'\1', regex=True)

Output:

  Col1 Col2 Col3                                info col4 col5
0    A    B    C     A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R    F    G
1    h    k    J       AB=k;AC=5;AD=E;AF=W;AG=y;AH=Q    L    M
2    O    P    Q  AC=k;AD=e;AE=E;AF=W;AG=y;AH=Q;AK=R    S    T
3    U    V    W       AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R    Y    Z
4    U    V    W       AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R    Y    Z
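
As a rough check of the backreference idea outside pandas, the sketch below runs a pattern with a repeated group through Python's re.sub on plain strings. The helper name collapse_repeats and the sample string with three repeats are assumptions for illustration, not taken from the question.

```python
import re

# ([^=,]+) captures a value; (,\1)+ matches one or more ",<same value>" after it.
pattern = re.compile(r'([^=,]+)(,\1)+')

def collapse_repeats(info):
    # Replace the whole run "v,v,...,v" with a single "v".
    return pattern.sub(r'\1', info)

print(collapse_repeats('AC=a,a,a;AD=E'))                      # AC=a;AD=E
print(collapse_repeats('AB=k;AC=5,5;AD=E;AF=W;AG=y,y;AH=Q'))  # AB=k;AC=5;AD=E;AF=W;AG=y;AH=Q
```

The pattern does not check where a value ends, so a value that merely starts with the previous one (for example a,ab) would also be shortened; the sketch assumes repeats are exact copies.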

Upvotes: 3

bsauce

Reputation: 672

re.sub will also do the trick.

import re
df['info'] = [re.sub(r'(.),\1', r'\1', x) for x in df['info']]
df

In this expression, (.) captures any single character as a group, then comes a comma, and then \1 refers to that same captured character again. So each match is replaced with just the single character that was captured.
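
To illustrate what (.) does and does not cover, the snippet below applies the same substitution to two strings. The helper name dedupe_single_chars and the multi-character sample are assumptions added for illustration; the question's data only contains single-character values.

```python
import re

def dedupe_single_chars(info):
    # (.) captures one character; ",\1" requires a comma followed by that same character.
    return re.sub(r'(.),\1', r'\1', info)

print(dedupe_single_chars('AC=a;AD=b;AE=r,r;AF=y'))  # AC=a;AD=b;AE=r;AF=y
print(dedupe_single_chars('AC=ab,ab;AD=E'))          # AC=ab,ab;AD=E (unchanged)
```

Because (.) matches exactly one character, repeated values longer than one character are left alone. If that matters for your data, a wider capture such as ([^=,;]+) might be one way to handle it.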

Output

  Col1 Col2 Col3                                info col4 col5
0    A    B    C     A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R    F    G
1    h    k    J       AB=k;AC=5;AD=E;AF=W;AG=y;AH=Q    L    M
2    O    P    Q  AC=k;AD=e;AE=E;AF=W;AG=y;AH=Q;AK=R    S    T
3    U    V    W       AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R    Y    Z
4    U    V    W       AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R    Y    Z

Upvotes: 1
