Reputation: 63
For example, in the row A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R, the values “,a” and “,u” are repeated.
The desired output is A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R
It is quite hard to correct the repeated words in the 'info' column. I need to delete the comma and following character.
This is the dataframe:
import pandas as pd
df = pd.DataFrame([['A','B','C','A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R','F','G'],
                   ['h','k','J','AB=k;AC=5,5;AD=E;AF=W;AG=y,y;AH=Q','L','M'],
                   ['O','P','Q','AC=k;AD=e;AE=E;AF=W;AG=y;AH=Q;AK=R','S','T'],
                   ['U','V','W','AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R','Y','Z'],
                   ['U','V','W','AC=a;AD=b;AE=r,r;AF=y;AG=Q;AH=R','Y','Z']], columns=['Col1','Col2','Col3','info','col4','col5'])
This produces the dataframe shown in the diagram.
For example, in the 'info' column of the first row, "AC=a,a" has a repeated "a". We need to delete the second "a", and therefore its comma too. In the same row, "AF=u,u" has a repeated "u", so we need to remove that "u" and its comma as well. In the next row, "AC=5,5;AD=E;AF=W;AG=y,y" has two more repeated characters, "5" and "y", each with its comma.
This diagram shows the desired result.
So how can I get this final result?
Upvotes: 2
Views: 342
Reputation: 3429
A very simple solution based on a list comprehension and the split and join functions.
df['info'] = [';'.join(e.split(',')[0] for e in d.split(';')) for d in df['info']]
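Note that the one-liner above keeps only the first value of every field, so a field with genuinely different values such as AC=a,b would be reduced to AC=a. If that matters, a minimal sketch along the following lines drops only the repeated values. The helper names `dedupe_field` and `dedupe_info` and the extra sample string are hypothetical, introduced here for illustration.

```python
def dedupe_field(field: str) -> str:
    """Drop repeated comma-separated values in one 'KEY=v1,v2' field, keeping order."""
    key, sep, values = field.partition('=')
    if not sep:
        # Flag-style field without '=', e.g. the leading 'A': leave it untouched.
        return field
    # dict.fromkeys keeps the first occurrence of each value, in order.
    unique = dict.fromkeys(values.split(','))
    return key + '=' + ','.join(unique)


def dedupe_info(info: str) -> str:
    """Apply dedupe_field to every ';'-separated field of one 'info' string."""
    return ';'.join(dedupe_field(f) for f in info.split(';'))


rows = [
    'A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R',
    'AB=k;AC=5,5;AD=E;AF=W;AG=y,y;AH=Q',
    'AC=a,b,a;AD=E',  # hypothetical row with distinct values in one field
]
for row in rows:
    print(dedupe_info(row))
```

On the dataframe this would be applied as df['info'] = df['info'].map(dedupe_info).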
Upvotes: 0
Reputation: 150735
You can try a regex backreference:
# \1 refers to the previously captured group
pattern = r'([^=,]+),(\1)'
# if a value can repeat more than twice, e.g. a,a,a,
# use this instead:
# pattern = r'([^=,]+)(?:,\1)+'
# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
df['info'] = df['info'].str.replace(pattern, r'\1', regex=True)
Output:
Col1 Col2 Col3 info col4 col5
0 A B C A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R F G
1 h k J AB=k;AC=5;AD=E;AF=W;AG=y;AH=Q L M
2 O P Q AC=k;AD=e;AE=E;AF=W;AG=y;AH=Q;AK=R S T
3 U V W AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R Y Z
4 U V W AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R Y Z
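The difference between collapsing a single repeat and collapsing a run of repeats (the a,a,a case mentioned in the comments) can be checked on plain strings with the standard re module. This is a minimal sketch: the variable names `pair` and `run` are made up for illustration, and the `(?:,\1)+` form is one possible way to handle runs, not necessarily the one the answer intended.

```python
import re

# Matches exactly one repeat: value, comma, same value.
pair = r'([^=,]+),(\1)'
# Assumed variant: value followed by one or more ",value" repeats.
run = r'([^=,]+)(?:,\1)+'

print(re.sub(pair, r'\1', 'AC=a,a;AD=E'))    # AC=a;AD=E
print(re.sub(pair, r'\1', 'AC=a,a,a;AD=E'))  # AC=a,a;AD=E  (one repeat is left over)
print(re.sub(run, r'\1', 'AC=a,a,a;AD=E'))   # AC=a;AD=E
```

Because re.sub scans left to right without revisiting replaced text, the pair pattern consumes the first "a,a" and leaves the third "a" behind, which is why a quantified group is needed for longer runs.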
Upvotes: 3
Reputation: 672
re.sub will also do the trick.
import re
df['info'] = [re.sub(r'(.),\1', r'\1', x) for x in df['info']]
df
In this expression, (.) captures any single character, the comma matches literally, and \1 matches that same captured character again. Each match of the form "x,x" is replaced with the captured character alone. Because (.) captures exactly one character, this only handles single-character values.
Output
Col1 Col2 Col3 info col4 col5
0 A B C A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R F G
1 h k J AB=k;AC=5;AD=E;AF=W;AG=y;AH=Q L M
2 O P Q AC=k;AD=e;AE=E;AF=W;AG=y;AH=Q;AK=R S T
3 U V W AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R Y Z
4 U V W AC=a;AD=b;AE=r;AF=y;AG=Q;AH=R Y Z
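As a quick check of how far the single-character pattern goes, the sketch below runs it on a row from the question and on a made-up multi-character value. The `bounded` pattern is a hypothetical stricter alternative, not part of the answer: it assumes values never contain '=', ',' or ';', and uses lookarounds so that only whole values are compared.

```python
import re

single_char = r'(.),\1'
# Hypothetical stricter pattern: a whole value (preceded by '=' or ','),
# followed by one or more exact repeats, ending at ',', ';' or end of string.
bounded = r'(?<![^=,])([^=,;]+)(?:,\1)+(?![^,;])'

row = 'A;AC=a,a;AD=E;AE=W;AF=u,u;AG=Q;AH=R'
print(re.sub(single_char, r'\1', row))  # A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R
print(re.sub(bounded, r'\1', row))      # A;AC=a;AD=E;AE=W;AF=u;AG=Q;AH=R

multi = 'AF=0.500,0.500'                # made-up multi-character value
print(re.sub(single_char, r'\1', multi))  # AF=0.500.500  (only "0,0" is collapsed)
print(re.sub(bounded, r'\1', multi))      # AF=0.500

partial = 'AC=a,ab'                     # distinct values sharing a prefix
print(re.sub(bounded, r'\1', partial))  # AC=a,ab  (left unchanged)
```

For data like the sample dataframe, where every value is a single character, both patterns give the same result.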
Upvotes: 1