Remove consecutive duplicate entries in each cell of a pandas DataFrame

Question

I have a DataFrame that looks like this:

import pandas as pd

d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

Expected output:

d = {'col1': ['a,b', 'a,c,b'], 'col2': ['a,b', 'a,b,a']}

I have tried this on a plain list:

from itertools import groupby

arr = ['a', 'a', 'b', 'a', 'a', 'c', 'c']
print([x[0] for x in groupby(arr)])

How do I remove consecutive duplicate entries in every cell of the DataFrame, across all rows and columns?

For example, a,a,b,c should become a,b,c.

anky · Accepted Answer

From what I understand, you want to drop values that repeat consecutively within each cell. You can try this custom function:

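def myfunc(x):
    # Split the cell into tokens
    s = pd.Series(x.split(','))
    # Keep a token only if it differs from the one before it
    res = s[s.ne(s.shift())]
    return ','.join(res.values)

# On pandas 2.1+, applymap is deprecated in favor of df.map(myfunc)
print(df.applymap(myfunc))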

    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a
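To see why the shift comparison works, it can help to look at the intermediate values for a single cell. The following is a minimal sketch, not part of the original answer; the names `previous` and `keep` are made up for illustration.

```python
import pandas as pd

# One cell value from the question, split into a Series of tokens.
s = pd.Series("a,c,c,b".split(","))

# shift() moves every value down one position; the first slot becomes NaN.
previous = s.shift()

# ne() is an element-wise "not equal": True where a token differs from its predecessor.
keep = s.ne(previous)

print(keep.tolist())      # [True, True, False, True]
print(",".join(s[keep]))  # a,c,b
```

The first token is always kept because it is compared against NaN, which never equals anything.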

Another option is to build the function with itertools.groupby:

from itertools import groupby

def myfunc(x):
    # groupby yields one key per run of equal adjacent tokens
    tokens = [key for key, _ in groupby(x.split(','))]
    return ','.join(tokens)
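As a quick check of the groupby approach against the data from the question, here is a small standalone sketch that uses plain Python only. The function name `dedupe_consecutive` and the `sep` parameter are assumptions added for illustration, not names from the answer.

```python
from itertools import groupby

def dedupe_consecutive(cell, sep=","):
    """Collapse runs of identical adjacent tokens in a delimited string."""
    # groupby yields one (key, group) pair per run of equal adjacent items.
    return sep.join(key for key, _ in groupby(cell.split(sep)))

# Data from the question, kept as a plain dict so no DataFrame is needed.
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
result = {col: [dedupe_consecutive(v) for v in values] for col, values in d.items()}

print(result)  # {'col1': ['a,b', 'a,c,b'], 'col2': ['a,b', 'a,b,a']}
```

Because groupby only merges adjacent items, 'a,b,b,a' becomes 'a,b,a' rather than 'a,b'. To apply the same function to a DataFrame, something like `df.apply(lambda col: col.map(dedupe_consecutive))` should work without relying on `applymap`.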
