Joe121

Reputation: 62

Remove consecutive duplicate entries in each cell of a pandas DataFrame

I have a DataFrame that looks like this:

import pandas as pd

d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

Expected output:

d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}

I have tried this on a plain list:

from itertools import groupby

arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])

How do I remove the consecutive duplicate entries within each cell, for every row and column of the DataFrame?

For example, a,a,b,c should become a,b,c.

Upvotes: 3

Views: 163

Answers (2)

anky

Reputation: 75080

From what I understand, you want to drop values that repeat consecutively. You can try this custom function:

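def myfunc(x):
    s = pd.Series(x.split(','))   # one element per comma-separated token
    res = s[s.ne(s.shift())]      # keep a token only if it differs from the previous one
    return ','.join(res.values)

# applymap calls myfunc on every cell (pandas 2.1+ deprecates it in favor of DataFrame.map)
print(df.applymap(myfunc))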

    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a

Alternatively, the function can be written with itertools.groupby:

from itertools import groupby

def myfunc(x):
    # groupby yields one (key, group) pair per run of equal neighbours
    keys = [key for key, _ in groupby(x.split(','))]
    return ','.join(keys)
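
For reference, here is a self-contained sketch of the groupby approach applied to the frame from the question. The helper name dedupe_consecutive is hypothetical, and df.apply with Series.map stands in for applymap on the assumption that it behaves the same way across pandas versions.

```python
from itertools import groupby

import pandas as pd


def dedupe_consecutive(cell):
    """Collapse runs of equal comma-separated tokens, e.g. 'a,a,b' -> 'a,b'."""
    # groupby yields one (key, group) pair per run of equal neighbours
    return ','.join(key for key, _ in groupby(cell.split(',')))


d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

# Series.map applies the helper to every cell of one column;
# DataFrame.apply repeats that for each column.
result = df.apply(lambda col: col.map(dedupe_consecutive))
print(result.to_dict('list'))
# {'col1': ['a,b', 'a,c,b'], 'col2': ['a,b', 'a,b,a']}
```

Only adjacent repeats are collapsed, so 'a,b,b,a' keeps its trailing 'a'.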

Upvotes: 1

Dan

Reputation: 1587

You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):

d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

def remove_dups(string):
    split = string.split(',')  # split string into a list
    uniques = set(split)       # drop every repeated element (a set does not preserve order)
    return ','.join(uniques)   # rejoin the remaining elements into a string

result = df.applymap(remove_dups)

This returns (the order of elements within a cell may vary, since sets are unordered):

    col1 col2
0    a,b  a,b
1  a,c,b  a,b

Edit: This differs slightly from your expected output. Why do you expect a,b,a for the second row in col2?

Edit2: To preserve the original order, you can replace set() with unique_everseen():

from more_itertools import unique_everseen

. . .

uniques = unique_everseen(split)
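
To make the difference from the expected output concrete, here is a small standard-library sketch. It contrasts removing every repeated element with removing only consecutive repeats. dict.fromkeys is used here as a stand-in for an order-preserving "unique" (it is not the more_itertools call above), and both helper names are hypothetical.

```python
from itertools import groupby


def dedupe_all(cell):
    # dict.fromkeys keeps the first occurrence of each token, in original order
    return ','.join(dict.fromkeys(cell.split(',')))


def dedupe_runs(cell):
    # groupby collapses only adjacent repeats, so a later 'a' survives
    return ','.join(key for key, _ in groupby(cell.split(',')))


print(dedupe_all('a,b,b,a'))   # a,b
print(dedupe_runs('a,b,b,a'))  # a,b,a
```

The question's expected a,b,a for 'a,b,b,a' matches the second behaviour: only repeats that sit next to each other are dropped.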

Upvotes: 0
