Reputation: 8554

Python Pandas: How to remove duplicate values within strings in a column

I have a table like this:

col1 col2
ben US-US-Uk
Man Uk-NL-DE
bee CA-CO-MX-MX

How can I remove the duplicate values within each string in col2, so that the table looks like this?

col1 col2
ben US-Uk
Man Uk-NL-DE
bee CA-CO-MX

I have tried this:

a.cc.str.split('-').unique()

but I get the following error:

TypeError: unhashable type: 'list'

Does anybody know how to do this?

Upvotes: 1

Answers (3)

Jonathan Eunice

Reputation: 22463

I like @EdChum's answer, but the way it reorders the values is disconcerting: it can make both visual inspection and mechanical comparison more difficult.

Unfortunately, Python has no built-in ordered set, which would be the perfect tool here. So a small helper that remembers what it has already seen does the job:

def unique(items):
    """
    Return unique items in a list, in the same order they were
    originally.
    """
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result

df.col2 = df.col2.apply(lambda x: '-'.join(unique(x.split('-'))))

An alternative way of creating an ordered set is with OrderedDict:

from collections import OrderedDict

def u2(items):
    od = OrderedDict.fromkeys(items)
    return list(od.keys())

You can then use u2 instead of unique. Either way, the results are:

  col1      col2
0  ben     US-Uk
1  Man  Uk-NL-DE
2  bee  CA-CO-MX
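
As a further variation, here is a minimal sketch that assumes Python 3.7 or later, where plain dicts keep insertion order, so OrderedDict is not strictly needed. The names unique_ordered and dedupe_joined are hypothetical helpers, not part of pandas or the standard library:

```python
def unique_ordered(items):
    """Return the distinct items, keeping first-seen order."""
    # Assumption: Python 3.7+, where dict keys preserve insertion order.
    return list(dict.fromkeys(items))


def dedupe_joined(value, sep='-'):
    """Drop repeated tokens from a sep-delimited string (hypothetical helper)."""
    return sep.join(unique_ordered(value.split(sep)))


# The col2 values from the question, as a plain list for illustration.
rows = ['US-US-Uk', 'Uk-NL-DE', 'CA-CO-MX-MX']
print([dedupe_joined(v) for v in rows])
# ['US-Uk', 'Uk-NL-DE', 'CA-CO-MX']
```

With a DataFrame, the same helper should slot into apply in the usual way, for example df['col2'] = df['col2'].apply(dedupe_joined).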

Upvotes: 2

EdChum

Reputation: 394099

You can use apply to call a lambda function that splits the string and then joins on the unique values:

In [10]:

df['col2'] = df['col2'].apply(lambda x: '-'.join(set(x.split('-'))))
df
Out[10]:
  col1      col2
0  ben     Uk-US
1  Man  Uk-NL-DE
2  bee  CA-CO-MX

Another method:

In [22]:

df['col2'].str.split('-').apply(lambda x: '-'.join(set(x)))

Out[22]:
0       Uk-US
1    Uk-NL-DE
2    CA-CO-MX
Name: col2, dtype: object

Timings:

In [24]:

%timeit df['col2'].str.split('-').apply(lambda x: '-'.join(set(x)))
%timeit df['col2'] = df['col2'].apply(lambda x: '-'.join(set(x.split('-'))))
1000 loops, best of 3: 418 µs per loop
1000 loops, best of 3: 246 µs per loop
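
One caveat with the set-based approach: a set has no defined order, so the joined string may come out in a different order than the input, and for strings the order can even vary between interpreter runs. If a reproducible result matters more than the original order, sorting the set is one option. This is only a sketch, and dedupe_sorted is a hypothetical name:

```python
def dedupe_sorted(value, sep='-'):
    """Drop repeated tokens and return them in sorted order (hypothetical helper)."""
    # sorted() makes the output deterministic, at the cost of
    # discarding the order the tokens originally appeared in.
    return sep.join(sorted(set(value.split(sep))))


print(dedupe_sorted('US-US-Uk'))     # US-Uk
print(dedupe_sorted('Uk-NL-DE'))     # DE-NL-Uk
print(dedupe_sorted('CA-CO-MX-MX'))  # CA-CO-MX
```

It can be used in place of the lambda, for example df['col2'].apply(dedupe_sorted).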

Upvotes: 2

Saikiran Yerram

Reputation: 3092

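Try this (note that a set does not preserve the original order of the values):

col2 = 'CA-CO-MX-MX'
# print() with parentheses works in both Python 2 and Python 3
print('-'.join(set(col2.split('-'))))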

Upvotes: 1
