Dogukan Yılmaz
Dogukan Yılmaz

Reputation: 556

Removing duplicate elements within a pandas cell and counting the number of elements

I have a dataframe like this:

Destinations
Paris,Oslo, Paris,Milan, Athens,Amsterdam
Boston,New York, Boston,London, Paris,New York
Nice,Paris, Milan,Paris, Nice,Milan

I want to get the following dataframe (without space between the cities):

Destinations_2                         no_destinations
Paris,Oslo,Milan,Athens,Amsterdam      5
Boston,New York,London,Paris           4
Nice,Paris,Milan                       3

How to remove duplicates within a cell?

Upvotes: 2

Views: 59

Answers (4)

steven
steven

Reputation: 2519

# your data:
import pandas as pd

data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
                         'Boston,New York, Boston,London, Paris,New York',
                         'Nice,Paris, Milan,Paris, Nice,Milan']}
df = pd.DataFrame(data)

>>>
                                      Destinations
0        Paris,Oslo, Paris,Milan, Athens,Amsterdam
1   Boston,New York, Boston,London, Paris,New York
2              Nice,Paris, Milan,Paris, Nice,Milan

First: make every row of your column a list.

df.Destinations = df.Destinations.apply(lambda x: x.replace(', ', ',').split(','))

>>>
                                           Destinations
0        [Paris, Oslo, Paris, Milan, Athens, Amsterdam]
1   [Boston, New York, Boston, London, Paris, New York]
2              [Nice, Paris, Milan, Paris, Nice, Milan]

Second: removes dups from the lists

df.Destinations = df.Destinations.apply(lambda x: list(dict.fromkeys(x)))

# or: df.Destinations = df.Destinations.apply(lambda x: list(set(x)))

>>>
                               Destinations
0   [Paris, Oslo, Milan, Athens, Amsterdam]
1         [Boston, New York, London, Paris]
2                      [Nice, Paris, Milan]

Finally, create desired columns

df['no_destinations'] = df.Destinations.apply(lambda x: len(x))

df['Destinations_2'] = df.Destinations.apply(lambda x: ','.join(x))

All steps use the apply and lambda functions, you can chain or nest them together if you want

Upvotes: 1

Chankey Pathak
Chankey Pathak

Reputation: 21676

All the previous answers have addressed only one part of your problem i.e. to show the unique count (no_destinations). Let me try to answer both of your queries.

The idea below is to apply a method on the Destinations column which returns 2 series named Destinations_2 and no_destinations which contain unique elements separated by comma with no space, and count of unique elements, respectively.

import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam', 
                'Boston,New York, Boston,London, Paris,New York',
                'Nice,Paris, Milan,Paris, Nice,Milan'
    ]}

def remove_dups(x):
    data = set(x.replace(" ", "").split(','))
    return pd.Series([','.join(data),len(data)], index=['Destinations_2', 'no_destinations'])

df = pd.DataFrame.from_dict(data)
df[['Destinations_2', 'no_destinations']] = df['Destinations'].apply(remove_dups)
print(df.head())

Output:

enter image description here

Note: As you are not concerned with the order, I have used set above. If you need to maintain the order, you will have to replace set with some other logic to remove duplicates.

Upvotes: 0

iamklaus
iamklaus

Reputation: 3770

df['no_destinations'] = df.Destinations.str.split(',').apply(set).apply(len)

if there are spaces in between use

df.Destinations.str.split(',').apply(lambda x: list(map(str.strip,x))).apply(set).apply(len)

Output

                                     Destinations  nodestinations
0       Paris,Oslo, Paris,Milan, Athens,Amsterdam               5
1  Boston,New York, Boston,London, Paris,New York               4
2             Nice,Paris, Milan,Paris, Nice,Milan               3

Upvotes: 1

anky
anky

Reputation: 75080

You can use a list comprehension which is faster than using apply() (replace Col with the original column name) :

df['no_destinations']=[len(set([a.strip() for a in i.split(',')])) for i in df['Col']]
print(df)

                                              Col  no_destinations
0       Paris,Oslo, Paris,Milan, Athens,Amsterdam                5
1  Boston,New York, Boston,London, Paris,New York                4
2             Nice,Paris, Milan,Paris, Nice,Milan                3

Upvotes: 3

Related Questions