Reputation: 556
I have a dataframe like this:
Destinations
Paris,Oslo, Paris,Milan, Athens,Amsterdam
Boston,New York, Boston,London, Paris,New York
Nice,Paris, Milan,Paris, Nice,Milan
I want to get the following dataframe (without space between the cities):
Destinations_2 no_destinations
Paris,Oslo,Milan,Athens,Amsterdam 5
Boston,New York,London,Paris 4
Nice,Paris,Milan 3
How to remove duplicates within a cell?
Upvotes: 2
Views: 59
Reputation: 2519
# your data:
import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
'Boston,New York, Boston,London, Paris,New York',
'Nice,Paris, Milan,Paris, Nice,Milan']}
df = pd.DataFrame(data)
>>>
Destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam
1 Boston,New York, Boston,London, Paris,New York
2 Nice,Paris, Milan,Paris, Nice,Milan
First: make every row of your column a list.
df.Destinations = df.Destinations.apply(lambda x: x.replace(', ', ',').split(','))
>>>
Destinations
0 [Paris, Oslo, Paris, Milan, Athens, Amsterdam]
1 [Boston, New York, Boston, London, Paris, New York]
2 [Nice, Paris, Milan, Paris, Nice, Milan]
Second: removes dups from the lists
df.Destinations = df.Destinations.apply(lambda x: list(dict.fromkeys(x)))
# or: df.Destinations = df.Destinations.apply(lambda x: list(set(x)))
>>>
Destinations
0 [Paris, Oslo, Milan, Athens, Amsterdam]
1 [Boston, New York, London, Paris]
2 [Nice, Paris, Milan]
Finally, create desired columns
df['no_destinations'] = df.Destinations.apply(lambda x: len(x))
df['Destinations_2'] = df.Destinations.apply(lambda x: ','.join(x))
All steps use the apply
and lambda
functions, you can chain or nest them together if you want
Upvotes: 1
Reputation: 21676
All the previous answers have addressed only one part of your problem i.e. to show the unique count (no_destinations
). Let me try to answer both of your queries.
The idea below is to apply a method on the Destinations
column which returns 2 series named Destinations_2
and no_destinations
which contain unique elements separated by comma with no space, and count of unique elements, respectively.
import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
'Boston,New York, Boston,London, Paris,New York',
'Nice,Paris, Milan,Paris, Nice,Milan'
]}
def remove_dups(x):
data = set(x.replace(" ", "").split(','))
return pd.Series([','.join(data),len(data)], index=['Destinations_2', 'no_destinations'])
df = pd.DataFrame.from_dict(data)
df[['Destinations_2', 'no_destinations']] = df['Destinations'].apply(remove_dups)
print(df.head())
Output:
Note: As you are not concerned with the order, I have used set
above. If you need to maintain the order, you will have to replace set
with some other logic to remove duplicates.
Upvotes: 0
Reputation: 3770
df['no_destinations'] = df.Destinations.str.split(',').apply(set).apply(len)
if there are spaces in between use
df.Destinations.str.split(',').apply(lambda x: list(map(str.strip,x))).apply(set).apply(len)
Output
Destinations nodestinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
Upvotes: 1
Reputation: 75080
You can use a list comprehension which is faster than using apply()
(replace Col
with the original column name) :
df['no_destinations']=[len(set([a.strip() for a in i.split(',')])) for i in df['Col']]
print(df)
Col no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
Upvotes: 3