Reputation: 41
Here is a pandas dataframe:
dt name type City
05-10-2021 MK [PQRRC, MNXYZ, AYPIC, KLUYT, GFTBE, BYPAC] NYC
05-10-2021 MK [GFTBE, AYPIC, MNXYZ, BYPAC, KLUYT, PQRRC] NYC
05-12-2021 MK [KLUYT, PQRRC, BYPAC, AYPIC, GFTBE, MNXYZ] NYC
05-12-2021 MK [BYPAC, KLUYT, GFTBE, AYPIC, MNXYZ, PQRRC] NYC
05-13-2021 PS [XYDFE, QRTSL, CPQLE, VXWUT, ORSHC, LTRDX] BAL
05-13-2021 PS [VXWUT, ORSHC, QRTSL, XYDFE, LTRDX, CPQLE] BAL
.... Please note that the list of values in column type for each column name is the same but not sorted in alphabetical order.
I want the output as below: sort the column type and find the distinct dt, name, type, City.
dt name type City
05-10-2021 MK [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC] NYC
05-12-2021 MK [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC] NYC
05-13-2021 PS [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE] BAL
I tried using sort_values, sorted, drop_duplicates; not working. May be I made some mistakes. Its dropping some names altogether, meaning missing some names when using drop_duplicates(). Can someone help me? Thank you.
Upvotes: 0
Views: 2235
Reputation: 23217
If you want to sort the lists in column type
and remove the duplicates checked based on other columns, you can use numpy.sort()
to sort the list, and then use .drop_duplicates()
to check duplicates on other columns:
Using numpy.sort() is more performance efficient than similar Python processing since numpy modules are optimized for system performance and run faster for Pandas and numpy lists/arrays.
import numpy as np
# in case your column "type" is of string type, run one of the following line (depending on your string list layout):
# use this for string list layout e.g. "['GFTBE', 'AYPIC', 'MNXYZ', 'BYPAC', 'KLUYT', 'PQRRC']"
df['type'] = df['type'].str.strip("[]").str.replace("'", "").str.split(', ')
#df['type'] = df['type'].map(eval) # for general use to convert string like a list to a real list
#df['type'] = df['type'].str.strip('[]').str.split(',') # for use when no extra spaces and extra single quotes
df['type'] = df['type'].map(np.sort).map(list) # convert the sorted numpy array to Python list to avoid incorrect formatting (e.g. missing comma) in writing to CSV
df = df.drop_duplicates(subset=['dt', 'name', 'City'])
Result:
print(df)
dt name type City
0 05-10-2021 MK [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC] NYC
2 05-12-2021 MK [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC] NYC
4 05-13-2021 PS [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE] BAL
Upvotes: 3
Reputation: 697
Try the below:
df["type"] = df["type"].apply(lambda x: sorted(list(x)))
this assume that all the values of the column 'type' are lists
Upvotes: 0