Pandas Dataframe - sort a list of values in each row of a column

Question

Here is a pandas dataframe:

dt          name  type                                         City                            
05-10-2021  MK    [PQRRC, MNXYZ, AYPIC, KLUYT, GFTBE, BYPAC]   NYC
05-10-2021  MK    [GFTBE, AYPIC, MNXYZ, BYPAC, KLUYT, PQRRC]   NYC
05-12-2021  MK    [KLUYT, PQRRC, BYPAC, AYPIC, GFTBE, MNXYZ]   NYC
05-12-2021  MK    [BYPAC, KLUYT, GFTBE, AYPIC, MNXYZ, PQRRC]   NYC
05-13-2021  PS    [XYDFE, QRTSL, CPQLE, VXWUT, ORSHC, LTRDX]   BAL
05-13-2021  PS    [VXWUT, ORSHC, QRTSL, XYDFE, LTRDX, CPQLE]   BAL

.... Please note that the list of values in column type for each column name is the same but not sorted in alphabetical order.

I want the output as below: sort the column type and find the distinct dt, name, type, City.

dt          name  type                                         City                            
05-10-2021  MK    [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]   NYC
05-12-2021  MK    [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]   NYC
05-13-2021  PS    [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE]   BAL

I tried using sort_values, sorted, drop_duplicates; not working. May be I made some mistakes. Its dropping some names altogether, meaning missing some names when using drop_duplicates(). Can someone help me? Thank you.

SeaBean · Accepted Answer

If you want to sort the lists in column type and remove the duplicates checked based on other columns, you can use numpy.sort() to sort the list, and then use .drop_duplicates() to check duplicates on other columns:

Using numpy.sort() is more performance efficient than similar Python processing since numpy modules are optimized for system performance and run faster for Pandas and numpy lists/arrays.

import numpy as np

# in case your column "type" is of string type, run one of the following line (depending on your string list layout):
# use this for string list layout e.g. "['GFTBE', 'AYPIC', 'MNXYZ', 'BYPAC', 'KLUYT', 'PQRRC']"
df['type'] = df['type'].str.strip("[]").str.replace("'", "").str.split(', ')   
#df['type'] = df['type'].map(eval)    # for general use to convert string like a list to a real list
#df['type'] = df['type'].str.strip('[]').str.split(',')  # for use when no extra spaces and extra single quotes  


df['type'] = df['type'].map(np.sort).map(list)   # convert the sorted numpy array to Python list to avoid incorrect formatting (e.g. missing comma) in writing to CSV 
df = df.drop_duplicates(subset=['dt', 'name', 'City'])

Result:

print(df)

           dt name                                        type City
0  05-10-2021   MK  [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]  NYC
2  05-12-2021   MK  [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]  NYC
4  05-13-2021   PS  [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE]  BAL

Pandas Dataframe - sort a list of values in each row of a column

Answers (2)

Related Questions