Junaid Mohammad

Reputation: 477

Maintaining order when using the unique function in Python

I have some code where, say, the following are the columns of my df:

df.columns = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D2', 'E1', 'E2']

list = df.columns.str[:1]
list = np.unique(list)

I am trying to get the unique letters and the unique numbers, in the order in which they appear in the columns.

My code doesn't maintain the ordering, and I can't figure out how to do so.

Thank you

Expected output:

letters = [A, B, C, D, E]
numbers = [1, 2]

Upvotes: 1

Views: 92

Answers (3)

Darkonaut

Reputation: 21674

This one uses a regex and keeps working if your column names contain multi-character prefixes or multi-digit numbers:

import re
import pandas as pd

df = pd.DataFrame(columns=['EE2', 'A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D11', 'E1'])

# Raw string so that \d and \D are not treated as string escapes
split_ = [re.findall(r'\d+|\D+', col) for col in df.columns]

list(pd.Series([col[0] for col in split_]).drop_duplicates())
# ['EE', 'A', 'B', 'C', 'D', 'E']
list(pd.Series([col[1] for col in split_]).drop_duplicates())
# ['2', '1', '11']
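
As a rough sketch of what the split produces, the snippet below runs the same pattern on a plain list of names, without a DataFrame. It assumes every name is a non-digit prefix followed by digits. `split_name` is a hypothetical helper, and it swaps in `re.fullmatch` plus `dict.fromkeys` for the `findall`/`drop_duplicates` combination above, so a name that does not fit the pattern (here the made-up `'total'`) is skipped instead of raising an IndexError on `col[1]`.

```python
import re

columns = ['EE2', 'A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D11', 'E1']

# The original pattern splits a name into runs of digits and non-digits.
print(re.findall(r'\d+|\D+', 'EE2'))  # ['EE', '2']
print(re.findall(r'\d+|\D+', 'D11'))  # ['D', '11']

def split_name(name):
    """Hypothetical helper: return (prefix, digits), or None if the name does not fit."""
    match = re.fullmatch(r'(\D+)(\d+)', name)
    return match.groups() if match else None

# 'total' is an assumed extra column with no digits; it is filtered out.
pairs = [p for p in map(split_name, columns + ['total']) if p is not None]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
letters = list(dict.fromkeys(prefix for prefix, _ in pairs))
numbers = list(dict.fromkeys(digits for _, digits in pairs))

print(letters)  # ['EE', 'A', 'B', 'C', 'D', 'E']
print(numbers)  # ['2', '1', '11']
```

Note that the numbers stay strings here, as in the output above; wrap them in `int()` if numeric values are needed.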

Upvotes: 1

FHTMitchell

Reputation: 12157

Assuming your example is representative, you can use a neat little trick that I got from Raymond Hettinger. In Python 3.6 and later, dicts preserve insertion order (an implementation detail of CPython 3.6, guaranteed by the language from 3.7), so you can use their keys as an efficient ordered set.

list(dict.fromkeys(c[0] for c in df.columns))
# --> ['A', 'B', 'C', 'D', 'E']

list(dict.fromkeys(int(c[1]) for c in df.columns))
# --> [1, 2]
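
To make the "order of first appearance" behaviour visible, here is a small sketch on a deliberately shuffled list of names (the list is made up for illustration and stands in for `df.columns`). It still assumes one letter followed by one digit per name.

```python
# Hypothetical column names, not in alphabetical order
columns = ['B1', 'A2', 'B2', 'A1']

# Keys keep the order in which they were first inserted.
letters = list(dict.fromkeys(c[0] for c in columns))
numbers = list(dict.fromkeys(int(c[1]) for c in columns))

# A sorted set, for comparison, loses the original order.
sorted_letters = sorted(set(c[0] for c in columns))

print(letters)         # ['B', 'A']
print(sorted_letters)  # ['A', 'B']
print(numbers)         # [1, 2]
```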

Upvotes: 2

jpp

Reputation: 164773

You can use toolz.unique instead. It is equivalent to the unique_everseen recipe found in the itertools docs: internally, it iterates while maintaining a set of seen items.

import pandas as pd
from toolz import unique

df = pd.DataFrame(columns=['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D2', 'E1', 'E2'])

res = list(unique(df.columns.str[:1]))

['A', 'B', 'C', 'D', 'E']
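
If you would rather not add a dependency, the "set of seen items" idea can be sketched in a few lines of plain Python. This is a simplified version of the recipe (no `key` argument), it assumes the items are hashable, and it runs on a plain list standing in for `df.columns`.

```python
def unique_everseen(iterable):
    """Yield items in order of first appearance, skipping repeats."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)  # remember the item so later repeats are skipped
            yield item

columns = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D2', 'E1', 'E2']

letters = list(unique_everseen(c[:1] for c in columns))
numbers = list(unique_everseen(int(c[1:]) for c in columns))

print(letters)  # ['A', 'B', 'C', 'D', 'E']
print(numbers)  # [1, 2]
```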

A more Pandorable solution would be to convert the Index object to a pd.Series and use drop_duplicates. This also relies on hashing:

res = df.columns.str[:1].to_series().drop_duplicates().values

array(['A', 'B', 'C', 'D', 'E'], dtype=object)

Upvotes: 1
