Reputation: 13401
I have a dataframe as below:
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': ["west, east", "east, north",
                         "south, west", "east, south"]})

   a            b
0  1   west, east
1  2  east, north
2  3  south, west
3  4  east, south
I want to get the unique strings from column b, like below.
Expected Output:
["east", "west", "north", "south"] # order doesn't matter here
My Efforts
op = []
for _, value in df['b'].items():
    op.extend(value.split(', '))
op = set(op)
This gives me the correct result, but is there a more efficient way to do it?
My original dataset has around a million rows and thousands of unique values.
Upvotes: 1
Views: 1295
Reputation: 404
You'd have to profile your code to determine whether this is quicker for your particular use case, but pandas' built-in vectorised methods may show some benefit on larger data sets.
Try a combination of Series.str.split() and Series.unique().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html
# Split the comma-separated strings into one column per token
df_split = df['b'].str.split(', ', expand=True)

# For each column, add its unique values to a set
uniques = set()
for col in df_split:
    uniques.update(df_split[col].unique())
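As a quick sanity check, the loop above can be run end to end on the sample frame from the question (the `df` below simply recreates that example data):

```python
import pandas as pd

# Recreate the sample frame from the question
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': ["west, east", "east, north",
                         "south, west", "east, south"]})

# Splitting on ', ' (comma plus space) keeps tokens free of leading spaces
df_split = df['b'].str.split(', ', expand=True)

uniques = set()
for col in df_split:
    uniques.update(df_split[col].unique())

print(uniques)  # a set, so the order is arbitrary
```

Note that if the rows had unequal numbers of tokens, `expand=True` would pad the short rows with `None`, which would then also land in the set.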
Upvotes: 1
Reputation: 862921
You can join the column into one long string, then split it, convert it to a set and finally to a list:
a = list(set(', '.join(df['b']).split(', ')))
print(a)
['south', 'north', 'west', 'east']
Or use a set comprehension with split for flattening:
a = list({y for x in df['b'] for y in x.split(', ')})
print(a)
['south', 'north', 'west', 'east']
A pure pandas solution uses Series.str.split, DataFrame.stack and Series.unique, then converts to a list:
a = df.b.str.split(', ', expand=True).stack().unique().tolist()
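A closely related pandas idiom (an alternative, not part of the answer above) flattens the split lists with Series.explode instead of stack; a minimal sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': ["west, east", "east, north",
                         "south, west", "east, south"]})

# split() without expand=True yields a column of lists;
# explode() then flattens those lists into one long Series
a = df['b'].str.split(', ').explode().unique().tolist()
print(a)
```

`explode` is available in pandas 0.25+ and avoids the intermediate wide DataFrame that `expand=True` plus `stack()` builds.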
Upvotes: 5