Reputation: 7056
I have a dataframe column whose values are UUIDs followed by some other file info:
ff8738hjgdj792__somevar1.txt
9jldh93k4043ik__some3var.txt
I would like to sort the dataframe by the leading UUID only (everything before the double underscore) and ignore the rest of the string when sorting.
At the moment I do:
df.sort_values(by='df_column_name')
but this is not yielding the desired result because pd is taking the entire string into account.
How do I go about achieving this with pandas?
Upvotes: 2
Views: 1295
Reputation: 25259
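Pandas 1.1.0+ has a key parameter on sort_values. Use it as you would the key of a regular Python sort, with one difference: the function receives the whole column as a Series and must return a Series of the same length.
Sample df: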
col1
0 ff8738hjgdj792__somevar1.txt
1 9jldh93k4043ik__some3var.txt
df['col1'].sort_values(key=lambda x: x.str.split('__').str[0])
Out[809]:
1 9jldh93k4043ik__some3var.txt
0 ff8738hjgdj792__somevar1.txt
Name: col1, dtype: object
Or
df.sort_values(by='col1', key=lambda x: x.str.split('__').str[0])
Out[812]:
col1
1 9jldh93k4043ik__some3var.txt
0 ff8738hjgdj792__somevar1.txt
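To check that only the UUID prefix drives the ordering, here is a minimal self-contained sketch. The third row and the helper name uuid_prefix are made up for illustration, and kind="mergesort" is an added assumption so that rows sharing a UUID keep their original relative order.

```python
import pandas as pd

# Sample data; the second and third rows share a UUID (illustrative values)
df = pd.DataFrame({
    "col1": [
        "ff8738hjgdj792__somevar1.txt",
        "9jldh93k4043ik__zzz.txt",
        "9jldh93k4043ik__aaa.txt",
    ]
})


def uuid_prefix(col: pd.Series) -> pd.Series:
    """Return the part of each string before the first double underscore."""
    return col.str.split("__", n=1).str[0]


# mergesort is stable, so ties on the prefix stay in their input order
sorted_df = df.sort_values(by="col1", key=uuid_prefix, kind="mergesort")
print(sorted_df)
```

If the suffix were taken into account, the "aaa" row would come before the "zzz" row; with the key function they stay in input order because their prefixes are equal.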
Upvotes: 1
Reputation: 1386
Since you are already using pandas, I suggest adding pandasql. It makes it easy to accomplish what you're looking for.
import pandas as pd
import pandasql as ps
# Recreating the data you provided
df = pd.DataFrame(['ff8738hjgdj792__somevar1.txt', '9jldh93k4043ik__some3var.txt'], columns = ['something'])
# Sort by the substring before the double underscore.
# SQLite's substr is 1-indexed; instr finds the position of '__'.
df_res = ps.sqldf("""
select something
from df
order by substr(something, 1, instr(something, '__') - 1) """, locals())
print(df_res)
Returns
something
0 9jldh93k4043ik__some3var.txt
1 ff8738hjgdj792__somevar1.txt
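For comparison, the same result can be sketched in plain pandas without the extra dependency and without the key parameter, which may help on pandas older than 1.1.0. The temporary column name _uuid is a hypothetical choice, not something from the question.

```python
import pandas as pd

df = pd.DataFrame(
    ["ff8738hjgdj792__somevar1.txt", "9jldh93k4043ik__some3var.txt"],
    columns=["something"],
)

# Put the UUID prefix in a temporary column, sort on it, then drop it
df_res = (
    df.assign(_uuid=df["something"].str.split("__").str[0])
      .sort_values("_uuid", kind="mergesort")
      .drop(columns="_uuid")
      .reset_index(drop=True)  # renumber rows 0..n-1, as in the SQL result
)
print(df_res)
```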
Upvotes: 0