JohnJ

Reputation: 7056

sort pandas string column by first few characters of string

I have a column in a dataframe whose values are UUIDs followed by some other file info:

ff8738hjgdj792__somevar1.txt
9jldh93k4043ik__some3var.txt

I would like to sort the dataframe by the leading UUID only (everything before the double underscore) and ignore the rest of the string.

At the moment I do:

df.sort_values(by='df_column_name')

but this does not give the desired result, because pandas compares the entire string.

How do I go about achieving this with pandas?

Upvotes: 2

Views: 1295

Answers (2)

Andy L.

Reputation: 25259

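Pandas 1.1.0+ adds a key parameter to sort_values. It works much like key in Python's built-in sort, except that the function receives the whole column as a Series and must return a Series of the same length.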

Sample df:

                           col1
0  ff8738hjgdj792__somevar1.txt
1  9jldh93k4043ik__some3var.txt

df['col1'].sort_values(key=lambda x: x.str.split('__').str[0])

Out[809]:
1    9jldh93k4043ik__some3var.txt
0    ff8738hjgdj792__somevar1.txt
Name: col1, dtype: object

Or, to sort the whole dataframe:

df_final = df.sort_values(by='col1',key=lambda x: x.str.split('__').str[0])

Out[812]:
                           col1
1  9jldh93k4043ik__some3var.txt
0  ff8738hjgdj792__somevar1.txt

Upvotes: 1

Lars Skaug

Reputation: 1386

Since you are already using pandas, I suggest adding pandasql. It makes it easy to accomplish what you're looking for.

import pandas as pd
import pandasql as ps

# Recreating the data you provided
df = pd.DataFrame(['ff8738hjgdj792__somevar1.txt', '9jldh93k4043ik__some3var.txt'], columns = ['something']) 

# Selecting and sorting by the substring before the double underscore.
# SQLite's substr is 1-based; instr finds where '__' starts.
df_res = ps.sqldf("""
    select something 
    from df 
    order by substr(something, 1, instr(something, '__') - 1) """, locals())


print(df_res)

Returns

                      something
0  9jldh93k4043ik__some3var.txt
1  ff8738hjgdj792__somevar1.txt
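
One detail worth checking when writing the query: pandasql runs on SQLite, and SQLite's substr counts positions from 1. A start position of 0 therefore returns one character fewer than the requested length. The sketch below uses Python's built-in sqlite3 module, not pandasql, purely to illustrate this, and assumes the uuid is everything before the first '__'.

```python
import sqlite3

con = sqlite3.connect(':memory:')
value = 'ff8738hjgdj792__somevar1.txt'

# Compare a 0-based start, a 1-based start, and cutting at the separator
from_zero, from_one, up_to_sep = con.execute(
    "select substr(?, 0, 14), substr(?, 1, 14), "
    "substr(?, 1, instr(?, '__') - 1)",
    (value, value, value, value),
).fetchone()
con.close()

print(len(from_zero))  # 13: the last uuid character is missing
print(from_one)        # ff8738hjgdj792
print(up_to_sep)       # ff8738hjgdj792
```

Cutting at instr(something, '__') also avoids hard-coding the uuid length, so it still works if the uuids are not all equally long.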

Upvotes: 0
