How to sort each row of pandas dataframe and return column index based on sorted values of row

Question

I am trying to sort each row of a pandas dataframe and store the column labels, ordered by the sorted values, in a new dataframe. I have a working solution, but it is slow. Can anyone suggest improvements using parallelization or vectorized code? An example is below.

import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'

# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

# drop categorical columns
gapminder.drop(['country', 'continent'], axis=1, inplace=True)

# print the first three rows
print(gapminder.head(n=3))

   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710

The result I am looking for is this:

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

In this case, since pop is always higher than year, gdpPercap and lifeExp, it always comes first.

I can get the required output with the following code, but it becomes slow when the dataframe has many rows or columns.

Can anyone suggest an improvement over this?

def sort_df(df):
    sorted_tags = pd.DataFrame(index = df.index, columns = ['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i,:] = list( df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)

Matthias Ossadnik · Accepted Answer

This is probably as fast as it gets with numpy:

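import numpy as np

def sort_df(df):
    return pd.DataFrame(
        # argsort on the negated values gives descending order per row
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        index=df.index,  # keep the original row labels
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )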

print(sort_df(gapminder.head(3)))

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

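Explanation: np.argsort with axis=1 works along each row, but instead of returning the sorted values it returns the positions that would sort them, which makes it useful for co-sorting a second array. Negating the values gives descending order (this works because all columns are numeric). In your case, the positions are used to index the array of column names. Indexing a 1-D array with a 2-D integer array returns an array with the shape of the index array, so the result has one row of column names per row of the dataframe.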
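
To make the indexing step concrete, here is a small sketch on a hand-made two-row array. The array `values` and the labels in `columns` are made up for illustration and are not part of the gapminder data.

```python
import numpy as np

# Hypothetical toy data: two rows, three labeled columns.
columns = np.array(['a', 'b', 'c'])
values = np.array([[1.0, 3.0, 2.0],
                   [9.0, 7.0, 8.0]])

# Positions that would sort each row in descending order.
order = np.argsort(-values, axis=1)
print(order)
# [[1 2 0]
#  [0 2 1]]

# Indexing the 1-D label array with the 2-D position array
# yields a 2-D array of labels with the same shape as `order`.
labels = columns[order]
print(labels)
# [['b' 'c' 'a']
#  ['a' 'c' 'b']]
```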

Runtime is around 3 ms for your example vs 2.5 s with your function.
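
These timings will vary with hardware and data size. A minimal sketch for checking on your own machine that both versions agree and how they compare, assuming randomly generated data: the names `sort_df_loop` and `sort_df_fast` and the random frame are hypothetical stand-ins, and `to_numpy()` is used here in place of `.values`.

```python
import time
import numpy as np
import pandas as pd

def sort_df_loop(df):
    # Row-by-row version, following the function in the question.
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    out = pd.DataFrame(index=df.index, columns=tags)
    for i in range(df.shape[0]):
        out.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return out

def sort_df_fast(df):
    # Vectorized version, following the answer (to_numpy() is a swap for .values).
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    order = np.argsort(-df.to_numpy(), axis=1)
    return pd.DataFrame(df.columns.to_numpy()[order], index=df.index, columns=tags)

# Assumption: random floats, so ties within a row are practically impossible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 4)), columns=['a', 'b', 'c', 'd'])

start = time.perf_counter()
slow = sort_df_loop(df)
mid = time.perf_counter()
fast = sort_df_fast(df)
end = time.perf_counter()

# Compare the two results cell by cell.
same = slow.to_numpy().tolist() == fast.to_numpy().tolist()
print('identical results:', same)
print('loop: {:.4f}s, vectorized: {:.4f}s'.format(mid - start, end - mid))
```

If a row contains equal values, the two versions may order the tied columns differently, so the equality check is only meaningful on data without ties.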
