Reputation: 789
I am trying to sort each row of a pandas DataFrame and get the column labels of the sorted values in a new DataFrame. I can do it, but only in a slow way. Can anyone suggest improvements using parallelization or vectorized code? I have posted an example below.
import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# drop categorical columns
gapminder.drop(['country', 'continent'], axis=1, inplace=True)
# print the first three rows
print(gapminder.head(n=3))
   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710
The result I am looking for is this:
  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp
In this case, since pop is always higher than gdpPercap and lifeExp, it always comes first.
I can get the required output with the following code, but the computation takes a long time if the df has a lot of rows/columns. Can anyone suggest an improvement over this?
def sort_df(df):
    sorted_tags = pd.DataFrame(index=df.index, columns=['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)
Upvotes: 0
Views: 2026
Reputation: 911
This is probably as fast as it gets with numpy:
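import numpy as np

def sort_df(df):
    return pd.DataFrame(
        # column labels reordered per row, largest value first
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        # keep the original row index, as the loop version does
        index=df.index,
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )

print(sort_df(gapminder.head(3)))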
  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp
Explanation: np.argsort with axis=1 works along each row, but instead of the sorted values it returns the indices that would sort the row, which can be used to co-sort other arrays. Negating the values makes the order descending. In your case, the indices are used to reorder the column labels: indexing the 1-D array of column names with the 2-D integer index array (NumPy integer-array indexing) returns an array of labels with the same shape as the data.
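To make the two steps concrete, here is a minimal sketch on the first two rows from the question. The variable names (cols, vals, order, ranked) are only illustrative and do not appear in the original code.

```python
import numpy as np

# Column labels and the first two rows of the example data from the question.
cols = np.array(['year', 'pop', 'lifeExp', 'gdpPercap'])
vals = np.array([
    [1952.0, 8425333.0, 28.801, 779.445314],
    [1957.0, 9240934.0, 30.332, 820.853030],
])

# Step 1: per-row positions that sort the negated values ascending,
# i.e. the original values descending.
order = np.argsort(-vals, axis=1)
print(order)

# Step 2: index the 1-D label array with the 2-D integer array;
# the result has the shape of `order`, one ranked label list per row.
ranked = cols[order]
print(ranked)
```

Each row of order holds column positions, and cols[order] turns those positions into labels, which is the array that gets wrapped in a DataFrame.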
Runtime is around 3 ms for your example vs 2.5 s with your function.
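If you want to check the comparison on your own machine, something like the sketch below should work. It uses small synthetic data (an assumption, not the gapminder file), and the function names sort_df_loop and sort_df_vectorized are made up here to keep the two versions apart. Absolute timings will differ from the numbers above depending on hardware and data size.

```python
import timeit

import numpy as np
import pandas as pd


def sort_df_loop(df):
    # Row-by-row version, following the function in the question.
    tags = pd.DataFrame(index=df.index,
                        columns=['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        tags.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return tags


def sort_df_vectorized(df):
    # argsort-based version, following the approach in this answer.
    return pd.DataFrame(
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        index=df.index,
        columns=['tag_{}'.format(i) for i in range(df.shape[1])],
    )


# Synthetic data (hypothetical): 200 rows, 4 columns of random floats, no ties expected.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((200, 4)), columns=['a', 'b', 'c', 'd'])

# Both versions should produce the same labels in the same positions.
same = bool((sort_df_loop(demo).values == sort_df_vectorized(demo).values).all())
print('identical results:', same)

# Average wall-clock time over a few runs; numbers vary by machine.
t_loop = timeit.timeit(lambda: sort_df_loop(demo), number=3) / 3
t_vec = timeit.timeit(lambda: sort_df_vectorized(demo), number=3) / 3
print('loop: {:.4f}s  vectorized: {:.4f}s'.format(t_loop, t_vec))
```

Note that the two versions are only guaranteed to agree when a row has no tied values, since ties may be ordered differently by sort_values and argsort.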
Upvotes: 2