user8902273

Reputation: 21

How to sort rows of dataframe in pyspark

I have a dataframe with two columns that contain numbers, and I need to sort the dataframe row-wise rather than column-wise. Everything I find explains how to sort a dataframe by a column, but I cannot find how to sort the values within every row of a dataframe in PySpark.

col1    col2
2       1
3       2

Expected output

col1    col2
1       2
2       3

Upvotes: 1

Views: 3664

Answers (1)

Wong Tat Yau

Reputation: 983

You may need a workaround to produce your desired result.

Here is an example that reorders the columns based on the values in one row.

Your dataframe needs an index column first, so that each row can be identified.

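from pyspark.sql.functions import col

# Sample data; the 'index' column identifies each row.
df = spark.createDataFrame([['index1', 3, 2, 1], ['index2', 2, 1, 3]], ['index', 'a', 'b', 'c'])
# Every column except 'index' holds values to sort.
columns = [i for i in df.columns if i != 'index']
df.show()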

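def sort_row_df(row_to_sort):
    # Fetch the row whose values decide the column order.
    row_data = df.filter(col('index') == row_to_sort).collect()[0]
    # Pair each value with its column name and sort the pairs by value.
    sorted_row = sorted([[row_data[col_], col_] for col_ in columns])
    # Keep only the column names, now in sorted order.
    rearrange_col = [i[1] for i in sorted_row]
    return df.select("index", *rearrange_col)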

Let's say you wish to sort based on row 'index1':

row_to_sort = 'index1'
sorted_df = sort_row_df(row_to_sort)
sorted_df.show()

To sort based on row 'index2' instead:

row_to_sort = 'index2'
sorted_df = sort_row_df(row_to_sort)
sorted_df.show()

If you want to sort all the data by rows, I would suggest transposing the data, sorting it, and transposing it back again. You may refer to how to transpose a dataframe in PySpark.

Upvotes: 1
