How to get the same percent_rank in SQL and pandas?

Question

I was learning pyspark which uses HiveQL and found it interesting that the percent rank gives two different answers for pyspark-sql and pandas.

Question Source with sql code: https://www.windowfunctions.com/questions/ranking/3

How to get the same result as SQL in pandas?

Two Questions

What is the python code that gives same result as SQL?
What is the SQL code that gives the same result as pandas?

pyspark-sql

q = """
select name, weight,
       percent_rank() over (order by weight) as percent_rank_wt
from cats
order by weight
"""
spark.sql(q).show()

SQL gives this table. I would like same table using pandas.

+-------+------+-------------------+
|   name|weight|    percent_rank_wt|
+-------+------+-------------------+
| Tigger|   3.8|                0.0|
|  Molly|   4.2|0.09090909090909091|
|  Ashes|   4.5|0.18181818181818182|
|Charlie|   4.8| 0.2727272727272727|
| Smudge|   4.9|0.36363636363636365|
|  Felix|   5.0|0.45454545454545453|
|   Puss|   5.1| 0.5454545454545454|
| Millie|   5.4| 0.6363636363636364|
|  Alfie|   5.5| 0.7272727272727273|
|  Misty|   5.7| 0.8181818181818182|
|  Oscar|   6.1| 0.9090909090909091|
| Smokey|   6.1| 0.9090909090909091|
+-------+------+-------------------+

pandas

methods = {'average', 'min', 'max', 'first', 'dense'}

df[['name','weight']].sort_values('weight').assign(
     pct_avg=df['weight'].rank(pct=True,method='average'),
     pct_min=df['weight'].rank(pct=True,method='min'),
     pct_max=df['weight'].rank(pct=True,method='max'),
     pct_first=df['weight'].rank(pct=True,method='first'),
     pct_dense=df['weight'].rank(pct=True,method='dense')
).sort_values('weight')
       name  weight   pct_avg   pct_min   pct_max  pct_first  pct_dense
4    Tigger     3.8  0.083333  0.083333  0.083333   0.083333   0.090909
0     Molly     4.2  0.166667  0.166667  0.166667   0.166667   0.181818
1     Ashes     4.5  0.250000  0.250000  0.250000   0.250000   0.272727
11  Charlie     4.8  0.333333  0.333333  0.333333   0.333333   0.363636
3    Smudge     4.9  0.416667  0.416667  0.416667   0.416667   0.454545
2     Felix     5.0  0.500000  0.500000  0.500000   0.500000   0.545455
9      Puss     5.1  0.583333  0.583333  0.583333   0.583333   0.636364
7    Millie     5.4  0.666667  0.666667  0.666667   0.666667   0.727273
5     Alfie     5.5  0.750000  0.750000  0.750000   0.750000   0.818182
8     Misty     5.7  0.833333  0.833333  0.833333   0.833333   0.909091
6     Oscar     6.1  0.958333  0.916667  1.000000   0.916667   1.000000
10   Smokey     6.1  0.958333  0.916667  1.000000   1.000000   1.000000

setup

import numpy as np
import pandas as pd

import pyspark
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark import SparkConf, SparkContext, SQLContext
spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

df = pd.DataFrame({
    'name': [
        'Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
        'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'
    ],
    'breed': [
        'Persian', 'Persian', 'Persian', 'British Shorthair',
        'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon', 'Maine Coon',
        'Maine Coon', 'Maine Coon', 'British Shorthair'
    ],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
    'color': [
        'Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell', 'Brown',
        'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell', 'Brown', 'Black'
    ],
    'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]
})

schema = StructType([
    StructField('name', StringType(), True),
    StructField('breed', StringType(), True),
    StructField('weight', DoubleType(), True),
    StructField('color', StringType(), True),
    StructField('age', IntegerType(), True),
])

sdf = sqlContext.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")

How to get the same percent_rank in SQL and pandas?

Two Questions

pyspark-sql

pandas

setup

Answers (1)

What is the python code that gives same result as SQL?

What is the SQL code that gives the same result as pandas?

Related Questions