OmG

Reputation: 18838

Select a range in Pyspark

I have a Spark DataFrame in Python that is sorted by a column. How can I select a specific range of rows, for example the middle 50%? If the DataFrame has 1M rows, I want the rows from index 250K to 750K. How can I do this in PySpark without using collect?

To be more precise, I want something like the take function that returns the rows within a range, for example take(250000, 750000).

Upvotes: 2

Views: 8413

Answers (2)

Grant Shannon

Reputation: 5075

Here is one way to select a range of rows in a PySpark DataFrame:

Create the DataFrame

# Assumes an active SparkSession named `spark` (as in the pyspark shell)
df = spark.createDataFrame(
    data=[
        (10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01"),
        (35, "2013-01-01"), (26, "2016-01-01"), (7, "2012-01-01"), (18, "2011-01-01"),
    ],
    schema=["amount", "date"],
)

df.show()

+------+----------+
|amount|      date|
+------+----------+
|    10|2018-01-01|
|    22|2017-01-01|
|    13|2014-01-01|
|     4|2015-01-01|
|    35|2013-01-01|
|    26|2016-01-01|
|     7|2012-01-01|
|    18|2011-01-01|
+------+----------+

Sort by date and add an index column (based on row number)

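from pyspark.sql.window import Window
from pyspark.sql import functions as F

# No partitionBy here, so Spark moves all rows to a single partition to number them;
# this can be slow on large DataFrames.
w = Window.orderBy("date")
# row_number() is 1-based and follows the window ordering
df = df.withColumn("index", F.row_number().over(w))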

df.show()

+------+----------+-----+
|amount|      date|index|
+------+----------+-----+
|    18|2011-01-01|    1|
|     7|2012-01-01|    2|
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
|    22|2017-01-01|    7|
|    10|2018-01-01|    8|
+------+----------+-----+

Get the required range (assuming we want rows 3 through 6, inclusive)

df1 = df.filter(df["index"].between(3, 6))

df1.show()

+------+----------+-----+
|amount|      date|index|
+------+----------+-----+
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
+------+----------+-----+
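
To get the question's "middle 50%" rather than hard-coded row numbers, the bounds can be derived from the row count. This is only a minimal sketch, assuming the 1-based index column added above; middle_bounds and select_middle are hypothetical helper names, not part of PySpark.

```python
def middle_bounds(total_rows, fraction=0.5):
    """Return 1-based inclusive (start, end) bounds for the middle `fraction` of rows."""
    if total_rows < 0 or not 0 < fraction <= 1:
        raise ValueError("total_rows must be >= 0 and fraction must be in (0, 1]")
    kept = round(total_rows * fraction)   # number of rows to keep
    skip = (total_rows - kept) // 2       # rows dropped before the range
    # If kept is 0, end < start and the range is empty.
    return skip + 1, skip + kept


def select_middle(df, index_col="index", fraction=0.5):
    """Filter a DataFrame that already has a 1-based row-number column (assumed name: index)."""
    # Imported here so middle_bounds can be used without Spark installed.
    from pyspark.sql import functions as F

    # df.count() triggers a Spark job, but no rows are collected to the driver.
    start, end = middle_bounds(df.count(), fraction)
    return df.filter(F.col(index_col).between(start, end))
```

For the 8-row example, middle_bounds(8) gives (3, 6), the same range used above. For 1M rows it gives (250001, 750000), because row_number() starts at 1.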

Upvotes: 3

Pushkr

Reputation: 3619

This is simple with between. For example, assuming your DataFrame has a column named index that holds each row's position in the sort order:

df_sample = df.filter(df["index"].between(250000, 750000)).select("somecolumn")

Once you have the new DataFrame df_sample, you can perform any operation on it (including take or collect) as needed.

Upvotes: -2
