Reputation: 18838
I have a Spark DataFrame in Python that is sorted on a column. How can I select a specific range of rows, for example the middle 50% of the data? If I have 1M rows, I want the rows from index 250K to 750K. How can I do that in PySpark without using collect?
To be more precise, I want something like the take function, but returning the rows within a range, for example take(250000, 750000).
Upvotes: 2
Views: 8413
Reputation: 5075
Here is one way to select a range in a pyspark DF:
Create DF
# assumes an active SparkSession named `spark`
df = spark.createDataFrame(
    data=[
        (10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01"),
        (35, "2013-01-01"), (26, "2016-01-01"), (7, "2012-01-01"), (18, "2011-01-01"),
    ],
    schema=["amount", "date"],
)
df.show()
+------+----------+
|amount|      date|
+------+----------+
|    10|2018-01-01|
|    22|2017-01-01|
|    13|2014-01-01|
|     4|2015-01-01|
|    35|2013-01-01|
|    26|2016-01-01|
|     7|2012-01-01|
|    18|2011-01-01|
+------+----------+
Sort (on date) and insert index (based on row number)
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.orderBy("date")
df = df.withColumn("index", F.row_number().over(w))
df.show()
+------+----------+-----+
|amount|      date|index|
+------+----------+-----+
|    18|2011-01-01|    1|
|     7|2012-01-01|    2|
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
|    22|2017-01-01|    7|
|    10|2018-01-01|    8|
+------+----------+-----+
Get the required range (assuming we want rows 3 through 6, inclusive)
df1 = df.filter(df.index.between(3, 6))
df1.show()
+------+----------+-----+
|amount|      date|index|
+------+----------+-----+
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
+------+----------+-----+
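To get the "middle 50%" from the question rather than hard-coded bounds, the range can be derived from the row count. The following is only a sketch under some assumptions: the helper names middle_bounds and select_middle are hypothetical (not part of PySpark), the index is the 1-based row_number used above, and odd leftovers are rounded so the kept rows stay centered.

```python
def middle_bounds(n_rows, fraction=0.5):
    """Return 1-based inclusive (start, end) row numbers covering the
    central `fraction` of `n_rows` rows. Hypothetical helper."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    skip = int(n_rows * (1 - fraction) / 2)  # rows dropped from each end
    return skip + 1, n_rows - skip


def select_middle(df, order_col, fraction=0.5):
    """Keep the central `fraction` of `df` when ordered by `order_col`.
    Hypothetical helper built on row_number() and between()."""
    # Imported here so middle_bounds can be used without Spark installed.
    from pyspark.sql import Window, functions as F

    start, end = middle_bounds(df.count(), fraction)  # count() runs a Spark job
    w = Window.orderBy(order_col)
    indexed = df.withColumn("index", F.row_number().over(w))
    return indexed.filter(F.col("index").between(start, end))
```

With the 8-row example, middle_bounds(8) gives (3, 6), the same range used above; for 1M rows it gives (250001, 750000). Note that a window with orderBy but no partitionBy moves all rows to a single partition, so this may be slow on very large data.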
Upvotes: 3
Reputation: 3619
This is very simple using between. For example, assuming your sorted DataFrame has a row-number column named index:
df_sample = df.filter(df.index.between(250000, 750000)).select(df.somecolumn)
Once you have created the new DataFrame df_sample, you can perform any operation on it (including take or collect) as needed.
Upvotes: -2