sopana

Reputation: 375

PySpark dataframe filter based on in-between values

I have a PySpark dataframe with the below values -

[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]

I want only the rows from the DF whose score falls between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF to contain only two ids - EDFG456 & LMNO1011 - as their scores fall between 36 and 37. I achieved this as follows -

from pyspark.sql.functions import substring

input_score_value = 36
# take the first two characters of the score string as the whole-number part
input_df = my_df.withColumn("score_num", substring(my_df.score, 1, 2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))

The above code gives the below output, but it takes too long to process 2 million rows. Is there a better way to do this to reduce the response time?

[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011', score='36.82352941176471')]

Upvotes: 1

Views: 598

Answers (1)

lrnzcig

Reputation: 3947

You can use the function floor.

from pyspark.sql.functions import floor

# floor() truncates the score to its integer part, so the substring column is not needed
output_matched = my_df.filter(floor(my_df.score) == input_score_value)
print(output_matched.take(5))

It should be much faster than the substring approach. Let me know.
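As an aside, a roughly equivalent sketch (assuming the score column can be cast to double; the explicit half-open range below is my own framing, not part of the original answer) is to cast the column and filter on the range directly, which also avoids creating the extra score_num column:

from pyspark.sql.functions import col

input_score_value = 36
# cast the string score to double and keep rows in [value, value + 1)
score_dbl = col("score").cast("double")
output_matched = my_df.filter((score_dbl >= input_score_value) &
                              (score_dbl < input_score_value + 1))
print(output_matched.take(5))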

Upvotes: 1
