Reputation: 1249
Please refer to my sample code below:
sampleDf is my sample Scala DataFrame that I want to filter on two columns, startIPInt and endIPInt.
var row = sampleDf.filter("startIPInt <=" + ip).filter("endIPInt >= " + ip)
I now want to view the contents of this row. The following takes barely a second to execute, but it does not show me the contents of the row object:
println(row)
But this code takes too long to execute:
row.show()
So my question is how should I view the content of this row object? Or is there any issue with the way I am filtering my dataframe?
My initial approach was to use filter as mentioned here: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html#filter(java.lang.String)
According to that, the following line of code gives me an "overloaded method 'filter'" error:
var row = sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)
Can anyone help me understand what is happening here? And what is the right and fastest way to filter a DataFrame and view its contents, as above?
Upvotes: 0
Views: 1010
Reputation: 28322
First, using filter you don't actually get a Row object; you get a new dataframe.
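Since filter returns a dataframe, you need an action to materialize its rows. A minimal sketch, assuming sampleDf and an Int value ip as in your question:

import org.apache.spark.sql.Row

val matches = sampleDf.filter(s"startIPInt <= $ip AND endIPInt >= $ip")
val firstRow: Row = matches.first()          // action: triggers computation, returns one Row
val allRows: Array[Row] = matches.collect()  // action: triggers computation, returns all matching rows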
The reason show takes longer to execute is that Spark is lazy: it only computes transformations when an action is performed on the dataframe (see e.g. Spark Transformation - Why its lazy and what is the advantage?). Calling println on a dataframe only prints its toString representation (essentially the schema) and triggers no computation, so the filter transformations are never actually run. show, on the other hand, is an action that forces the computation, which is why it's slower to execute.
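To see the laziness in action, here is a minimal sketch using the same sampleDf and ip from the question:

val filtered = sampleDf.filter("startIPInt <= " + ip)  // transformation: nothing is computed yet
println(filtered)  // prints only the toString (schema), returns almost instantly
filtered.show()    // action: the filter actually runs here, so this takes longer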
The two variants
sampleDf.filter("startIPInt <=" + ip).filter("endIPInt >= " + ip)
and
sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)
are equivalent and should give the same result, as long as you have imported the Spark implicits (required for the $ notation).
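The "overloaded method 'filter'" error typically goes away once the implicits are in scope. A minimal sketch, assuming your SparkSession is held in a val named spark:

import spark.implicits._  // enables the $"colName" Column syntax

val result = sampleDf.filter($"startIPInt" <= ip && $"endIPInt" >= ip)
result.show()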
Upvotes: 2