user3868051

Reputation: 1249

Error using filter on column value with Spark dataframe

Please refer to my sample code below:

sampleDf is my sample Scala dataframe, which I want to filter on the two columns startIPInt and endIPInt.

var row = sampleDf.filter("startIPInt <= " + ip).filter("endIPInt >= " + ip)

I now want to view the contents of this row. The following takes barely a second to execute, but it does not show me the contents of the row object:

println(row)

But this code takes too long to execute:

row.show()

So my question is: how should I view the contents of this row object? Or is there an issue with the way I am filtering my dataframe?

My initial approach was to use filter as mentioned here: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html#filter(java.lang.String)

However, the following line of code gives me an "overloaded method 'filter'" error:

var row = sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)

Can anyone help me understand what is happening here, and what the right and fastest way is to filter a dataframe and view its contents?

Upvotes: 0

Views: 1010

Answers (1)

Shaido

Reputation: 28322

First, using filter you don't really get a row/Row object; you get a new dataframe.

The reason show takes longer to execute is that Spark is lazy: it only computes transformations when an action is performed on the dataframe (see e.g. Spark Transformation - Why its lazy and what is the advantage?). Using println on a dataframe does not trigger anything, so the filter transformations are never actually computed. show, on the other hand, is an action and requires the computation to run, which is why it takes longer to execute.
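As a minimal, self-contained sketch of this behaviour (the SparkSession setup and the sample data here are made up for illustration):

import org.apache.spark.sql.SparkSession

// Hypothetical setup so the example runs standalone.
val spark = SparkSession.builder().appName("filter-demo").master("local[*]").getOrCreate()
import spark.implicits._

val sampleDf = Seq((100, 200), (120, 180)).toDF("startIPInt", "endIPInt")
val ip = 150

// filter is a transformation: it returns a new dataframe and computes nothing yet.
val row = sampleDf.filter("startIPInt <= " + ip).filter("endIPInt >= " + ip)

// println only prints the dataframe's schema summary, e.g. [startIPInt: int, endIPInt: int].
println(row)

// show is an action: it triggers the actual computation and prints the rows.
row.show()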

Using

sampleDf.filter("startIPInt <= " + ip).filter("endIPInt >= " + ip)

and

sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)

are equivalent and should give the same result, as long as you have imported the Spark implicits (needed for the $ notation).
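A short sketch of the two styles side by side (continuing with the sampleDf and ip from the example above; collect is just one way to pull the matching rows back to the driver):

// The $ column syntax requires the implicits in scope:
import spark.implicits._

// SQL-expression style.
val viaString = sampleDf.filter("startIPInt <= " + ip).filter("endIPInt >= " + ip)

// Column-expression style, equivalent to the above.
val viaColumns = sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)

// Actions such as show or collect materialize the result;
// collect returns the contents as an Array[Row].
viaColumns.show()
viaColumns.collect().foreach(println)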

Upvotes: 2
