doomdaam

Reputation: 783

Filter won't take in integers? Spark DataFrame

I'm working on the Yelp Dataset using Spark Dataframe. I'm having issues with using filter().

It seems I cannot specify integers, only strings?

Here's my code

def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
    yelpBusinesses.select("name", "stars", "review_count").filter("stars" == 5, "review_count" >= 1000)
  }

Here's one row from the yelp dataset:

{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}

Clearly stars and review_count are both numbers, not strings.

The output of my function should be a DataFrame with the name, stars and review_count of all businesses with 5 stars and a review_count of at least 1000.

Upvotes: 0

Views: 2529

Answers (3)

chlebek

Reputation: 2451

Try casting to int:

    import spark.implicits._
    def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
        yelpBusinesses.select('name, 'stars, 'review_count)
                      .filter('stars.cast("int") === 5 && 'review_count.cast("int") >= 1000)
      }

Upvotes: 1

RudyVerboven

Reputation: 1274

I would try:

    import spark.implicits._
    def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
       yelpBusinesses.select("name", "stars", "review_count")
                   .filter($"stars" === 5 && $"review_count" >= 1000)
    }
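
For context on why the version in the question fails, here is a minimal sketch. It assumes a local SparkSession and a tiny hand-made DataFrame; the object name `FilterSketch` and the sample rows are made up for illustration and are not from the Yelp dataset. In plain Scala, `"stars" == 5` compares a String with an Int and yields a Boolean before Spark ever sees it, whereas `$"stars" === 5` builds a Column expression that Spark evaluates for each row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FilterSketch {
  // Local session for illustration only; a real job would reuse its existing session.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("filter-sketch")
    .getOrCreate()

  import spark.implicits._

  // Hypothetical sample rows, not taken from the Yelp dataset.
  val sample: DataFrame = Seq(
    ("A", 5.0, 1200L),
    ("B", 5.0, 10L),
    ("C", 3.0, 5000L)
  ).toDF("name", "stars", "review_count")

  def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame =
    yelpBusinesses
      .select("name", "stars", "review_count")
      // === and >= on a Column build an expression; && combines both conditions.
      .filter($"stars" === 5 && $"review_count" >= 1000)

  def main(args: Array[String]): Unit = {
    fiveStarBusinessesDF(sample).show()
    spark.stop()
  }
}
```

With this sample data, only the row named "A" satisfies both conditions.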

Upvotes: 1

Gal Naor

Reputation: 2397

Try to use this:

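import spark.implicits._

def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
    yelpBusinesses.select("name", "stars", "review_count")
                  // The $ goes outside the quotes to build a Column; === compares it per row.
                  .filter($"stars" === 5 && $"review_count" >= 1000)
  }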

or like this:

import org.apache.spark.sql.functions._

def fiveStarBusinessesDF(yelpBusinesses: DataFrame):DataFrame = {
        yelpBusinesses.select("name", "stars", "review_count")
                      .filter(col("stars") === lit(5) && col("review_count") >= lit(1000))
      }
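
On the "only strings?" part of the question: `filter` also has an overload that takes a single String, which Spark parses as a SQL expression. That is why a string is accepted at all, and why the comparison has to live inside the string rather than around it. A rough sketch comparing the two forms, assuming a local SparkSession; the object name `SqlStringFilterSketch` and the sample rows are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object SqlStringFilterSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("sql-string-filter-sketch")
    .getOrCreate()

  import spark.implicits._

  // Hypothetical sample rows, not taken from the Yelp dataset.
  val sample: DataFrame = Seq(
    ("A", 5.0, 1200L),
    ("B", 4.5, 3000L)
  ).toDF("name", "stars", "review_count")

  // Column-based form: the condition is built from Column objects.
  def viaColumns(df: DataFrame): DataFrame =
    df.filter(col("stars") === lit(5) && col("review_count") >= lit(1000))

  // String-based form: the whole condition is one SQL expression.
  def viaSqlString(df: DataFrame): DataFrame =
    df.filter("stars = 5 AND review_count >= 1000")
}
```

Both functions should select the same rows; which one to use is mostly a matter of taste, though the Column form catches misspelled operators at compile time.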

Upvotes: 1
