Reputation: 783
I'm working on the Yelp dataset using Spark DataFrames, and I'm having trouble with filter().
It seems I can only specify strings, not integers?
Here's my code:
def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count").filter("stars" == 5, "review_count" >= 1000)
}
Here's one row from the yelp dataset:
{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}
Clearly stars and review_count are both numeric, not strings.
The output of my function should be a DataFrame with the name, stars and review_count of every business with 5 stars and a review_count of at least 1000.
Upvotes: 0
Views: 2529
Reputation: 2451
Try casting the columns to int, and combine the two conditions with && (both must hold):
import spark.implicits._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
  yelpBusinesses.select('name, 'stars, 'review_count)
    .filter('stars.cast("int") === 5 && 'review_count.cast("int") >= 1000)
}
Upvotes: 1
Reputation: 1274
I would try:
import spark.implicits._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count")
    .filter($"stars" === 5 && $"review_count" >= 1000)
}
Upvotes: 1
Reputation: 2397
Try this (the $ goes outside the quotes, and column equality is ===):
import spark.implicits._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count")
    .filter($"stars" === 5 && $"review_count" >= 1000)
}
Or like this:
import org.apache.spark.sql.functions._
def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame = {
  yelpBusinesses.select("name", "stars", "review_count")
    // === builds a Column; plain == would return a Scala Boolean
    .filter(col("stars") === lit(5) && col("review_count") >= lit(1000))
}
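For anyone wondering why the code in the question fails: "stars" == 5 is an ordinary Scala comparison between a String and an Int, so it never becomes a Spark column expression. filter() needs either a Column (built with ===, &&, >= and so on) or a single SQL expression string. Below is a minimal sketch of both forms. The object name FiveStarFilterSketch, the fiveStarBusinessesSql helper, the local SparkSession settings and the three sample rows are made up for illustration, and it assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FiveStarFilterSketch {
  // Column-expression form: === returns a Column, and && combines two Columns.
  def fiveStarBusinessesDF(yelpBusinesses: DataFrame): DataFrame =
    yelpBusinesses
      .select("name", "stars", "review_count")
      .filter(col("stars") === 5 && col("review_count") >= 1000)

  // SQL-expression form: the whole predicate is one string parsed by Spark.
  // (Hypothetical helper name, not from the original thread.)
  def fiveStarBusinessesSql(yelpBusinesses: DataFrame): DataFrame =
    yelpBusinesses
      .select("name", "stars", "review_count")
      .filter("stars = 5 AND review_count >= 1000")

  def main(args: Array[String]): Unit = {
    // Local session for the demo only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows shaped like the Yelp sample: double stars, integer review_count.
    val sample = Seq(
      ("Cafe A", 5.0, 1200L),
      ("Cafe B", 5.0, 40L),
      ("Cafe C", 4.5, 3000L)
    ).toDF("name", "stars", "review_count")

    // Both calls should keep only the "Cafe A" row.
    fiveStarBusinessesDF(sample).show()
    fiveStarBusinessesSql(sample).show()

    spark.stop()
  }
}
```

Note that no cast is needed here: comparing the double-typed stars column with the integer literal 5 lets Spark coerce the literal, so 5.0 matches.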
Upvotes: 1