Reputation: 31
I am attempting to filter my DataFrame to remove entries with counts less than 100. A sample row from the combined DataFrame is shown below:
Row(movieID=26, avg(rating)=3.452054794520548, count=73)
When I run the code below, I get the following error:
TypeError: '>=' not supported between instances of 'method' and 'int'
movieDataset = spark.createDataFrame(movies)
# Average rating per movieID; drop the meaningless avg(movieID) column
movieratings = movieDataset.groupBy("movieID").mean().drop("avg(movieID)")
# Number of ratings per movieID
topMovieIDs = movieDataset.groupBy("movieID").count()
# Join average ratings with rating counts
combined = movieratings.join(topMovieIDs, on=["movieID"], how='inner')
# This line raises the TypeError above
filtered = combined.filter(combined.count >= 100).collect()
How can I filter the DataFrame to keep only rows where count is 100 or greater?
Upvotes: 1
Views: 820
Reputation: 15283
Try this:
filtered = combined.filter(combined["count"] >= 100).collect()
count is also a DataFrame method name. Attribute access like combined.count resolves to the method, not the column, which is why the comparison against an int fails; use bracket notation or col() to refer to the column unambiguously.
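To see the distinction, here is a minimal sketch (the sample data is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 120), (2, 73)], ["movieID", "count"])

print(type(df.count))     # a bound method: DataFrame.count()
print(type(df["count"]))  # a Column object referring to the "count" column
df.filter(df["count"] >= 100).show()  # keeps only movieID 1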
This should work too:
from pyspark.sql.functions import col
filtered = combined.filter(col("count") >= 100).collect()
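col("count") builds the column reference by name without going through DataFrame attribute lookup at all, so it sidesteps the method/column name collision entirely.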
Upvotes: 0
Reputation: 31
Disregard, I got it to work. It should look like:
filtered = combined.filter(combined[2] >= 100).collect()
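Note that combined[2] selects the column by position (the third column, count, in this layout), so the combined["count"] or col("count") forms from the answer above are more robust if the column order ever changes.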
Upvotes: 0