Patrick McKercher

Reputation: 31

Error while filtering a pyspark DataFrame

I am attempting to filter my DataFrame to remove entries with counts less than 100. A sample row from the combined DataFrame is shown below:

Row(movieID=26, avg(rating)=3.452054794520548, count=73)

When I run the code below, I get the following error:

TypeError: '>=' not supported between instances of 'method' and 'int'

movieDataset = spark.createDataFrame(movies)
movieratings = movieDataset.groupBy("movieID").mean().drop("avg(movieID)")
topMovieIDs = movieDataset.groupBy("movieID").count()
combined = movieratings.join(topMovieIDs, on=["movieID"], how='inner')
filtered = combined.filter(combined.count >= 100).collect()

How can I filter the DataFrame to keep only rows with a count of 100 or greater?

Upvotes: 1

Views: 820

Answers (2)

Steven

Reputation: 15283

Try this:

filtered = combined.filter(combined["count"] >= 100).collect()

count is the name of a DataFrame method by default. combined.count is ambiguous, since it could refer to either the method or the column, so you have to be more specific. Attribute access resolves to the bound method, which is why comparing it to an int raises the TypeError you saw.
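
For illustration, here is a minimal sketch of what Python actually sees in each case, using a throwaway one-row DataFrame shaped like your combined row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(26, 3.45, 73)], ["movieID", "avg(rating)", "count"])

# Attribute access finds the DataFrame.count() method, not the column
print(type(df.count))     # <class 'method'>

# Bracket access always resolves to the column
print(type(df["count"]))  # <class 'pyspark.sql.column.Column'>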


This should work too:

from pyspark.sql.functions import col
filtered = combined.filter(col("count") >= 100).collect()
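
If you want to avoid the name clash entirely, another option is to rename the aggregate column right after building it. A sketch assuming the same pipeline as in the question (ratingCount is just an illustrative name):

# Rename "count" immediately so attribute access is unambiguous downstream
topMovieIDs = movieDataset.groupBy("movieID").count().withColumnRenamed("count", "ratingCount")
combined = movieratings.join(topMovieIDs, on=["movieID"], how="inner")
filtered = combined.filter(combined.ratingCount >= 100).collect()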

Upvotes: 0

Patrick McKercher

Reputation: 31

Disregard, I got it to work.

It should look like:

filtered = combined.filter(combined[2] >= 100).collect()

(Indexing by position works here because count is the third column of the joined DataFrame, though it does depend on the column order.)

Upvotes: 0
