Reputation: 651
I have a PySpark DataFrame like this:
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.show()
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| a|
| 3| b|
| 4| a|
+---+----+
I group this DataFrame by the name column:
df.groupBy("name").count().show()
+----+-----+
|name|count|
+----+-----+
| a| 3|
| b| 1|
+----+-----+
Now, after grouping the DataFrame, I am trying to keep only the names whose count is lower than 3. For example, here I am looking to get something like this:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
Upvotes: 1
Views: 5314
Reputation: 339
try this:
from pyspark.sql import functions as F
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.groupBy("name").count().where(F.col('count') < 3).show()
F is an alias for the functions module; you can use any identifier you want, but it is usually written as F or func, which is just a matter of convention.
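For reference, two equivalent ways to write the same filter (a sketch against the df built above; filter is just an alias of where, and the predicate can also be passed as a SQL expression string):
from pyspark.sql import functions as func

counts = df.groupBy("name").count()

# `filter` is an alias of `where`; both build the same plan
counts.filter(func.col("count") < 3)

# the predicate can also be passed as a SQL expression string
counts.where("count < 3")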
Each form gives the same result:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
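A side note: groupBy().count() names the output column count, which can read confusingly next to the DataFrame.count() action. If you prefer an explicit aggregation, aliasing the aggregate sidesteps that (a minimal sketch; the cnt alias is just an illustrative name):
from pyspark.sql import functions as F

df.groupBy("name") \
    .agg(F.count("id").alias("cnt")) \
    .where(F.col("cnt") < 3) \
    .show()
This prints the same single row for b, with the column headed cnt instead of count.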
Upvotes: 4