Abhi

Reputation: 123

Convert a PySpark Dataframe into a List with actual values

I am trying to convert a PySpark dataframe column to a list of plain values, NOT Row objects.

My ultimate goal is to use it as a filter for filtering another dataframe.

I have tried the following:

X = df.select("columnname").collect()

But when I use it to filter, I am unable to:

Y = dtaframe.filter(~dtaframe.columnname.isin(X))

I also tried converting to a numpy array and aggregating with collect_list():

df.groupby('columnname').agg(collect_list(df["columnname"]))

Please advise.

Upvotes: 0

Views: 336

Answers (1)

Sathish

Reputation: 44

The collect function returns a list of Row objects by collecting the data from the executors. If you need a list of values in native datatypes, you have to fetch the column from each Row object explicitly.

This code creates a DataFrame with a single column named number of LongType:

df = spark.range(0,10,2).toDF("number")

Convert this into a Python list:

num_list = [row.number for row in df.collect()]

Now this list can be used with any dataframe to filter out values via the isin function (note that col must be imported from pyspark.sql.functions):

from pyspark.sql.functions import col

df1 = spark.range(10).toDF("number")
df1.filter(~col("number").isin(num_list)).show()
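The extraction step itself is plain Python: PySpark's Row behaves much like a namedtuple, so the pattern can be sketched without a Spark cluster. Below, Row is simulated with collections.namedtuple purely for illustration (an assumption; in real code the Rows come from df.collect()):

```python
from collections import namedtuple

# Simulate what df.collect() would return for spark.range(0, 10, 2).toDF("number"):
# a list of Row-like objects, each holding one value under the column name.
Row = namedtuple("Row", ["number"])
collected = [Row(number=n) for n in range(0, 10, 2)]

# The key step: pull the column out of each Row object to get native values.
num_list = [row.number for row in collected]
print(num_list)  # [0, 2, 4, 6, 8]

# Filtering "another dataframe" then reduces to membership exclusion on values,
# which is what ~col("number").isin(num_list) expresses in PySpark.
remaining = [n for n in range(10) if n not in num_list]
print(remaining)  # [1, 3, 5, 7, 9]
```

This is why passing the raw result of collect() to isin fails: the list contains Row objects, not the values themselves, so no value in the other dataframe ever matches.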

Upvotes: 1
