user2205916

Reputation: 3456

Remove element from PySpark DataFrame column

I know PySpark DataFrames are immutable, so I'd like to create a new column resulting from a transformation applied to an existing column of the PySpark DataFrame. My data is too large to use collect().

The column in question is a list of lists of unique ints (no repeats of an int in a given list), e.g.:

[1]
[1,2]
[1,2,3]
[2,3]

The above is a toy example: my actual DataFrame has lists with a maximum length of 52 unique ints. I would like to iterate over the set of unique elements across all lists (here, [1,2,3]) and, on each iteration, generate a new column in which that one element has been removed from every list.

So for the first iteration:

Remove element 1, such that the results are:

[]
[2]
[2,3]
[2,3]

For the second iteration:

Remove element 2, such that the results are:

[1]
[1]
[1,3]
[3]

And so on, repeating the above with element 3.

For each iteration, I would like to append the result as a new column to the original PySpark DataFrame, then run some queries using this "filtered" column as a row filter on the original DataFrame.

My question is, how do I convert a column of a PySpark DataFrame to a list? My data set is large, so df.select('columnofintlists').collect() results in memory issues (e.g.: Kryo serialization failed: Buffer overflow. Available: 0, required: 1448662. To avoid this, increase spark.kryoserializer.buffer.max value.).
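
Roughly, the workflow I am after looks like the sketch below (untested; it assumes array_remove, which needs Spark 2.4+):

from pyspark.sql import functions as F

# The set of distinct ints across all lists is small (at most 52 values),
# so collecting just that set should be safe even where collecting the
# full column is not.
distinct_df = df.select(F.explode('columnofintlists').alias('elem')).distinct()
unique_elems = [row['elem'] for row in distinct_df.collect()]

# For each unique element, append a column with that element removed
# from every list.
for elem in unique_elems:
    df = df.withColumn(
        'filtered_{}'.format(elem),
        F.array_remove('columnofintlists', elem),
    )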

Upvotes: 1

Views: 2530

Answers (2)

muon

Reputation: 14037

Here is an example from the PySpark docs:

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import array_remove

>>> spark = SparkSession.builder.master("local[*]").getOrCreate()

>>> df = spark.createDataFrame([([1, 2, 3, 1, 1],), ([],)], ['data'])

>>> df.select(array_remove(df.data, 1)).collect()
[Row(array_remove(data, 1)=[2, 3]), Row(array_remove(data, 1)=[])]

Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=w
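
Note that array_remove is only available in Spark 2.4 and later. To append the result as a new column on the original DataFrame, as the question asks, you can use withColumn instead of select (the column name here is just a placeholder):

>>> df.withColumn('data_removed', array_remove(df.data, 1)).show()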

Upvotes: 1

Zhang Tong

Reputation: 4719

df.toLocalIterator() returns an iterator over the rows of the DataFrame, so you can loop over them without collecting everything into driver memory at once.
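
For example, a minimal sketch (process here stands in for whatever per-row work you need; it is not a real API):

for row in df.select('columnofintlists').toLocalIterator():
    process(row['columnofintlists'])  # handles one list at a time

This streams rows to the driver instead of materializing the whole result like collect() does; memory use is bounded by the largest partition.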

Upvotes: 0
