Michal

Reputation: 1905

How do I collect a single column in Spark?

I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object, and as such it cannot be collected.

Here is an example:

from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(array=[1, 2, 3])])
df['array'].collect()

This produces the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable

How can I use the collect() function on a single column?

Upvotes: 16

Views: 27297

Answers (2)

Prometheus

Reputation: 563

In 2024, being mindful of Python serialization overhead for RDD-level actions, the accepted answer's

.rdd.flatMap(lambda x: x)

is not ideal for high-performance Spark.

When you use only the DataFrame API, you are really working with JVM bytecode constructs; the Python code is purely declarative, with the exception of UDFs.

Whenever you leave the DataFrame API and drop into the RDD world, the whole dataset has to be serialized out of the JVM into a Python process and back again (in the default scenario where Apache Arrow is not enabled). This is a performance overhead that can be avoided.

To avoid this, you can do the following:

array_var = [row['array'] for row in df.select("array").collect()]

This unwraps each Row with a plain Python list comprehension instead of going through the RDD API.
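
If Apache Arrow is available, the columnar toPandas() path avoids the per-row pickling entirely. A minimal sketch, assuming Spark 3.x and a SparkSession named spark (on Spark 2.3/2.4 the config key is spark.sql.execution.arrow.enabled instead, and array columns may come back as NumPy arrays rather than lists):

# Enable Arrow-accelerated transfer between the JVM and Python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# The column leaves the JVM in columnar batches instead of pickled rows
array_var = df.select("array").toPandas()["array"].tolist()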

Upvotes: 0

zero323

Reputation: 330173

Spark >= 2.0

Starting from Spark 2.0.0, you have to explicitly access the .rdd in order to use flatMap:

df.select("array").rdd.flatMap(lambda x: x).collect()

Spark < 2.0

Just select and flatMap:

df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]] 
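
For context: a Row is iterable over its fields, which is why flatMap(lambda x: x) unwraps the single array field from every Row. A minimal end-to-end sketch for Spark >= 2.0, assuming a SparkSession named spark:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(array=[1, 2, 3])])

# Each Row iterates over its fields, so flatMap yields the array values
df.select("array").rdd.flatMap(lambda x: x).collect()
## [[1, 2, 3]]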

Upvotes: 22
