Reputation: 1905
I would like to perform an action on a single column. Unfortunately, once I transform that column it is no longer part of the dataframe it came from, but a Column object, and as such it cannot be collected.
Here is an example:
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(array=[1, 2, 3])])
df['array'].collect()
This produces the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
How can I use the collect() function on a single column?
Upvotes: 16
Views: 27297
Reputation: 563
In 2024, being mindful of Python serialization overhead for RDD-level actions, the accepted answer with
.rdd.flatMap(lambda x: x)
is not ideal for high-performance Spark.
When you use only the DataFrame API, you are working with JVM constructs; the Python code is purely declarative, with the exception of UDFs.
Whenever you leave the DataFrame API and drop down to the RDD world, the data has to be serialized out of the JVM into a Python process and back again (in the default scenario where Apache Arrow is not enabled). This performance overhead can be avoided.
To avoid this you can perform the following:
array_var = [ row['array'] for row in df.select("array").collect() ]
Here you unwrap each Row with a Pythonic list comprehension instead of going through RDD operations.
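For completeness, a minimal end-to-end sketch of this DataFrame-only approach (the SparkSession setup and the variable names are illustrative assumptions, not part of the original answer):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(array=[1, 2, 3])])

# collect() returns a list of Row objects; the list comprehension
# unwraps the 'array' field of each Row in plain Python.
array_var = [row['array'] for row in df.select("array").collect()]
# array_var == [[1, 2, 3]]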
Upvotes: 0
Reputation: 330173
Spark >= 2.0
Starting from Spark 2.0.0 you need to explicitly specify .rdd in order to use flatMap:
df.select("array").rdd.flatMap(lambda x: x).collect()
Spark < 2.0
Just select and flatMap:
df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]]
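For the Spark >= 2.0 variant the result has the same shape; a quick sketch on the question's data (the SparkSession setup is an assumption, and the last example is an extra illustration, not part of the original answer):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(array=[1, 2, 3])])

# Each Row is flattened into its single field, so you get one list per row.
df.select("array").rdd.flatMap(lambda x: x).collect()
## [[1, 2, 3]]

# If you want the individual elements instead of one list per row,
# flatten the array field itself rather than the Row.
df.select("array").rdd.flatMap(lambda x: x.array).collect()
## [1, 2, 3]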
Upvotes: 22