Reputation: 5031
I have a DataFrame with multiple columns, and I need to select two of them and dump them into a list. Here is the DataFrame:
df.show()
+------------------------------------+---------------+---------------+
|email_address |topic |user_id |
+------------------------------------+---------------+---------------+
|[email protected] |hello_world |xyz123 |
+------------------------------------+---------------+---------------+
|[email protected] |hello_kitty |lmn456 |
+------------------------------------+---------------+---------------+
The result I need is a list of tuples:
[([email protected], xyz123), ([email protected], lmn456)]
This is what I tried:
tuples = df.select(col('email_address'), col('topic')).rdd.flatMap(lambda x, y: list(x, y)).collect()
and it throws an error:
Py4JJavaError Traceback (most recent call last)
<command-4050677552755250> in <module>()
--> 114 tuples = df.select(col('email_address'), col('topic')).rdd.flatMap(lambda x, y: list(x, y)).collect()
115
116
/databricks/spark/python/pyspark/rdd.py in collect(self)
829 # Default path used in OSS Spark / for non-credential passthrough clusters:
830 with SCCallSiteSync(self.context) as css:
--> 831 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
832 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
833
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
How can I fix it?
Upvotes: 0
Views: 2134
Reputation: 32670
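You should use map rather than flatMap for that. The function is called with one argument per element (the Row), so lambda x, y fails, and list(x, y) is not a valid call either: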
from pyspark.sql.functions import col

tuples = df.select(col('email_address'), col('topic')) \
    .rdd \
    .map(lambda x: (x[0], x[1])) \
    .collect()
print(tuples)
# output
[('[email protected]', 'hello_world'), ('[email protected]', 'hello_kitty')]
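To see why map fits here and flatMap does not, the following is a minimal plain-Python sketch, not Spark itself. It assumes a namedtuple called Row as a stand-in for pyspark.sql.Row, uses made-up placeholder addresses, and imitates flatMap with itertools.chain. It shows that a map-style transform keeps one tuple per row, a flatMap-style transform flattens everything into a single list, and a two-parameter lambda fails when called with a single row.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for pyspark.sql.Row: a named tuple with two fields.
Row = namedtuple("Row", ["email_address", "topic"])

# Placeholder data; the addresses are made up for illustration.
rows = [
    Row("a@example.com", "hello_world"),
    Row("b@example.com", "hello_kitty"),
]

# map-like behaviour: exactly one output element per input row.
mapped = [(r[0], r[1]) for r in rows]

# flatMap-like behaviour: each per-row result is flattened into the output.
flat_mapped = list(chain.from_iterable((r[0], r[1]) for r in rows))

# The function receives a single argument (the row), so a
# two-parameter lambda raises TypeError when it is called.
two_arg = lambda x, y: (x, y)
try:
    two_arg(rows[0])
    arity_error = None
except TypeError as exc:
    arity_error = type(exc).__name__

print(mapped)       # list of 2-tuples, one per row
print(flat_mapped)  # one flat list of four strings
print(arity_error)  # name of the exception raised by the two-parameter lambda
```

In the real job the TypeError is raised inside the Python worker and surfaces on the driver wrapped in a Py4JJavaError, which is why the traceback in the question ends in py4j code.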
Another way is to collect the rows of the DataFrame and then loop over them to extract the values:
rows = df.select(col('email_address'), col('topic')).collect()
tuples = [(r.email_address, r.topic) for r in rows]
print(tuples)
# output
[('[email protected]', 'hello_world'), ('[email protected]', 'hello_kitty')]
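Note that the expected output in the question pairs email_address with user_id, while the attempted code selects topic. If user_id is what is needed, the same pattern should apply with a different field name. Here is a small plain-Python sketch of that idea, again assuming a namedtuple stand-in for Row and placeholder addresses; the helper name pairs is hypothetical and not part of PySpark.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row with the three columns from the question.
Row = namedtuple("Row", ["email_address", "topic", "user_id"])

# Placeholder data; the addresses are made up for illustration.
rows = [
    Row("a@example.com", "hello_world", "xyz123"),
    Row("b@example.com", "hello_kitty", "lmn456"),
]

def pairs(rows, first, second):
    """Build a list of 2-tuples from two named fields of each row (hypothetical helper)."""
    return [(getattr(r, first), getattr(r, second)) for r in rows]

by_topic = pairs(rows, "email_address", "topic")
by_user = pairs(rows, "email_address", "user_id")

print(by_topic)
print(by_user)
```

With an actual DataFrame, the equivalent change would be to select 'user_id' instead of 'topic' and read r.user_id in the list comprehension.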
Upvotes: 1