Reputation: 156
How can I import and use the "Koalas: pandas API for Apache Spark" open-source Python package in Palantir Foundry?
I know that you can import packages that aren't available by default through a Code Repository, and I have done this before. Can I follow the same process for the Koalas package, or do I need to take another route?
Upvotes: 2
Views: 641
Reputation: 639
Koalas is officially included in PySpark as the **pandas API on Spark** as of Apache Spark 3.2. In Spark 3.2+, you no longer need to import Koalas separately, as it ships with pyspark. The only required action is to add pandas and pyarrow, as these are dependencies that Code Repositories don't include by default. You can do so via the Libraries tab.
You can confirm that it works using this test transform:
from transforms.api import transform_df, Output


@transform_df(
    Output("OUTPUT_DATASET_PATH"),
)
def compute():
    import pyspark.pandas as ps

    psdf = ps.DataFrame(
        {'a': [1, 2, 3, 4, 5, 6],
         'b': [100, 200, 300, 400, 500, 600],
         'c': ["one", "two", "three", "four", "five", "six"]},
        index=[10, 20, 30, 40, 50, 60])
    return psdf.to_spark()
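In a more typical transform you will already have a Spark DataFrame coming from an Input. A minimal sketch of converting it to the pandas API and back (the dataset paths and the dropna step are placeholders, assuming Spark 3.2 where to_pandas_on_spark() is available):

from transforms.api import transform_df, Input, Output


@transform_df(
    Output("OUTPUT_DATASET_PATH"),
    source_df=Input("INPUT_DATASET_PATH"),
)
def compute(source_df):
    # Wrap the incoming Spark DataFrame with the pandas API on Spark,
    # apply pandas-style operations, then convert back for the Output.
    psdf = source_df.to_pandas_on_spark()
    psdf = psdf.dropna()
    return psdf.to_spark()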
To confirm that you are using Spark 3.2+ in your Code Repository, merge any pending upgrade PRs. Prior to Spark 3.2, it was possible to import Koalas through the Libraries tab.
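For reference, a pre-3.2 repository with koalas added via the Libraries tab would use the package under its original import name; a rough sketch (the dataset path is a placeholder):

from transforms.api import transform_df, Output
import databricks.koalas as ks


@transform_df(
    Output("OUTPUT_DATASET_PATH"),
)
def compute():
    # The standalone Koalas package exposes the same pandas-like API
    # under databricks.koalas instead of pyspark.pandas.
    kdf = ks.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
    return kdf.to_spark()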
Upvotes: 0
Reputation: 156
I was able to use a Code Repository to upload a local clone of the package and then add it in the platform using the steps detailed here: How to create python libraries and how to import it in palantir foundry
However, shortly afterwards the Palantir admins introduced an update that included Koalas as a native package in the platform. I have not yet had time to try it for any major tasks.
Upvotes: 3