Matias Cardona

Reputation: 1

Foundry code workbooks are too slow, how to iterate faster?

I've noticed that code workbooks are too slow when querying from tables. It is much slower than using SQL from a data warehouse. What is the correct workflow to quickly pull and join data for iterating analysis?

Upvotes: 0

Views: 573

Answers (2)

nicornk

Reputation: 673

"What is the correct workflow to quickly pull and join data for iterating analysis?"

For quick one-off analysis I would recommend using the Foundry JDBC/ODBC driver (installed on your local computer) to query the Foundry SQL Server. Note that this only works for moderately sized result sets and low query complexity.

This will allow you to have turnaround times of seconds instead of minutes on your queries.
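As a rough sketch of that workflow (the DSN name, dataset path, and use of `pyodbc`/`pandas` here are assumptions for illustration, not Foundry-documented specifics):

```python
def build_query(dataset_path: str, limit: int = 1000) -> str:
    # Foundry SQL addresses datasets by their full path, quoted in backticks;
    # a LIMIT keeps the result set small enough for the driver.
    return f'SELECT * FROM `{dataset_path}` LIMIT {limit}'

query = build_query("/My Org/clean/customers", limit=500)

# With the Foundry ODBC driver installed locally (DSN name is an assumption):
#   import pyodbc
#   import pandas as pd
#   conn = pyodbc.connect("DSN=Foundry")
#   df = pd.read_sql(query, conn)
```

Because the query runs against the SQL Server rather than spinning up a Spark module, the round trip is typically seconds.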

Upvotes: 0

fmsf

Reputation: 37137

As I hinted in the comment, this is very hard to answer because code workbooks were designed for interactivity, so they are normally very fast. That doesn't mean there aren't reasons for them to become slower. I'll list some here; maybe they can help you speed things up:

  • Running code workbooks straight from raw can be slow! Check how many files, and what types of files, back a particular dataset. In raw these may be CSV files rather than snappy/parquet, which would make your compute faster. CSV also leads code workbooks to try to infer the schema every time you iterate. Adding a simple raw -> clean transform in PySpark code repositories may help a ton here.

  • Your dataset may be poorly optimized, with too many files for the data size. This leads code workbooks to spend a lot of time hitting disk to open each file. You can verify this by going into the dataset's Details tab -> Files and checking the size of your files. It may be worth adding a repartition in your clean step (same as above). This is Spark, not Foundry; read more here: Is it better to have one large parquet file or lots of smaller parquet files?

  • Your organization may not have enough resources for your compute, or too many people may be using code workbooks at the same time for whatever quota you have set up. This is something you'll need to check with your platform team or support channels.

  • Consider using AQE and local mode: How do I get better performance in my Palantir Foundry transformation when my data scale is small?

  • If you are using Python: avoid UDFs, as these can make your code particularly slow, especially if you are comparing against SQL. PySpark UDFs are notoriously slow: Spark functions vs UDF performance?
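The repartitioning advice above can be sketched with a small sizing helper (the 128 MB target is a common Spark rule of thumb, not a Foundry-specific requirement, and the function name is made up for illustration):

```python
import math

def target_partitions(dataset_bytes: int, target_file_mb: int = 128) -> int:
    # Aim for output parquet files of roughly target_file_mb each,
    # so code workbooks open a handful of well-sized files instead
    # of thousands of tiny ones.
    return max(1, math.ceil(dataset_bytes / (target_file_mb * 1024 ** 2)))

# e.g. a 10 GiB dataset -> 80 partitions of ~128 MiB each
n = target_partitions(10 * 1024 ** 3)

# In the raw -> clean PySpark transform you would then write out:
#   df.repartition(n)
```

This is only a starting point; skewed columns or downstream joins may justify a different partitioning scheme.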

Upvotes: 1
