Reputation: 1618
I am using delta-rs to read and process some delta tables with Pandas.
I ran several experiments with the following fairly simple code:
from deltalake import write_deltalake, DeltaTable

# Read the whole table into a Pandas DataFrame, then write it straight back.
df = DeltaTable(s3_table_URI, storage_options={"AWS_REGION": "eu-west-1"}).to_pandas()
write_deltalake(s3_table_URI, df, mode="overwrite", schema_mode="overwrite", storage_options={"AWS_REGION": "eu-west-1"})
I tried with two different tables:
For the smaller table, the code works with no problems at all.
For the bigger one, however, the code reads the data into memory successfully and I can process it with some transformations (omitted here), but it gets stuck on the write. The transformations don't seem to matter: the code also hangs on a plain read/write round trip with no intermediate steps.
I am using an m4.10xlarge (160 GB of RAM, 40 cores) as a single node on Databricks. I thought it could be an OOM issue, but plenty of memory is still available when the write hangs (more than 50 GB free). Three hours after the write command starts, the cluster is still there, apparently doing nothing.
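One thing I plan to try next (an untested sketch; the batch size and the RecordBatchReader hand-off are my own guesses, not a documented delta-rs fix) is to do the pandas-to-Arrow conversion myself and feed write_deltalake bounded record batches, to check whether converting one giant DataFrame is what stalls the writer:

import pyarrow as pa
from deltalake import write_deltalake

# Convert to Arrow explicitly and split into bounded batches
# (100_000 rows per batch is an arbitrary starting point).
table = pa.Table.from_pandas(df)
batches = table.to_batches(max_chunksize=100_000)
reader = pa.RecordBatchReader.from_batches(table.schema, batches)
write_deltalake(s3_table_URI, reader, mode="overwrite", schema_mode="overwrite", storage_options={"AWS_REGION": "eu-west-1"})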
The same code executed with Polars (which uses delta-rs under the hood) works perfectly:
import polars as pl
df = pl.read_delta(s3_table_URI, storage_options={"AWS_REGION": "eu-west-1"})
df.write_delta(s3_table_URI, mode="overwrite", storage_options={"AWS_REGION": "eu-west-1"})
Has anyone experienced a similar problem? It looks like some sort of bug in the interaction between Pandas and delta-rs, since the Polars path works fine.
I am forced to use Pandas because of other dependencies I cannot update at the company I work for.
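In the meantime, a possible bridge (again an untested sketch; I am assuming pl.from_pandas round-trips the dtypes this table needs) would be to keep all the transformations in Pandas and only switch to Polars for the write:

import polars as pl

# Keep processing in Pandas; convert only for the write step.
df_pl = pl.from_pandas(df)
df_pl.write_delta(s3_table_URI, mode="overwrite", storage_options={"AWS_REGION": "eu-west-1"})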
Upvotes: 0
Views: 81