Flxnt

Reputation: 327

PySpark dataframe parquet vs delta: different number of rows

I have data written in Delta on HDFS. From what I understand, Delta stores the data as parquet; it just adds an additional layer on top with advanced features.

But when reading the data with PySpark, I get a different result depending on whether the dataframe is read with spark.read.parquet() or spark.read.format('delta').load():

df = spark.read.format('delta').load("my_data")
df.count()
> 184511389

df = spark.read.parquet("my_data")
df.count()
> 369022778

As you can see, the difference is quite big.

Is there something I have misunderstood about Delta vs parquet?

The PySpark version is 2.4.

Upvotes: 5

Views: 8790

Answers (1)

Alex Ott

Reputation: 87079

The most probable explanation is that you wrote into the Delta table twice using the overwrite mode. But Delta is a versioned data format: when you use overwrite, it doesn't delete the previous data, it just writes new files, and the old files are not removed immediately; they are only marked as removed in the transaction log that Delta maintains. When you read from Delta, it knows which files are removed and which are not, and it reads only the current data. The actual deletion of the data files happens when you run VACUUM on the Delta table.
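A minimal sketch of that scenario, reusing the my_data path from the question; the DataFrame df and the use of the Python DeltaTable API from the delta-core package are assumptions for illustration:

from delta.tables import DeltaTable

# Writing the same data twice in overwrite mode: the first set of parquet
# files stays on disk, the transaction log just marks them as removed.
df.write.format("delta").mode("overwrite").save("my_data")
df.write.format("delta").mode("overwrite").save("my_data")

# A Delta read consults the transaction log and only sees the live files.
spark.read.format("delta").load("my_data").count()

# Physical deletion of the stale files only happens with VACUUM
# (default retention is 7 days; a shorter retention requires a safety
# check to be disabled explicitly).
DeltaTable.forPath(spark, "my_data").vacuum()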

But when you read the same directory with the plain parquet reader, it has no information about the removed files, so it reads everything in the directory, and you get twice as many rows.
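One way to see this, assuming only the my_data path from the question: the transaction log itself records which parquet files were logically removed, and those are exactly the files the plain parquet reader still scans:

# Each commit file in _delta_log lists "add" and "remove" actions for the
# underlying parquet files; rows with a non-null remove.path are the files
# that a plain spark.read.parquet("my_data") still picks up.
log = spark.read.json("my_data/_delta_log/*.json")
log.select("add.path", "remove.path").show(truncate=False)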

Upvotes: 13
