mytabi

Reputation: 779

Databricks spark.readStream format differences

I am confused about the difference between the following two pieces of code in Databricks:

spark.readStream.format('json')

vs

spark.readStream.format('cloudfiles').option('cloudFiles.format', 'json')

I know that using cloudfiles as the format means Databricks Auto Loader. In terms of performance and functionality, which one is better? Does anyone have experience with this?

Thanks

Upvotes: 2

Views: 4050

Answers (2)

Aman_dgba

Reputation: 1

In my experience, for scheduled batch jobs I have found plain Spark streaming (readStream/writeStream) to be ~60% faster on average. You can use Auto Loader for real-time, event-based updates, since it won't use compute between events (i.e. you do not need to poll the event source continuously).

Reference data from one batch job:

Spark readStream/writeStream: 8412 rows in 1 min 52 s
Auto Loader: 8412 rows in 5 min 30 s
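
For what it's worth, the two timings above can be turned into a ratio with a few lines of arithmetic. This is only a sketch over the single run quoted here (the variable names are made up for illustration), so it says nothing about averages across jobs:

```python
# Timings quoted above, converted to seconds.
spark_stream_s = 1 * 60 + 52   # 1 min 52 s -> 112 s
auto_loader_s = 5 * 60 + 30    # 5 min 30 s -> 330 s

# Fraction of wall-clock time saved by the faster run.
time_saved = 1 - spark_stream_s / auto_loader_s

# How many times faster the faster run was.
speedup = auto_loader_s / spark_stream_s

print(f"{time_saved:.0%} less time, {speedup:.1f}x faster")
# prints "66% less time, 2.9x faster"
```

So for this one job the plain stream took roughly a third of the time, which is in the same range as the ~60% figure.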

Upvotes: 0

Alex Ott

Reputation: 87299

There are multiple differences between these two. When you use Auto Loader you get at least a few extra capabilities on top of the plain file source, and there are more things beyond that (see the docs for all details).
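
To make one such difference concrete, here is a minimal sketch of the two readers side by side, assuming a Databricks runtime for the Auto Loader variant. The function names, paths and the example schema are hypothetical. The point it illustrates is schema handling: the built-in json file source normally needs a schema supplied up front, while Auto Loader can infer one and track it under `cloudFiles.schemaLocation`.

```python
def plain_json_stream(spark, path, schema):
    """Built-in file source: the schema is passed explicitly.

    `schema` can be a DDL string such as "id INT, name STRING"
    (a made-up example schema).
    """
    return spark.readStream.format("json").schema(schema).load(path)


def auto_loader_json_stream(spark, path, schema_location):
    """Auto Loader (Databricks only): the schema is inferred and
    stored under `schema_location`, a directory you choose."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_location)
        .load(path)
    )
```

Both functions return a streaming DataFrame, so the downstream writeStream code can stay the same whichever reader you pick.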

Upvotes: 6
