Reputation: 11
I'm currently working in a dev environment in Databricks, using a notebook to apply some Python code to analyse some dummy data (just a few thousand rows) held in a database table. I then deploy this to the main environment and run it on the real data (hundreds of millions of rows).
To start with I just need the values from a single column that meet certain criteria. To get at the data I'm currently doing this (in outline, with the real column name and filter swapped for placeholders):
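# line 1: pull the matching values with a Spark SQL query (placeholder column/filter)
df = spark.sql("SELECT some_column FROM database.table WHERE some_column = 'some_value'")
# line 2: convert the Spark DataFrame to a pandas DataFrame
data = df.toPandas()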
The rest of the Python notebook then does its thing on that data, which works fine in the dev environment, but when I run it for real it falls over at line 2 with an out-of-memory error.
I want to import the data DIRECTLY into the pandas DataFrame, removing the need to convert from Spark, as I'm assuming that will avoid the error. After a LOT of Googling I still can't work out how; the only thing I've tried that appears syntactically valid is:
data = pd.read_table (r'database.table')
but I just get:
'PermissionError: [Errno 13] Permission denied:'
(nb. unfortunately I have no control over the content, form or location of the database I'm querying)
Upvotes: 1
Views: 9038
Reputation: 1242
Your assumption is very likely to be untrue.
Spark is a distributed computation engine and pandas is a single-node toolset, so when you run a query over millions of rows the pandas side is likely to fail. When you call df.toPandas(), Spark moves all of the data to your driver node, so if it is larger than the driver's memory it will fail with an out-of-memory exception. In other words, if your dataset is larger than memory, pandas is not going to work well.
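Roughly, assuming a table name of database.table as in your question:

df = spark.table("database.table")

# An action like count() runs distributed across the cluster and only a single
# number comes back to the driver, so it works whatever the table size.
print(df.count())

# toPandas() collects every remaining row into the driver's memory; with
# hundreds of millions of rows this is the step that throws out of memory.
pdf = df.toPandas()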
Also, when using pandas on Databricks you are missing all of the benefits of the underlying cluster: you are only using the driver.
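For example, the same column filter expressed with the Spark DataFrame API is executed by the cluster rather than the driver (column name and value are placeholders):

from pyspark.sql import functions as F

df = spark.table("database.table")

# Runs on the executors across the cluster; nothing is pulled back to the
# driver at this point.
matching = df.filter(F.col("some_column") == "some_value").select("some_column")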
There are two sensible options to solve this:
Upvotes: 1