ankit

Reputation: 347

Azure ML Notebook: The code being run in the notebook may have caused a crash or the compute may have run out of memory

I am using Azure ML Notebook with python kernel to run the following code:

%reload_ext rpy2.ipython

from azureml.core import Dataset, Datastore,Workspace

subscription_id = 'abc'
resource_group = 'pqr'
workspace_name = 'xyz'

workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, 'mynewdatastore')

# create tabular dataset from all parquet files in the directory
tabular_dataset_1 = Dataset.Tabular.from_parquet_files(path=(datastore,'/RNM/CRUD_INDIFF/CrudeIndiffOutput_PRD/RW_Purchases/2022-09-05/RW_Purchases_2022-09-05T17:23:01.01.parquet'))
df=tabular_dataset_1.to_pandas_dataframe()
print(df)

After executing this code, the notebook cell shows a Cancelled message, and the following message appears at the top of the cell:

The code being run in the notebook may have caused a crash or the compute may have run out of memory.
Jupyter kernel is now idle.
Kernel restarted on the server. Your state is lost.

2 cores, 14 GB RAM and 28 GB disk space are allocated to the compute instance. The Parquet file I am using in the code is 20.25 GiB, and I think the problem is caused by the large size of this file. Can anyone please help me resolve this error without breaking the file into multiple smaller files? Any help would be appreciated.

Upvotes: 3

Views: 1926

Answers (2)

Muhammad Pathan

Reputation: 74

When reading a dataset with a Pandas read_ function, default data types are assigned to each column: Pandas inspects the values, decides on a type, and loads everything into RAM. A value stored as int8 takes 8x less memory than the same value stored as int64, so you could switch columns to smaller integer and float types. I suspect the error is caused by the 14 GB of RAM being exhausted.
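For example, something like this (just a sketch, not tested against your data; the idea is to downcast each chunk as soon as it is in memory, and the category threshold is an arbitrary rule of thumb):

import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast integer and float columns to the smallest type that fits the values.
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    # String columns with relatively few distinct values can be stored as categoricals.
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() < 0.5 * len(df):
            df[col] = df[col].astype("category")
    return df

# df = downcast(df)
# df.info(memory_usage="deep")  # compare memory usage before and after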

As @ndclt says, you can load the data in chunks; try that first.
If that does not work, I would move away from pandas entirely and use an alternative such as pyspark, dask or polars instead.

These libraries are much better suited to your situation, as they are far more efficient and much faster when dealing with larger amounts of data.
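For instance, a lazy scan with polars only reads the columns and rows you actually ask for (a rough sketch, assuming the parquet file has first been downloaded locally; the column name is a placeholder):

import polars as pl

lazy = pl.scan_parquet("RW_Purchases_2022-09-05T17:23:01.01.parquet")
df = (
    lazy
    .select(["purchase_id"])   # only read the columns you need
    .head(1_000_000)           # and/or only the first N rows
    .collect()
)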

It looks like the Azure ML Dataset class has a method to load data into a Spark DataFrame, so this fits what you are doing. First you need a Spark cluster set up, which you can do in Azure Synapse, and then link it to your Azure ML workspace.

Create a Spark cluster: https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-spark-pool-portal

Link Synapse to the ML workspace: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-link-synapse-ml-workspaces

Dataset to Spark DataFrame: https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset(class)?view=azure-ml-py#azureml-core-dataset-to-spark-dataframe

There is a lot more detail about this in the notebook samples in Azure ML. There should be a folder called azure-synapse with good info and code samples.

Once you set up the Spark cluster and link it to the Azure ML workspace, you should just be able to do the following:

df = tabular_dataset_1.to_spark_dataframe()
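From there you can push column selection and row limits down to the cluster and only pull a small slice back into pandas if you really need it (a sketch only; the column name below is a placeholder):

# Do the heavy lifting in Spark; only convert a small, filtered slice to pandas.
small_pdf = (
    df.select("purchase_id")   # placeholder column name
      .limit(100_000)
      .toPandas()
)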

Upvotes: 2

ndclt

Reputation: 3168

The Parquet file which I am using in the code is of size 20.25 GiB and I think due to the large size of this file, this problem is being created

Yes, almost certainly. And since Parquet files are compressed, the uncompressed data can be even bigger than the file, and the library (from Azure or pandas) adds some overhead on top of that.

To avoid loading the whole file, there are two ideas:

  • load few rows,
  • load less columns (not all of them).

From what I read in the documentation of Dataset.Tabular.from_parquet_files, I cannot find any way to apply either of the two methods above. :/

But you can maybe work around it by downloading the file onto the compute (found in this answer) and then reading it by chunks (found there) or loading only some of the columns.

from azureml.core import Dataset, Datastore, Workspace
import tempfile
import pyarrow.parquet as pq


subscription_id = 'abc'
resource_group = 'pqr'
workspace_name = 'xyz'
dstore_path = '/RNM/CRUD_INDIFF/CrudeIndiffOutput_PRD/RW_Purchases/2022-09-05'
parquet_file_name = 'RW_Purchases_2022-09-05T17:23:01.01.parquet'

workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, 'mynewdatastore')

target = (datastore, dstore_path)
with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    # the parquet file is now in tmpdir: read it by chunk or select
    # the columns you need (if you can)
    pq_file = pq.ParquetFile(f'{tmpdir}/{parquet_file_name}')
    for batch in pq_file.iter_batches():
        print("RecordBatch")
        batch_df = batch.to_pandas()
        # do something with the batch

See the iter_batches documentation: its columns argument lets you load only some columns.
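The loop inside the with block above could then look like this (a sketch only; the column names are placeholders and batch_size is a row count, not bytes):

for batch in pq_file.iter_batches(batch_size=100_000,
                                  columns=["purchase_id", "price"]):
    batch_df = batch.to_pandas()
    # filter / aggregate the batch here, then let it be garbage collected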

Working by batch assumes you don't need the whole file loaded at once. If you do, you will have to switch to a bigger machine for your Jupyter notebook.

Upvotes: 3
