random.me

Reputation: 425

How best to convert from Azure Blob CSV format to a pandas DataFrame while running a notebook in Azure ML

I have a number of large CSV (tab-delimited) files stored as Azure blobs, and I want to create a pandas DataFrame from them. I can do this locally as follows:

from azure.storage.blob import BlobService
import pandas as pd
import os.path

STORAGEACCOUNTNAME = 'account_name'
STORAGEACCOUNTKEY = 'key'
LOCALFILENAME = 'path/to.csv'
CONTAINERNAME = 'container_name'
BLOBNAME = 'bloby_data/000000_0'

blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

# Only download a local copy if we haven't already got one
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)

df_customer = pd.read_csv(LOCALFILENAME, sep='\t')

However, when running the notebook in Azure ML Notebooks, I can't save a local copy and then read from CSV, so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).

I can get to the desired end result (a pandas DataFrame for the blob CSV data) if I first create an Azure ML workspace, read the datasets into it, and then use https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access each dataset as a pandas DataFrame, but I'd prefer to read straight from the blob storage location.

Upvotes: 12

Views: 30401

Answers (5)

hui chen

Reputation: 1074

The accepted answer will not work with the latest Azure Storage SDK; Microsoft has rewritten the SDK completely, which makes upgrading from the old version somewhat painful. The code below should work with the new version.

from azure.storage.blob import ContainerClient
from io import StringIO
import pandas as pd

conn_str = ""
container_name = ""
blob_name = ""

# Create a ContainerClient instance, authenticated via connection string.
container_client = ContainerClient.from_connection_string(conn_str, container_name)
# Download the blob as a StorageStreamDownloader object (held in memory);
# passing an encoding makes readall() return a str rather than bytes.
downloaded_blob = container_client.download_blob(blob_name, encoding='utf8')

df = pd.read_csv(StringIO(downloaded_blob.readall()), low_memory=False)
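
If you only need a single blob, a BlobClient works just as well in the new SDK. A minimal sketch, reusing the conn_str / container_name / blob_name placeholders above; BytesIO and sep='\t' here are assumptions to match the tab-delimited files in the question:

from azure.storage.blob import BlobClient
from io import BytesIO
import pandas as pd

# No encoding is passed, so readall() returns raw bytes; wrap them in BytesIO.
blob_client = BlobClient.from_connection_string(conn_str, container_name, blob_name)
data = blob_client.download_blob().readall()
df = pd.read_csv(BytesIO(data), sep='\t')  # tab-delimited, per the question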

Upvotes: 24

DataFramed

Reputation: 1631

Simple Answer:

Working as of 12th June 2022

Below are the steps to read a CSV file from Azure Blob storage into a pandas DataFrame in a Jupyter notebook (Python).

STEP 1: Generate a SAS token & URL for the target CSV (blob) file in Azure Storage by right-clicking the blob file.

STEP 2: Copy the Blob SAS URL that appears below the button used for generating the SAS token and URL.

STEP 3: Use the line of code below in your Jupyter notebook to import the desired CSV. Replace the url value with the Blob SAS URL copied in the previous step.

import pandas as pd

url = 'Your Blob SAS URL'
df = pd.read_csv(url)
df.head()
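
Since the files in the question are tab-delimited, the only change needed is the separator (a minimal variant of the same call):

df = pd.read_csv(url, sep='\t')  # tab-delimited, as in the question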

Upvotes: 6

Michał Zawadzki

Reputation: 752

Use adlfs (pip install adlfs), an fsspec-compatible filesystem implementation for Azure storage (Data Lake gen1 and gen2, plus Blob Storage):

import pandas as pd

# Service-principal credentials for the storage account
storage_options = {
    'tenant_id': tenant_id,
    'account_name': account_name,
    'client_id': client_id,
    'client_secret': client_secret
}

url = 'az://some/path.csv'
# pandas (>= 1.2) passes storage_options through to fsspec/adlfs
df = pd.read_csv(url, storage_options=storage_options)
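
If you authenticate with the storage account key (as the question does) rather than a service principal, adlfs accepts that too. A minimal sketch, reusing the placeholder names from the question:

import pandas as pd

# Account-key auth instead of a service principal
# (values are the question's placeholders, not real credentials)
storage_options = {
    'account_name': 'account_name',
    'account_key': 'key',
}
df = pd.read_csv('az://container_name/bloby_data/000000_0',
                 sep='\t', storage_options=storage_options)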

Upvotes: 4

maxymoo

Reputation: 36555

I think you want to use get_blob_to_bytes or get_blob_to_text; these should output a string which you can use to create a DataFrame:

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME)
df = pd.read_csv(StringIO(blobstring))

Upvotes: 18

James Nguyen

Reputation: 141

Thanks for the answer; I think a small correction is needed. You need to get the content from the returned blob object, and get_blob_to_text doesn't need the local file name.

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
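
The same pattern should work with get_blob_to_bytes, which the answer above mentions but doesn't show. A minimal sketch, assuming the legacy SDK's Blob object whose .content holds raw bytes (hence BytesIO, plus the question's tab delimiter):

from io import BytesIO
import pandas as pd

# .content is raw bytes here, so wrap it in BytesIO rather than StringIO
blobbytes = blob_service.get_blob_to_bytes(CONTAINERNAME, BLOBNAME).content
df = pd.read_csv(BytesIO(blobbytes), sep='\t')  # tab-delimited, per the question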

Upvotes: 11
