I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a JAR program on Databricks which outputs many CSV files into Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go to the folder p, download all the CSV files to my local drive by right-clicking the p folder and clicking Download, and then read these CSV files in R to do my analysis.
My issue is that sometimes my runs can generate more than 10,000 CSV files, and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial or R package that helps me read the CSV files from the path above without downloading them. For example, is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files there the same way I do now.
EDIT: the full URL to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official Azure Databricks document CSV Files, you can read a CSV file directly in R from an Azure Databricks notebook, as described in the R example of its section Read CSV files notebook example.
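For instance, here is a minimal sketch of that notebook approach; the DBFS mount point /mnt/myfolder is a hypothetical assumption, not part of the original question:

library(SparkR)
# Read every CSV under the folder into one Spark DataFrame
# (assumes the container is mounted at the hypothetical path /mnt/myfolder)
df <- read.df("/mnt/myfolder/subfolder/output/old/p/",
              source = "csv", header = "true", inferSchema = "true")
head(df)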
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to read a CSV file directly from a blob URL with a SAS token of Azure Blob Storage. Here are my steps:
1. Install the R package reticulate via install.packages("reticulate").
2. Install the Python package azure-storage-blob with the code below. (Note: the BaseBlobService API used in the next step comes from the legacy azure-storage-blob 2.x SDK; if pip installs the newer v12 SDK by default, pin the version as shown.)

%sh
pip install azure-storage-blob==2.1.0
3. Run a Python script that generates a container-level SAS token and uses it to build a list of blob URLs with the SAS token appended; see the code below.
library(reticulate)

py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

# Generate a read-only SAS token for the whole container, valid for one hour
sas_token = blob_service.generate_container_shared_access_signature(
    container_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1)
)

# List the blobs under the folder prefix and append the SAS token to each URL
blob_names = blob_service.list_blob_names(container_name, prefix='myfolder/')
blob_urls_with_sas = [
    'https://' + account_name + '.blob.core.windows.net/' +
    container_name + '/' + blob_name + '?' + sas_token
    for blob_name in blob_names
]
")
4. Pull the resulting list of blob URLs from Python into R:

blob_urls_with_sas <- py$blob_urls_with_sas
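As a quick sanity check (purely illustrative):

length(blob_urls_with_sas)    # number of blobs found under the prefix
blob_urls_with_sas[[1]]       # first blob URL, with the SAS token after the '?'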
5. Now I can use different ways in R to read a CSV file from a blob URL with its SAS token, such as those below.
5.1. Using base R:

df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
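Since a single run can produce more than 10,000 CSV files, you will probably want to read all of them rather than just one. Here is a minimal sketch using data.table, assuming every file shares the same columns:

library(data.table)
# Keep only the URLs that point at CSV files (the SAS token follows the '?')
csv_urls <- grep("\\.csv\\?", blob_urls_with_sas, value = TRUE)
# Download each CSV over HTTPS and stack them into one data.table
df_all <- rbindlist(lapply(csv_urls, fread))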
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.
Update for your quick question: