Reputation: 69
My company is implementing Azure Data Explorer (ADX) as a backend. They also want to use Databricks for Data Science projects including data exploration. I'm in charge of connecting Databricks to ADX.
I first tried Azure Kusto package.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
...
df = dataframe_from_result_table(RESPONSE.primary_results[0])
Full steps here
Functionlly this works well. But it loses completely the lazy loading feature of both ADX and Databricks-Spark.
I thought that because df is a just a Pandas dataframe, also if I try to convert this to a hive table it persists the data, which is not required, as we need fresh online data we don't want a local copy.
The next thing I've tried was to have this loaded in a spark dataframe. I've tried the following code (after installing the relevant libraries)
df = spark.read. \
format("com.microsoft.kusto.spark.datasource"). \
option("kustoCluster", KUSTO_URI). \
option("kustoDatabase",KUSTO_DATABASE). \
option("kustoQuery", "some_table_in_adx"). \
option("kustoAadAppId",CLIENT_ID). \
option("kustoAadAppSecret",CLIENT_SECRET). \
option("kustoAadAuthorityID", AAD_TENANT_ID).load()
which again loads the data into the spark dataframe without any issue.
However, performance wise it's far far away from the direct query in ADX. A count in a table of 600 thousands records is subseconds in ADX while it's more than 20 seconds in the Databricks notebook on a DS3_V2 14GB 4 cores.
Before even to consider a SaveAsTable or CreateOrReplaceTempView I wonder why I'm experiencing this performance issue. So my questions are :
Thanks for your help
Upvotes: 1
Views: 1321
Reputation: 515
For the Python part -
Pandas is not Spark dataframe therefore it's not lazy computed, to utilize these together you may use Spark parallelize.
For the Spark ADX connector -
This is indeed lazy loading. It is not evaluated until some evaluation method was requested - like the count.
If the count was done by Spark syntax i.e spark.read.kusto...count()
then it would cause all the data to be first brought into spark and then operate count on it - so 20 seconds sounds legit, to compare with a query simply change the value of "kustoQuery" to "some_table_in_adx | count" which will lead to a count on the ADX side - uploading to spark just the final int result.
The connector offers simple Kusto query or distributed export command via the readMode
option with ForceSingleMode
to perform a simple Kusto query from driver node - as explained in the docs
As Hive table operated over dataframes - once you create a dataframe from Kusto read - it will also be lazy evaluated with table operations.
Upvotes: 2