Suraj

Reputation: 675

REST API to query Databricks table

I have a use case and need help choosing the best available approach.

I use Azure Databricks to run data transformations and create tables in the presentation/gold layer. The underlying data for these tables lives in an Azure Storage account.

The transformation logic runs twice daily and updates the gold layer tables.

I have several such tables in the gold layer, e.g. a table to store single customer view data.

An external application from a different system needs access to this data, i.e. the application initiates an API call for details about a customer, and the matching customer details have to be sent back in the response by querying the single customer view table.

Question:

  1. Is the Databricks SQL API the solution for this?

  2. As it is a Spark table, I assume the response will not be quick. Is this correct, or is there a better solution for this?

  3. Is Databricks designed for such use cases, or is it a better approach to copy this table (gold layer) into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?

  4. What are the cons of this approach? One would be that the Databricks cluster has to be up and running all the time, i.e. an interactive cluster. Anything else?

Upvotes: 2

Views: 11194

Answers (2)

Alex Ott

Reputation: 87279

Update, April 2023: there is now a SQL Statement Execution API for querying Databricks SQL tables via REST API.

It's possible to use Databricks for that, although it heavily depends on the SLAs - how fast the response should be. Answering your questions in order:

  1. There is no standalone API for executing queries and getting back results (yet; see the update above). But you can create a thin wrapper using one of the drivers for Databricks: Python, Node.js, Go, or JDBC/ODBC (a minimal Python sketch follows after this list).

  2. Response time depends heavily on the size of the data, on whether the data is already cached on the nodes, and on other factors (partitioning of the data, data skipping, etc.). Databricks SQL warehouses are also able to cache query results, so they won't reprocess the data if the same query was already executed.

  3. Storing data in operational databases is also an approach often used by different customers. But it depends heavily on the size of the data and other factors - if you have a huge gold layer, then SQL databases may not be the best solution from a cost/performance perspective (a PySpark sketch of that copy step also follows after the list).

  4. For such queries it's recommended to use Databricks SQL, which is more cost efficient than an always-running interactive cluster. Also, on some cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if queries to your gold layer don't happen very often, you can configure the warehouses with auto-termination and pay only when they are used.
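
To illustrate point 1: a minimal sketch of such a thin wrapper, assuming the databricks-sql-connector package (3.x, which supports native :name parameters); the environment variable names, the single_customer_view table, and its customer_id column are placeholders for illustration:

import os

from databricks import sql

def get_customer(customer_id: str):
    # Connect to a Databricks SQL warehouse; hostname, HTTP path, and
    # access token are read from the environment (placeholder names)
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as connection:
        with connection.cursor() as cursor:
            # Parameterized lookup against the gold-layer table
            cursor.execute(
                "SELECT * FROM single_customer_view WHERE customer_id = :id",
                {"id": customer_id},
            )
            return cursor.fetchall()

And for point 3, the copy into Azure SQL DB can be done with Spark's built-in JDBC writer at the end of each transformation run; the server, database, table, and secret scope/key names below are hypothetical:

jdbc_url = (
    "jdbc:sqlserver://<your server>.database.windows.net:1433;"
    "database=<your database>"
)

# Overwrite the operational copy on each run; credentials come from a
# Databricks secret scope (hypothetical scope and key names)
(spark.table("gold.single_customer_view")
    .write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.single_customer_view")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .mode("overwrite")
    .save())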

Upvotes: 3

Mark Derry

Reputation: 41

There is a new release that now makes option 1 possible.

Here is the documentation: https://docs.databricks.com/sql/api/sql-execution-tutorial.html

Here is a Python example:

import os

import requests

# Databricks personal access token and workspace host
db_token = os.environ.get('DATABRICKS_TOKEN')
db_host = "https://<your databricks host>.databricks.com"

# SQL warehouse to run the statement on, and the statement itself
warehouse_id = '<your sql warehouse id>'
query = "SELECT * FROM some_table;"

# Submit the statement to the SQL Statement Execution API
query_resp = requests.post(
    db_host + "/api/2.0/sql/statements",
    headers={"Authorization": "Bearer " + db_token},
    json={"statement": query, "warehouse_id": warehouse_id},
)
print(query_resp.status_code)

# First row of the result (only present once the statement has succeeded)
print(query_resp.json()['result']['data_array'][0])
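
Note that the POST can return while the statement is still running (the default wait_timeout is short), in which case there is no result yet. A hedged sketch of polling the statement status until completion, reusing db_host and db_token from above:

import time

statement_id = query_resp.json()['statement_id']
state = query_resp.json()['status']['state']

# Poll until the statement leaves the PENDING/RUNNING states
while state in ('PENDING', 'RUNNING'):
    time.sleep(2)
    query_resp = requests.get(
        db_host + "/api/2.0/sql/statements/" + statement_id,
        headers={"Authorization": "Bearer " + db_token},
    )
    state = query_resp.json()['status']['state']

if state == 'SUCCEEDED':
    print(query_resp.json()['result']['data_array'][0])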

Upvotes: 2
