alonisser

Reputation: 12078

Databricks: Removing cluster logs and revisions on root DBFS on cron

While investigating high Databricks expenses, I discovered, surprisingly, that a large share of the cost came from an auto-created storage account with GRS replication to another zone, containing tons of log files (TBs of data). For example:

dbutils.fs.ls('dbfs:/cluster-logs')
dbfs:/cluster-logs/1129-093452-heard78

How can I automate removing this data on a daily basis without removing the logs from the last day or so?
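One way to sketch the daily cleanup is a scheduled Databricks notebook that lists `dbfs:/cluster-logs` and deletes directories older than a cutoff. This is a minimal sketch, not a documented recipe: the retention value and the use of `modificationTime` (exposed on `dbutils.fs.ls` results in newer runtimes) are assumptions, and the `dbutils` calls are shown only as comments.

```python
import time

RETENTION_DAYS = 1  # assumption: keep roughly the last day of logs

def dirs_to_delete(entries, retention_days=RETENTION_DAYS, now_ms=None):
    """Given (path, modification_time_ms) pairs, return the paths older than the cutoff."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff = now_ms - retention_days * 24 * 60 * 60 * 1000
    return [path for path, mtime_ms in entries if mtime_ms < cutoff]

# In a daily-scheduled notebook this would be driven by dbutils, roughly:
# entries = [(f.path, f.modificationTime) for f in dbutils.fs.ls("dbfs:/cluster-logs")]
# for path in dirs_to_delete(entries):
#     dbutils.fs.rm(path, recurse=True)
```

The pure helper keeps the cutoff logic testable outside a cluster; only the two commented `dbutils` lines touch DBFS.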

Also, how can I send those logs somewhere else (if I want to)?

Upvotes: 3

Views: 732

Answers (1)

Alex Ott

Reputation: 87174

One solution would be to create a separate storage account without the GRS option for logs only, and set a retention period on its files for a specific amount of time, e.g. several days. This storage account should be mounted, and the cluster log location changed to point to that mount. You can enforce that via cluster policies, for example.
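Enforcing the log destination via a cluster policy could look roughly like the fragment below. The attribute paths (`cluster_log_conf.type`, `cluster_log_conf.path`) are standard policy fields, but the mount path `dbfs:/mnt/cluster-logs` is a hypothetical name for the mounted non-GRS account:

```json
{
  "cluster_log_conf.type": {
    "type": "fixed",
    "value": "DBFS"
  },
  "cluster_log_conf.path": {
    "type": "fixed",
    "value": "dbfs:/mnt/cluster-logs"
  }
}
```

With `"type": "fixed"`, users cannot override the log destination on clusters created under this policy.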

Cluster logs can be sent to Azure Log Analytics using the spark-monitoring library from Microsoft (see the official docs for more details). If you want to send them somewhere else, you can set up init scripts (cluster-scoped or global) and use a specific client to ship the logs wherever you need.
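A shipping init script could be sketched like this. Everything here is an assumption for illustration: the source path `/databricks/driver/logs`, the destination mount, and the 5-minute interval are not guaranteed by Databricks, and a real setup would more likely install a proper log agent:

```shell
#!/bin/bash
# Sketch of a cluster init script that periodically copies driver logs
# to a mounted external location. Paths and interval are assumptions.

ship_logs() {
  local src="$1" dst="$2"
  mkdir -p "$dst"
  # copy only files that are new or updated since the last run
  cp -u "$src"/*.log "$dst"/ 2>/dev/null || true
}

# In the init script this would run in the background, e.g.:
# while true; do
#   ship_logs /databricks/driver/logs "/dbfs/mnt/external-logs/$DB_CLUSTER_ID"
#   sleep 300
# done
```

Keeping the copy logic in a function makes the loop trivial and the behavior easy to check in isolation.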

Upvotes: 2
