Reputation: 12078
While investigating high Databricks expenses, I was surprised to discover that much of the cost comes from an auto-created storage account with GRS replication to another region, holding tons of log files (TBs upon TBs of data). For example:
dbutils.fs.ls('dbfs:/cluster-logs')
dbfs:/cluster-logs/1129-093452-heard78
How can I automate removing this data on a daily basis without removing the logs from the last day or so?
Also, how can I send those logs somewhere else (if I want to)?
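For the first question, I'm thinking of scheduling something like this as a daily notebook job, but I'm not sure it's the right approach (a rough sketch; I'm assuming FileInfo exposes modificationTime, which newer runtimes do, and that the layout is dbfs:/cluster-logs/<cluster-id>/... as shown above):

import time

# dbutils is predefined in Databricks notebooks
RETENTION_MS = 24 * 60 * 60 * 1000              # keep roughly the last day of logs
cutoff = int(time.time() * 1000) - RETENTION_MS

def delete_old(path):
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            delete_old(entry.path)               # recurse into per-cluster / driver / executor dirs
        elif entry.modificationTime < cutoff:    # modificationTime is in ms since epoch
            dbutils.fs.rm(entry.path)            # remove individual log files older than the cutoff

delete_old('dbfs:/cluster-logs')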
Upvotes: 3
Views: 732
Reputation: 87174
One solution is to create a separate storage account (without the GRS option) just for logs, and configure a retention/lifecycle rule that deletes files after a specific amount of time, e.g. several days. Mount that storage account and change the cluster log destination to point to the mount. You can enforce that via cluster policies, for example.
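A minimal sketch of the mount, assuming a dedicated non-GRS account named clusterlogsacct with a cluster-logs container and its key stored in a secret scope (all names are placeholders):

# Mount the dedicated log container; run once from a notebook
dbutils.fs.mount(
    source='wasbs://cluster-logs@clusterlogsacct.blob.core.windows.net',
    mount_point='/mnt/cluster-logs',
    extra_configs={
        'fs.azure.account.key.clusterlogsacct.blob.core.windows.net':
            dbutils.secrets.get(scope='storage', key='clusterlogsacct-key')
    },
)

Then point the cluster log destination (Cluster UI -> Advanced Options -> Logging) at dbfs:/mnt/cluster-logs, let a lifecycle management rule on the storage account expire blobs after a few days, and, if you want to enforce it, fix cluster_log_conf.dbfs.destination to that path in a cluster policy.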
Cluster logs can be sent to Azure Log Analytics using the spark-monitoring library from Microsoft (see the official docs for more details). If you want to send them somewhere else, you can set up init scripts (cluster-scoped or global) and use a specific client to ship the logs wherever you need.
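As a rough illustration of the init-script route, you could register a cluster-scoped script from a notebook and extend its body to install and configure whatever shipping client you use (the path and script contents below are placeholders):

# Write the init script to DBFS; reference it under Advanced Options -> Init Scripts
dbutils.fs.put(
    'dbfs:/databricks/init-scripts/ship-cluster-logs.sh',
    '''#!/bin/bash
# Placeholder: install and start a log-shipping agent here,
# e.g. one that tails /databricks/driver/logs and forwards it to your log backend.
echo "log shipping init script ran" >> /tmp/ship-cluster-logs.log
''',
    overwrite=True,
)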
Upvotes: 2