Reputation: 2238
I am trying to set up a column masking framework which allows enabling / disabling masking of column contents through a tag-based approach.
Each "relevant" column gets a tag indicating whether it's supposed to be masked or not. The tag also functions as an enable / disable toggle for masking: I can simply change the tag value to Y/N to indicate whether masking is enabled at that point in time, other factors notwithstanding.
In addition to this, I set up a masking function on the relevant columns which, among other things, checks whether the masking tag is ON or not.
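For concreteness, the setup looks roughly along these lines (table, column, tag and function names are just placeholders for illustration):
-- tag the column; the tag value (Y/N) is the masking on/off toggle
ALTER TABLE dbrx.default.customers ALTER COLUMN ssn SET TAGS ('mask_enabled' = 'Y');
-- masking function; ideally it would check the mask_enabled tag of the
-- column it is attached to, but inside the function I don't know which
-- table / column that is
CREATE OR REPLACE FUNCTION dbrx.default.mask_if_enabled(val STRING)
RETURNS STRING
RETURN CASE WHEN true THEN '***MASKED***' ELSE val END;  -- placeholder condition
ALTER TABLE dbrx.default.customers ALTER COLUMN ssn SET MASK dbrx.default.mask_if_enabled;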
What I am seeing is that the column masking function doesn't really know "its own context" - in other words, while the function is executing it is aware of a few things, such as current_catalog() and current_schema(), but it doesn't know the table / column combination for which it was triggered. I need the table and column names as well (so that I can look up the tag names/values for that column).
So far it seems that the Databricks landscape doesn't allow getting that from "context" within function execution. I also can't pass the context in from the function call (i.e. from the column DDL spec) unless I define everything as a column in that table - not the cleanest way.
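To illustrate the "define everything as a column" workaround I mean: the extra parameters of a column mask have to be other columns of the same table, e.g. a hypothetical mask_flag column, which is what I'd like to avoid:
-- hypothetical: the toggle has to live in a mask_flag column of the table itself
CREATE OR REPLACE FUNCTION dbrx.default.mask_with_flag(val STRING, mask_flag STRING)
RETURNS STRING
RETURN CASE WHEN mask_flag = 'Y' THEN '***MASKED***' ELSE val END;
ALTER TABLE dbrx.default.customers ALTER COLUMN ssn
  SET MASK dbrx.default.mask_with_flag USING COLUMNS (mask_flag);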
I might be overreaching the capabilities of the tool here, but my use case is to get a generic data masking framework in place - one which allows a relatively simple toggle for masking of columns with sensitive information.
Thoughts?
Ideas?
Upvotes: 1
Views: 286
Reputation: 3250
As you mentioned, you need a column masking framework which allows enabling / disabling masking of column contents through a tag-based approach.
I created a table that stores information about each column and whether it is sensitive:
CREATE TABLE column_tags (
    table_name STRING,
    column_name STRING,
    is_sensitive BOOLEAN
);
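For the example below I assumed the customers table has email and ssn marked as sensitive; sample rows along these lines (adjust to your own tables):
INSERT INTO column_tags VALUES
  ('customers', 'email', true),
  ('customers', 'ssn',   true);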
I have tried the below approach:
from pyspark.sql import functions as F

def mask_columns(schema_name, table_name, enable_masking=True):
    # Get the column names of the table
    columns = spark.sql(f"SHOW COLUMNS IN {schema_name}.{table_name}").collect()
    column_names = [col["col_name"] for col in columns]

    # Get sensitivity metadata for this table and build a lookup dictionary
    column_tags = spark.sql(
        f"SELECT column_name, is_sensitive FROM column_tags WHERE table_name = '{table_name}'"
    )
    sensitive_columns = {row["column_name"]: row["is_sensitive"] for row in column_tags.collect()}

    # Get the data types of the columns using DESCRIBE
    describe_df = spark.sql(f"DESCRIBE {schema_name}.{table_name}")
    column_types = {
        row["col_name"]: row["data_type"]
        for row in describe_df.collect()
        if "col_name" in row and "data_type" in row
    }

    # Build the SELECT expressions with the masking logic
    select_expr = []
    for col_name in column_names:
        is_sensitive = sensitive_columns.get(col_name, False)
        if is_sensitive and enable_masking:
            # Mask based on the column type
            col_type = column_types.get(col_name, "string")
            if col_type == "string":
                select_expr.append(F.lit("MASKED").alias(col_name))
            else:
                select_expr.append(F.lit(None).cast(col_type).alias(col_name))
        else:
            select_expr.append(F.col(col_name))

    # Apply the masking logic using the select expressions
    df = spark.table(f"{schema_name}.{table_name}").select(*select_expr)
    return df
df_masked = mask_columns("dbrx.default", "customers", enable_masking=True)
df_masked.show()
+-----------+-------+------+------+-----------+
|customer_id| name| email| ssn| address|
+-----------+-------+------+------+-----------+
| 3|narayan|MASKED|MASKED|789 Pine St|
| 1| dilip|MASKED|MASKED| 123 Elm St|
| 2| raj|MASKED|MASKED| 456 Oak St|
+-----------+-------+------+------+-----------+
The approach works as follows:
- Get the column names of the table.
- Get the sensitivity metadata for this table from the column_tags table and build a dictionary mapping sensitive columns.
- Get the data types of the columns using DESCRIBE and collect the column names and data types into a dictionary.
- Generate the SELECT expressions with the masking logic: if a column is marked as sensitive in the metadata, mask it based on its column type; otherwise keep it as-is.
- Apply the masking logic by selecting those expressions from the table.
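To toggle masking for a single column on or off, you can flip its flag in column_tags (assuming it is a Delta table) and re-run mask_columns, or pass enable_masking=False to disable masking entirely. For example, something like:
-- disable masking for the email column only
UPDATE column_tags
SET is_sensitive = false
WHERE table_name = 'customers' AND column_name = 'email';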
Upvotes: 0