Reputation: 31
I following is my JSON Object for DLP API call to mask specific column of data on parquet file which is on a bucket on GCS. While calli dlp.deidentify_content() method i have to pass item to it, not sure how to pass parquet file, i have already mentioned parquet file path.
inspect_config = {
'info_types': info_types,
'custom_info_types': custom_info_types,
'min_likelihood': min_likelihood,
'limits': {'max_findings_per_request': max_findings},
}
actions = [{
'saveFindings': {
'outputConfig': {
'table': {
'projectId': project,
'datasetId': 1,
'tableId': "result1"
}
}
}
}]
# Construct a storage_config containing the file's URL.
url = 'gs://{}/{}'.format(bucket, filename)
storage_config = {
'cloud_storage_options': {
'file_set': {'url': url}
}
}
# Construct deidentify configuration dictionary
deidentify_config = {
"recordTransformations": {
"fieldTransformations": [
{
"fields": [
{
"name": "IP-address"
}
],
"primitiveTransformation": {
"cryptoHashConfig": {
"cryptoKey": {
"transient": {
"name": "[TRANSIENT-CRYPTO-KEY-1]"
}
}
}
}
},
{
"fields": [
{
"name": "comments"
}
],
"infoTypeTransformations": {
"transformations": [
{
"infoTypes": [
{
"name": "PHONE_NUMBER"
},
{
"name": "EMAIL_ADDRESS"
},
{
"name": "IP_ADDRESS"
}
],
"primitiveTransformation": {
"cryptoHashConfig": {
"cryptoKey": {
"transient": {
"name": "[TRANSIENT-CRYPTO-KEY-2]"
}
}
}
}
}
]
}
}
]
}
}
# Call the API
response = dlp.deidentify_content(
parent, inspect_config=inspect_config,
deidentify_config=deidentify_config, item=item)
What i am trying to accomplish is to mask parquet file which is on GCS bucket and mask few column and the stored the masked parquet file as table on BigQuery table.
Upvotes: 1
Views: 1920
Reputation: 806
Parquet files are currently scanned as binary objects, as the system does not parse them smartly yet. In the V2 api the supported file types are listed here
What you can do is load your parquet file from a bucket into bigquery as documented in this guide and then parse the data from bigquery with DLP API
Upvotes: 1