Reputation: 71
I am new to Google Cloud DLP. I ran a POST to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, also using cloudStorageOptions to save the .csv output.
The .parquet file is 53.93 MB.
When I make the API call on the .parquet file I get:
"processedBytes": "102308122",
"infoTypeStats": [{
"infoType": {
"name": "AMERICAN_BANKERS_CUSIP_ID"
},
"count": "1"
}, {
"infoType": {
"name": "IP_ADDRESS"
},
"count": "17"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "148"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "30"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "22"
}]
When I convert the .parquet file to .csv I get a 360.58 MB file. If I then make the API call on the .csv file I get:
"processedBytes": "377530307",
"infoTypeStats": [{
"infoType": {
"name": "CREDIT_CARD_NUMBER"
},
"count": "56546"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "372527"
}, {
"infoType": {
"name": "NETHERLANDS_BSN_NUMBER"
},
"count": "5"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "1331321"
}, {
"infoType": {
"name": "AUSTRALIA_TAX_FILE_NUMBER"
},
"count": "52269"
}, {
"infoType": {
"name": "PHONE_NUMBER"
},
"count": "28"
}, {
"infoType": {
"name": "US_DRIVERS_LICENSE_NUMBER"
},
"count": "114"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "141383"
}, {
"infoType": {
"name": "KOREA_RRN"
},
"count": "56144"
}],
Clearly, the scan of the .parquet file misses most of the infoTypes that the scan of the .csv file finds. For the .csv file, I verified that all email addresses were detected.
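To make the gap concrete, here is a minimal Python sketch that puts the two results side by side. The counts are copied from the two responses above; the helper names (`stat`, `to_counts`) are only illustrative and not part of any DLP client library.

```python
def stat(name, count):
    """Build one entry in the same shape as the infoTypeStats array."""
    return {"infoType": {"name": name}, "count": str(count)}


def to_counts(info_type_stats):
    """Map infoType name -> integer count (the API returns counts as strings)."""
    return {s["infoType"]["name"]: int(s["count"]) for s in info_type_stats}


# Values copied from the .parquet scan response.
parquet_stats = [
    stat("AMERICAN_BANKERS_CUSIP_ID", 1),
    stat("IP_ADDRESS", 17),
    stat("US_TOLLFREE_PHONE_NUMBER", 148),
    stat("EMAIL_ADDRESS", 30),
    stat("US_STATE", 22),
]

# Values copied from the .csv scan response.
csv_stats = [
    stat("CREDIT_CARD_NUMBER", 56546),
    stat("EMAIL_ADDRESS", 372527),
    stat("NETHERLANDS_BSN_NUMBER", 5),
    stat("US_TOLLFREE_PHONE_NUMBER", 1331321),
    stat("AUSTRALIA_TAX_FILE_NUMBER", 52269),
    stat("PHONE_NUMBER", 28),
    stat("US_DRIVERS_LICENSE_NUMBER", 114),
    stat("US_STATE", 141383),
    stat("KOREA_RRN", 56144),
]

parquet_counts = to_counts(parquet_stats)
csv_counts = to_counts(csv_stats)

# infoTypes reported by only one of the two scans.
missing_in_parquet = sorted(set(csv_counts) - set(parquet_counts))
only_in_parquet = sorted(set(parquet_counts) - set(csv_counts))

# infoTypes reported by both scans, with their counts side by side.
for name in sorted(set(parquet_counts) & set(csv_counts)):
    print(f"{name}: parquet={parquet_counts[name]} csv={csv_counts[name]}")
print("Only in CSV scan:", missing_in_parquet)
print("Only in Parquet scan:", only_in_parquet)
```

Even for the infoTypes that both scans report, the .parquet counts are several orders of magnitude lower than the .csv counts.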
I couldn't find any documentation on compressed formats such as Parquet, so I assume Google Cloud DLP doesn't support scanning them.
Any help would be greatly appreciated.
Upvotes: 3
Views: 875
Reputation: 995
Parquet files are currently scanned as binary objects, because the system does not yet parse the format. In the v2 API, the supported file types are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype
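Until Parquet is parsed natively, one workaround is to scan the converted .csv and restrict the job to text files. Below is a rough Python sketch of what a v2 request body might look like. The project ID and bucket path are hypothetical placeholders, and the field names (`inspectJob`, `storageConfig`, `cloudStorageOptions`, `fileSet`, `fileTypes`) are written from my reading of the v2 reference, so please check them against the page linked above before relying on them. The sketch only builds and prints the request; it does not send it.

```python
import json

# Hypothetical placeholders; substitute your own project and bucket.
PROJECT_ID = "my-project"
CSV_URL = "gs://my-bucket/converted/data.csv"


def build_inspect_job(url, info_types):
    """Build a request body that scans one Cloud Storage URL as text."""
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {
                    "fileSet": {"url": url},
                    # Assumption: TEXT_FILE is the right value for a .csv file.
                    "fileTypes": ["TEXT_FILE"],
                }
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_types],
            },
        }
    }


endpoint = f"https://dlp.googleapis.com/v2/projects/{PROJECT_ID}/dlpJobs"
body = build_inspect_job(CSV_URL, ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"])

# Print the request instead of sending it, so the sketch has no side effects.
print("POST", endpoint)
print(json.dumps(body, indent=2))
```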
Upvotes: 2