Reputation: 110143
The intro to BQ BigLake mentions three features: row filtering, predicate pushdown, and fast scans. What exactly do these three things mean in practice?
I believe predicate pushdown is the ability to evaluate the WHERE expression directly on the data source. For example, if I were running the following query on a MySQL table:
SELECT * FROM table WHERE country = 'US'
predicate pushdown would mean taking the country = 'US' predicate and evaluating it at the native source, whereas without predicate pushdown (in this made-up example), it would involve running SELECT * FROM table -- copying all the data over to the processing server -- and then evaluating the predicate there. But if this is correct, how would predicate pushdown work on, for example, a CSV file?
Also, what would row filter evaluation be? It seems like predicate pushdown and row filter evaluation would be the same thing, but maybe one of these means something more along the lines of "column filter" -- i.e., only bringing over data from the columns that are needed.
And finally, what are 'fast scans'? I suppose this just means parallelized reading of multiple files, or of a file that can be read in parallel (such as a newline-delimited JSON file). Is the actual reading of the file any different than if I were, for example, to copy the file over with a CLI tool such as $ aws s3 cp s3://mybucket/myfile .?
Upvotes: 0
Views: 1096
Reputation: 7947
I will try to address each feature independently.
row filtering
Also, what would row filter evaluation be? It seems like predicate pushdown and row filter evaluation would be the same thing, but maybe one of these means something more along the lines of "column filter" -- i.e., only bringing over data from the columns that are needed.
The BigQuery security model lets you grant access at the row level based on filter conditions. Row filtering refers to the BigLake engine's capability to enforce those row-level access-control rules at runtime, while the data is being read, which is where your defined policies are applied. This is a useful strategy for data governance because you can define multiple rules on a single table depending on your needs, avoiding data duplication and the creation of unnecessary views.
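As a concrete illustration, such a rule is created with a CREATE ROW ACCESS POLICY DDL statement. A minimal sketch using the google-cloud-bigquery Python client (the policy, dataset, table, column, and group names here are all made up):

from google.cloud import bigquery

# Create a policy so that members of the (hypothetical) group only
# see rows where country = 'US'; the engine enforces this at read time.
client = bigquery.Client()
client.query(
    """
    CREATE ROW ACCESS POLICY us_only
    ON mydataset.mytable
    GRANT TO ('group:us-analysts@example.com')
    FILTER USING (country = 'US')
    """
).result()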
predicate pushdown
Predicate pushdown would mean taking the country = 'US' predicate and evaluating it at the native source, whereas without predicate pushdown (in this made-up example), it would involve running SELECT * FROM table -- copying all the data over to the processing server -- and then evaluating the predicate there. But if this is correct, how would predicate pushdown work on, for example, a CSV file?
I am not sure what you mean by applying the predicate (the WHERE clause or filter) at the source; as far as I know this is not possible, especially in a distributed/cloud environment where storage and compute are usually decoupled.
Predicate pushdown refers to the capability of applying the predicate in the BQ vectorized runtime instead of in the BigQuery Storage API layer or the client, which reduces the amount of data to be processed and sent over the network. This is also especially useful when you have complex queries involving joins, which would require more computation if the filters were applied after the join operations instead of before.
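One place where this pushdown becomes visible to a consumer is the BigQuery Storage Read API, which lets you attach a row filter and a column list to a read session so the scan itself applies them before any data crosses the network. A minimal sketch with the Python client (row_restriction and selected_fields are real API fields; the project, dataset, and table names are made up):

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

# The filter and the column list are pushed into the scan: only
# matching rows and requested columns are sent back to the client.
session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table="projects/my-project/datasets/mydataset/tables/mytable",
        data_format=types.DataFormat.ARROW,
        read_options=types.ReadSession.TableReadOptions(
            selected_fields=["country", "name"],
            row_restriction="country = 'US'",
        ),
    ),
    max_stream_count=1,
)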
fast scans
And finally, what are 'fast scans'? I suppose this just means parallelized reading of multiple files, or of a file that can be read in parallel (such as a newline-delimited JSON file). Is the actual reading of the file any different than if I were, for example, to copy the file over with a CLI tool such as $ aws s3 cp s3://mybucket/myfile .?
This refers to the fact that data can be consumed through the BigQuery Storage API instead of the "conventional" BigQuery API, which returns responses in a paginated fashion. More detail can be found in the BigQuery Storage Read API documentation. The advantage is that, as you mention, it reads the source data in parallel (which it will always do from distributed storage systems), but it also lets you process the data in a more efficient (distributed/parallel) way depending on how you consume it, e.g. with Dataflow, Apache Spark, or TensorFlow.
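As a sketch of what that parallel consumption can look like with the Python client (the project, dataset, and table names are made up; each stream could just as well be handed to a separate worker in Spark or Dataflow):

from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()
session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table="projects/my-project/datasets/mydataset/tables/mytable",
        data_format=types.DataFormat.ARROW,
    ),
    max_stream_count=4,  # ask for up to 4 parallel streams
)

def count_rows(stream):
    # Each stream is an independent, server-side partition of the scan.
    reader = client.read_rows(stream.name)
    return sum(page.num_items for page in reader.rows(session).pages)

# Read all streams concurrently instead of paging through one cursor.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_rows, session.streams))
print(total)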
Upvotes: 2