Paul Mazzuca
Paul Mazzuca

Reputation: 637

Python BigTable Client "And" Filter that will return all cells

I cannot seem to figure out how to create the correct filters to achieve what would be equivalent to the following MySQL query.

select * from TABLE where age >= 10 AND height < 60

In Python BigTable terms, my first thought is to use the snippet below, but it returns no results. Of note, I DO get results whenI use either filter alone, but when combining them I get no results.

Any help would be greatly appreciated.

# create the age filter
age_filt = btf.RowFilterChain([                                         
     btf.ColumnQualifierRegexFilter(COL_AGE),       
     btf.ValueRangeFilter(                                         
         start_value=struct.pack("Q", 10) )                                         
 ])

# create the height filter
height_filt = btf.RowFilterChain([                                         
     btf.ColumnQualifierRegexFilter(COL_HEIGHT),       
     btf.ValueRangeFilter(                                         
         end_value=struct.pack("Q", 60) )                                         
 ])

# THIS filter results in no results being returned
val_chain = btf.RowFilterChain([
    age_fil, height_filt
]) 

# this is a trick to get all the fields
# Maybe there is a better way
cond = btf.ConditionalRowFilter(                  
        base_filter=val_chain,                        
        true_filter=btf.PassAllFilter(True))          

# Read full table                                 
partial_rows = tbl.read_rows(                     
    start_key=start_key,                          
    end_key=end_key,                              
    filter_=cond                                  
    )                                                 

Upvotes: 1

Views: 1284

Answers (2)

Douglas McErlean
Douglas McErlean

Reputation: 582

The problem is that you're using a chain. Your age_filt will output an element from the age column which will then be filtered out by height_filt, because it only accepts elements from the height column.

To simulate an AND you need to do something a bit more arcane. Please excuse the pseudocode:


    Condition(
      filter=Chain(
        // Return exactly one item for each filter which passes
        Union(
          // Return exactly one item if the age filter passes
          Chain(Column(COL_AGE), ValueRange(>=10), CellsColumnLimit(1)),
          // Return exactly one item if the height filter passes
          Chain(Column(COL_HEIGHT), ValueRange(<60), CellsColumnLimit(1)),
        ),
        // Skip the first element, so we only emit something if both filters pass
        CellsRowOffset(1)),
      true=PassAll())

You can use a similar pattern for any sort of AND construct you might want to perform. This is, of course, a huge hassle. The reason things work this way in Bigtable is that the filter language is designed to be fully streamable. In other words, your nodes can compute this AND statement while holding only one cell in memory at a time, rather than buffering the whole row. Condition filters are the sole exception here, which is also why they tend to be slow.

Unrelated, but when doing a ValueRangeFilter on numbers always make sure to use fixed-width encoding. So, for example, if an age of 100 is possible, you should encode ages like 006, 052, etc, to make sure the filter works properly.

Upvotes: 2

Ramesh Dharan
Ramesh Dharan

Reputation: 895

I believe the issue here is that the ColumnQualifierRegexFilter only returns the cells from matching columns, not the entire row.

That's why the ConditionalRowFilter is needed. See also: How to get filtered data from Bigtable using Python?

Note that creating complex filter chains can negatively impact the performance of your queries and some filters are not as performant as others. In some cases simply retrieving a wider swath of data and filtering on the client side can yield better performance. Also if you find yourself needing to construct very complex SQL like queries consider whether Cloud Spanner or BigQuery may be a better fit for your use case.

Upvotes: 0

Related Questions