Reputation: 83
I'm trying to build a model in DBT that flattens out a struct with name properties
that contains about a hundred structs inside it (e.g. property1
, property2
, etc.), each with 5 different columns of which I want to extract one called value
. I could type properties.propertyX.value
100 times, but I figured I could try to find a way to loop through each struct within properties
and obtain propertyX.value
within a SELECT
statement by using Jinja, but I guess I'm either unfamiliar with the syntax or its limitations because I don't know what to do. I've tried:
WITH t as (
SELECT
properties
FROM
src
)
SELECT
{% for property in properties %}
{{property}}.value
{% endfor %}
{%- if not loop.last %},{% endif -%}
FROM
t
but I realized I have to set properties
as a variable and I don't really know how to do that in a way that it references the individual properties in the properties
struct. Anyway, I'm quite lost and if someone could help I would be so grateful.
Upvotes: 6
Views: 22524
Reputation: 662
Not sure if I followed 100% your data structure, but let's say it's similar to this:
{
"properties": {
"property1": {
"column1": "...",
"column2": "...",
"column3": "...",
"value": "my value 1.0"
},
"property2": {
"column1": "...",
"column2": "...",
"column3": "...",
"value": "my value 2.0"
},
"propertyX": {
"column1": "...",
"column2": "...",
"column3": "...",
"value": "my value 3.0"
}
}
}
As you mention, you need to use set
to create variables and to be able to manipulate your data. Personally, I like to create different variables to handle the query statement
, the query result
, and the query values
. So following this strategy you would have something like this:
{% set data_structure_query %}
select properties from src
{% endset %}
{% set results = run_query(data_structure_query) %}
{% set properties = results.columns[0].values() %}
Note that
results.columns[0].values()
will bring the data of the first column of your query which in this case it's theproperties
.
The .values()
get the values of the column as a tuple, where the items most of the time is defined as a string
. So in order to access the properties of your data, you will have to deserialize the json string into a Python object, e.g. dict
. To do that, you need to use the fromjson
method:
...
{% set properties = results.columns[0].values() %}
{% set properties_dict = fromjson(properties[0]) %}
...
Assuming your query return only one row with the JSON format, I specified the
properties[0]
to access the first row of the result query.
Before jumping to the next step, it's important to know that dbt has a jinja variable that informs us when dbt is in the "execute mode". That's a thing we need to worry about since it can raise issues to build our models. In short, any jinja that relies on a result being returned from the database will throw an error.
In your case, the results
variable is depending on a value that needs to be executed in the database which means if you just try to run the model most likely you'll get an issue with Compilation Error
. To avoid that, you need to add an if condition
to check if dbt is in "execute mode" or not:
...
{% set results = run_query(data_structure_query) %}
{% if execute %}
{% set properties = results.columns[0].values() %}
{% set properties_dict = fromjson(properties[0]) %}
{% else %}
{% set properties = [] %}
{% set properties_dict = [] %}
{% endif %}
...
Finally, you can proceed with a loop
to build your columns:
select
{%- for property in properties_dict.properties %}
{{ property }}.value
{%- if not loop.last %},{% endif -%}
{%- endfor %}
from
...
This will be compiled to:
select
property1.value,
property2.value,
propertyX.value
from
...
If you want to access the values for each column, then:
select
{%- for property in properties_dict.properties %}
'{{ properties_dict.properties[property].value }}'
{%- if not loop.last %},{% endif -%}
{%- endfor %}
from
...
Which will be compiled to:
select
'my value',
'my value 1.0',
'my value 2.0'
from
...
Might be worth taking a look in your database/warehouse and check if there is any internal function that handles semi-structured data. This can help you with the logic too. For instance, Snowflake has the lateral flatten
that does a similar behavior to split the properties into multiple rows.
For debug purposes, I recommend to compile
your model and use the logs ({{ log('my message', info=True) }}
) to understand how dbt/jinja is processing the data. Some of the code I provided might change depending on the output of your query.
https://docs.getdbt.com/reference/dbt-jinja-functions/run_query
https://docs.getdbt.com/reference/dbt-jinja-functions/execute
https://docs.getdbt.com/reference/dbt-jinja-functions/fromjson/
https://docs.getdbt.com/tutorial/using-jinja
Upvotes: 7
Reputation: 550
Assuming your data structure looks like this:
{
"properties": [
{
"value": "Value 1"
},
{
"value": "Value 2"
},
...
]
}
You just need to move the .value lookup into the variable delimiters: {{ property.value }}
Upvotes: 0