Balaji Venkatachalam
Balaji Venkatachalam

Reputation: 363

How to access Polars Dataframe containing Struct Type column fields using SQLContext?

I'm trying to execute SQL query on Polars dataframes using SQLContext and below is my code:

 ctx = pl.SQLContext().register_many(
        {"tbl1": df_source, "tbl2": df_reference})

src_df = ctx.execute(pl_sql_query, eager=True)

Here the schema of df_source contains a column named json_message of type Struct with Key Value pairs i.e.,

('json_message', Struct({'id': Int64, 'name': String, 'age': Int64, 'dob': String}))

My sql query to access the struct field is:

pl_sql_query = 
"select json_message.id as id, json_message.name as name 
from tbl1 
where json_message.id in (select id from tbl2)"

When i execute this query, i'm getting an exception no table or alias named 'json_message' found

Not sure how exactly we need to access the struct field value. Tried struct.with_fields but not able to access the value.

Can someone please help me on this?

Upvotes: 4

Views: 541

Answers (2)

alexander-beedie
alexander-beedie

Reputation: 881

Thanks for making the issue; I have just landed a PR that addresses this, enabling Polars struct-field access via the SQL interface: https://github.com/pola-rs/polars/pull/17109 👍

Upvotes: 6

Henry Harbeck
Henry Harbeck

Reputation: 1273

Unfortunately this doesn't look to be possible in Polars at the moment. However, the SQL functionality is being quite actively developed, so I would think this would be an accepted feature request. I will make it over the next few days if you don't get there first.

EDIT: I see that you've created a feature request and already mentioned there that you are using DuckDB as a workaround.

As for a workaround until (hopefully) this is implemented natively in Polars, DuckDB is probably the best option.

import duckdb
import polars as pl

df_source = pl.DataFrame({
    "json_message": {
        "id": 1,
        "name": "Manuel Neuer",
        "age": 38,
        "dob": "1986-03-27",
    }
})
df_reference = pl.DataFrame({"id": 1})

# I updated the SQL to refer to `df_source` and `df_reference`
# as DuckDB checks variables in scope. `.pl()` converts back to Polars.
duckdb.sql(
    """
    select json_message.id as id, json_message.name as name 
    from df_source 
    where json_message.id in (select id from df_reference)
    """
).pl()
# ┌─────┬──────────────┐
# │ id  ┆ name         │
# │ --- ┆ ---          │
# │ i64 ┆ str          │
# ╞═════╪══════════════╡
# │ 1   ┆ Manuel Neuer │
# └─────┴──────────────┘

P.S. Including a fully reproducible example will save people having to come up with dummy data and will make it even easier to help you!

Upvotes: 3

Related Questions