Krank

Reputation: 161

Polars: Select first (or any) non-null struct column value row-wise

How can we select the first (or any) non-null struct from a polars DataFrame into a new column?

Input data:

structs_a = [
    {"a_key_1": 1, "a_key_2": "a" },
    {"a_key_1": None, "a_key_2": None }
]
structs_b = [
    {"b_key_1": None, "b_key_2": None },
    {"b_key_1": "b", "b_key_2": 2 }
]
df = pl.DataFrame([structs_a, structs_b])
df
┌─────────────┬─────────────┐
│ column_0    ┆ column_1    │
│ ---         ┆ ---         │
│ struct[2]   ┆ struct[2]   │
╞═════════════╪═════════════╡
│ {1,"a"}     ┆ {null,null} │
│ {null,null} ┆ {"b",2}     │
└─────────────┴─────────────┘

What I've tried so far: concatenating the values into a list and dropping nulls, but for reasons I have yet to understand this results in a list of struct[4]:

df.with_columns(
    pl.concat_list(
        "column_0",
        "column_1",
    ).list.drop_nulls().alias('final_column')
)

My result:

shape: (2, 3)
┌─────────────┬─────────────┬─────────────────────────────────┐
│ column_0    ┆ column_1    ┆ final_column                    │
│ ---         ┆ ---         ┆ ---                             │
│ struct[2]   ┆ struct[2]   ┆ list[struct[4]]                 │
╞═════════════╪═════════════╪═════════════════════════════════╡
│ {1,"a"}     ┆ {null,null} ┆ [{1,"a",null,null}, {null,null… │
│ {null,null} ┆ {"b",2}     ┆ [{null,null,null,null}, {null,… │
└─────────────┴─────────────┴─────────────────────────────────┘

Desirable result:

┌─────────────┬─────────────┬─────────────────┐
│ column_0    ┆ column_1    ┆ final_column    │
│ ---         ┆ ---         ┆ ---             │
│ struct[2]   ┆ struct[2]   ┆ struct[2]       │
╞═════════════╪═════════════╪═════════════════╡
│ {1,"a"}     ┆ {null,null} ┆ {1,"a"}         │ <- Non-null value from column_0
│ {null,null} ┆ {"b",2}     ┆ {"b",2}         │ <- Non-null value from column_1
└─────────────┴─────────────┴─────────────────┘

UPD: I should probably experiment with coalesce, but a direct implementation did not produce the desired result:

df.with_columns(
    pl.coalesce(
        "column_0",
        "column_1",
    ).alias('final_column')
)
shape: (2, 3)
┌─────────────┬─────────────┬───────────────────────┐
│ column_0    ┆ column_1    ┆ final_column          │
│ ---         ┆ ---         ┆ ---                   │
│ struct[2]   ┆ struct[2]   ┆ struct[4]             │
╞═════════════╪═════════════╪═══════════════════════╡
│ {1,"a"}     ┆ {null,null} ┆ {1,"a",null,null}     │
│ {null,null} ┆ {"b",2}     ┆ {null,null,null,null} │
└─────────────┴─────────────┴───────────────────────┘

Upvotes: 5

Views: 3950

Answers (2)

Hericks

Reputation: 10464

Since the last answer was posted, pl.coalesce has been updated to work with struct columns, which is likely the idiomatic solution here.

import polars as pl

df = pl.DataFrame({
    "a": [{"a_key_1": 1, "a_key_2": 2}, {"a_key_1": None, "a_key_2": None}],
    "b": [{"b_key_1": None, "b_key_2": None}, {"b_key_1": "x", "b_key_2": "y"}],
})

df.with_columns(
    pl.coalesce("a", "b").alias("c")
)
shape: (2, 3)
┌─────────────┬─────────────┬─────────────────────┐
│ a           ┆ b           ┆ c                   │
│ ---         ┆ ---         ┆ ---                 │
│ struct[2]   ┆ struct[2]   ┆ struct[4]           │
╞═════════════╪═════════════╪═════════════════════╡
│ {1,2}       ┆ {null,null} ┆ {1,2,null,null}     │
│ {null,null} ┆ {"x","y"}   ┆ {null,null,"x","y"} │
└─────────────┴─────────────┴─────────────────────┘

Note: in your example, the struct fields in columns a and b have different names, which is why the output column has 4 fields. If this is not desired, make the struct field names consistent first; the resulting struct will then also have only 2 fields.

Upvotes: 0

mrCatlost

Reputation: 236

Introduction

This question concerns dictionaries. I have been testing different ways to build a solution for your question, and I found a quick approach that works well and produces the result you are looking for.

If you want to filter by null values, then since you are working with dictionaries you have to go down into the keys and values of those dictionaries, because checking for None/NA values means checking the values inside each row.

First approach

One way that is guaranteed to work is to iterate over the DataFrame and check the values inside it. You can do this in several ways; I used the shape attribute. Note that this code uses the pandas API, so it assumes the data is in a pandas DataFrame (e.g. pd.DataFrame([structs_a, structs_b])), not a polars one.

new_column = []  # collected non-null dictionaries, one per row

for i in range(df.shape[0]):  # iterate over rows
    for j in range(df.shape[1]):  # iterate over columns
        value = df.at[i, j]  # get the cell value (a dictionary)
        values = value.values()  # the values inside that dictionary
        if None not in values:  # keep only cells without None values
            new_column.append(value)

df['final_column'] = new_column  # create the new column from the collected values

OUTPUT:

    0   1   final_column
0   {'a_key_1': 1, 'a_key_2': 'a'}  {'a_key_1': None, 'a_key_2': None}  {'a_key_1': 1, 'a_key_2': 'a'}
1   {'b_key_1': None, 'b_key_2': None}  {'b_key_1': 'b', 'b_key_2': 2}  {'b_key_1': 'b', 'b_key_2': 2}


Second approach

Iterate over the DataFrame in a different way:

new_column = []  # collected non-null dictionaries, one per row

for rowIndex, row in df.iterrows():  # iterate over rows
    for columnIndex, value in row.items():  # iterate over columns
        values = value.values()  # the values inside the cell's dictionary
        if None not in values:  # keep only cells without None values
            new_column.append(value)

df['final_column'] = new_column  # create the new column from the collected values

This method gives the same correct result.

Upvotes: 1
