Reputation: 161
How can we select first (or any) non-null struct from polars dataframe to new column?
Input data:
structs_a = [
{"a_key_1": 1, "a_key_2": "a" },
{"a_key_1": None, "a_key_2": None }
]
structs_b = [
{"b_key_1": None, "b_key_2": None },
{"b_key_1": "b", "b_key_2": 2 }
]
df = pl.DataFrame([structs_a, structs_b])
df
┌─────────────┬─────────────┐
│ column_0 ┆ column_1 │
│ --- ┆ --- │
│ struct[2] ┆ struct[2] │
╞═════════════╪═════════════╡
│ {1,"a"} ┆ {null,null} │
│ {null,null} ┆ {"b",2} │
└─────────────┴─────────────┘
What I've tried so far: Concatenating values to list and dropping nulls, but for reasons I yet have to understand this results in list of struct[4]
df.with_columns(
pl.concat_list(
"column_0",
"column_1",
).list.drop_nulls().alias('final_column')
)
My result:
shape: (2, 3)
┌─────────────┬─────────────┬─────────────────────────────────┐
│ column_0 ┆ column_1 ┆ final_column │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ list[struct[4]] │
╞═════════════╪═════════════╪═════════════════════════════════╡
│ {1,"a"} ┆ {null,null} ┆ [{1,"a",null,null}, {null,null… │
│ {null,null} ┆ {"b",2} ┆ [{null,null,null,null}, {null,… │
└─────────────┴─────────────┴─────────────────────────────────┘
Desirable result:
┌─────────────┬─────────────┬─────────────────┐
│ column_0 ┆ column_1 ┆ final_column │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═════════════╪═════════════╪═════════════════╡
│ {1,"a"} ┆ {null,null} ┆ {1,"a"} │ <- Non-null value from column_0
│ {null,null} ┆ {"b",2} ┆ {"b",2} │ <- Non-null value from column_1
└─────────────┴─────────────┴─────────────────┘
UPD: Probably should experiment with coalesce but direct implementation did not result as desired:
df.with_columns(
pl.coalesce(
"column_0",
"column_1",
).alias('final_column')
)
shape: (2, 3)
┌─────────────┬─────────────┬───────────────────────┐
│ column_0 ┆ column_1 ┆ final_column │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ struct[4] │
╞═════════════╪═════════════╪═══════════════════════╡
│ {1,"a"} ┆ {null,null} ┆ {1,"a",null,null} │
│ {null,null} ┆ {"b",2} ┆ {null,null,null,null} │
└─────────────┴─────────────┴───────────────────────┘
Upvotes: 5
Views: 3950
Reputation: 10464
Since the last answer was posted, pl.coalesce
was implemented to work with struct columns, which is likely the idiomatic solution here.
import polars as pl
df = pl.DataFrame({
"a": [{"a_key_1": 1, "a_key_2": 2}, {"a_key_1": None, "a_key_2": None}],
"b": [{"b_key_1": None, "b_key_2": None}, {"b_key_1": "x", "b_key_2": "y"}],
})
df.with_columns(
pl.coalesce("a", "b").alias("c")
)
shape: (2, 3)
┌─────────────┬─────────────┬─────────────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ struct[4] │
╞═════════════╪═════════════╪═════════════════════╡
│ {1,2} ┆ {null,null} ┆ {1,2,null,null} │
│ {null,null} ┆ {"x","y"} ┆ {null,null,"x","y"} │
└─────────────┴─────────────┴─────────────────────┘
Note. In your example, the struct fields in columns a
and b
have different names. Therefore, the outcome column has 4 fields. If this is not desired, the struct fields can be made consistent first. Then, the resulting struct will also only have 2 fields.
Upvotes: 0
Reputation: 236
This question so concerning to dictionaries. I have been testing different ways to create a solution for your question, i got a fast solution that works well where i get the result you are searching for.
If you want to filter by null values, as you are using dictionaries you have to go deep into the keys and values of that dictionaries because if you are checking for nulls or nas values you´ll check the rows values.
A way that is going to works for sure is to iterate the dataframe and check the values inside of it, you can do it in some different ways i used shape
function.
new_column = [] #Define the array variable
for i in range(df.shape[0]): #iterate over rows
for j in range(df.shape[1]): #iterate over columns
value = df.at[i, j] # get cell value
values = value.values() # get the values from the cell value (Inside dictionary value)
if not None in values: # check if there is some None values in the cell
new_column.append(value) #Add the cell value into the array
df['final_column'] = new_column #Create the new column using the array with new values
0 1 final_column
0 {'a_key_1': 1, 'a_key_2': 'a'} {'a_key_1': None, 'a_key_2': None} {'a_key_1': 1, 'a_key_2': 'a'}
1 {'b_key_1': None, 'b_key_2': None} {'b_key_1': 'b', 'b_key_2': 2} {'b_key_1': 'b', 'b_key_2': 2}
Iterate by the DF in a different way
new_column = [] #Define the array variable
for rowIndex, row in df.iterrows(): #iterate over rows
for columnIndex, value in row.items(): # iterate over columns
values = value.values() # get the values from the cells
if not None in values: # check if there is some None values in the cell
new_column.append(value) #Add the cell value into the array
df6['final_column'] = new_column #Create the new column using the array with new values
This method give the same corret result.
Upvotes: 1