Reputation: 661
I am trying to populate a new field containing a struct of all of the other fields from another lazyframe based on a predicate.
While the examples are in python, I am open to answers in python or rust.
companies = pl.DataFrame({
"id": [1],
"name": ["google"],
"industry": [1001]
}).lazy()
industries = pl.DataFrame({
"id": [1001],
"name": ["tech"],
"sectors": [[10011, 10012]]
}).lazy()
expected = pl.DataFrame({
"id": [1],
"name": ["polars"],
"industry": [{
"name": "tech",
"sectors": [[10011, 10012]]
}]
})
I can do this naively through join -> select -> to_struct -> lit, but that is not ideal as I have to perform a collect on the lazyframe to get my desired results.
right_columns = industries.schema.keys()
industry_matches = companies.join(
industries,
left_on="industry",
right_on="id",
how="inner"
).collect().select(right_columns).drop("id").to_struct("industry")
df = companies.with_columns(pl.lit(industry_matches))
df.collect()
Ideally, i'd like to do something like a join within an expression like so.
companies.select(
pl.all(),
pl.col("industry").join(industries, on="id").exclude("id"),
)
but I am open to any alternatives that do not require a collect
Upvotes: 4
Views: 1331
Reputation:
Would this work for you:
(
companies
.join(
industries
.select(
pl.col('id').alias('industry'),
pl.struct(pl.exclude('id')).alias('industry_struct'),
),
on="industry",
how="inner"
)
.drop('industry')
.collect()
)
shape: (1, 3)
┌─────┬────────┬─────────────────────────┐
│ id ┆ name ┆ industry_struct │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ struct[2] │
╞═════╪════════╪═════════════════════════╡
│ 1 ┆ google ┆ {"tech",[10011, 10012]} │
└─────┴────────┴─────────────────────────┘
In essence, we are creating the struct within the join itself.
Upvotes: 2