Reputation: 18446
Let's say I have a list of dataframes list this:
Ldfs=[
pl.DataFrame({'a':[1.0,2.0,3.1], 'b':[2,3,4]}),
pl.DataFrame({'b':[1,2,3], 'c':[2,3,4]}),
pl.DataFrame({'a':[1,2,3], 'c':[2,3,4]})
]
I can't do pl.concat(Ldfs)
because they don't all have the same columns and even the ones that have a
in common don't have the same data type.
What I'd like to do is concat them together but just add a column of Nones whenever a column isn't there and to cast columns to a fixed datatype.
For instance, just taking the first element of the list I'd like to have something like this work:
Ldfs[0].select(pl.when(pl.col('c')).then(pl.col('c')).otherwise(None).cast(pl.Float64).alias('c'))
of course, this results in ColumnNotFoundError: c
Upvotes: 2
Views: 687
Reputation:
Would an approach like this work for you. (I'll convert your DataFrames to LazyFrames for added fun.)
Ldfs = [
pl.DataFrame({"a": [1.0, 2.0, 3.1], "b": [2, 3, 4]}).lazy(),
pl.DataFrame({"b": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
pl.DataFrame({"a": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
]
my_schema = {
"a": pl.Float64,
"b": pl.Int64,
"c": pl.UInt32,
}
def fix_schema(ldf: pl.LazyFrame) -> pl.LazyFrame:
ldf = (
ldf.with_columns(
pl.col(col_nm).cast(col_type)
for col_nm, col_type in my_schema.items()
if col_nm in ldf.columns
)
.with_columns(
pl.lit(None, dtype=col_type).alias(col_nm)
for col_nm, col_type in my_schema.items()
if col_nm not in ldf.columns
)
.select(my_schema.keys())
)
return ldf
pl.concat([fix_schema(next_frame)
for next_frame in Ldfs], how="vertical").collect()
shape: (9, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ u32 │
╞══════╪══════╪══════╡
│ 1.0 ┆ 2 ┆ null │
│ 2.0 ┆ 3 ┆ null │
│ 3.1 ┆ 4 ┆ null │
│ null ┆ 1 ┆ 2 │
│ null ┆ 2 ┆ 3 │
│ null ┆ 3 ┆ 4 │
│ 1.0 ┆ null ┆ 2 │
│ 2.0 ┆ null ┆ 3 │
│ 3.0 ┆ null ┆ 4 │
└──────┴──────┴──────┘
Upvotes: 1