Reputation: 45
I could not find any helpful info on that so if anyone has some input.... please....
I need to split all string in a column in a big (ca 30 GB) csv file. For that I tried out polars. Seems to work fine but I dont understand how I can map the values from the list I get after splitting the string into separate columns with lazyframes. Any input?
What I have:
import polars as pl
# Create a LazyFrame
lf = pl.LazyFrame({"sample": ["ENST123456768.19", "ENST987654321.20", "ENST567890123.21"], "banananana": ["chiqita", "baaan", "banabana"]})
# Define a function to split the "sample" column and create new columns
lf = lf.with_columns(pl.struct(pl.col('sample').str.split('.'))).unnest('sample')
lf.sink_parquet('test.parquet')
test_Df = pl.read_parquet('test.parquet')
print(test_Df.head())
Output:
shape: (3, 2)
┌─────────────────────────┬────────────┐
│ sample ┆ banananana │
│ --- ┆ --- │
│ list[str] ┆ str │
╞═════════════════════════╪════════════╡
│ ["ENST123456768", "19"] ┆ chiqita │
│ ["ENST987654321", "20"] ┆ baaan │
│ ["ENST567890123", "21"] ┆ banabana │
└─────────────────────────┴────────────┘
What I try to do: the column "sample" should become 2 columns:
┌────────────────────┬────────────┬────────────┐
│ sample ┆ version ┆ banananana │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════════════════╪════════════╪════════════╡
│ ENST123456768 ┆ 19 │ chiqita │
│ ENST987654321 ┆ 20 │ baaan │
│ ENST567890123 ┆ 21 │ banabana │
└────────────────────┴────────────┴────────────┘
I thought how hard can it be to achieve with lazyframes. Yeah I already wasted more than 8 hours... So, its pretty complicated I guess? Any help is highly apreciated! Best regards
Upvotes: 2
Views: 1436
Reputation: 4152
You can achieve what you want using str.split_exact
and then unnest
:
import polars as pl
lf = pl.LazyFrame({"sample": ["ENST123456768.19", "ENST987654321.20", "ENST567890123.21"], "banananana": ["chiqita", "baaan", "banabana"]})
lf = lf.with_columns(pl.col('sample').str.split_exact('.', 1)).unnest("sample")
lf.sink_parquet('test.parquet')
test_Df = pl.read_parquet('test.parquet')
print(test_Df.head())
Output:
shape: (3, 3)
┌───────────────┬─────────┬────────────┐
│ field_0 ┆ field_1 ┆ banananana │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════════╪═════════╪════════════╡
│ ENST123456768 ┆ 19 ┆ chiqita │
│ ENST987654321 ┆ 20 ┆ baaan │
│ ENST567890123 ┆ 21 ┆ banabana │
└───────────────┴─────────┴────────────┘
You can then easily rename the columns if you need
As suggested in the comments here is the code to rename the columns and cast the version
column to uint:
...
lf.with_columns(pl.col('sample').str.split_exact('.', 1)).unnest("sample").rename({'field_0':'sample', 'field_1': 'version'}) # rename the columns
lf = lf.with_columns(pl.col('version').cast(pl.UInt8)) # cast version to uint
lf.sink_parquet('test.parquet')
...
Upvotes: 2