Reputation: 33
Prerequisites: I'm collecting large amounts of data in CSV files with two columns. For storage and speed, I'm converting them to Parquet.
What I'm trying to achieve:
import polars as pl

df = pl.scan_parquet('big_file.pq')
results = []
htmls = df.select(["html"]).collect(streaming=True)
counter = 0
for item in htmls:
    counter += 1
    if counter == 3:
        break
    result = parser(item)
    results.append(result)
With this code I end up with a Series in my htmls variable, and I don't know how to iterate through it; I searched the docs but unfortunately couldn't find a solution.
Demo CSV (the CSV that I'm converting to Parquet before parsing)
Upvotes: 0
Views: 1755
Reputation: 1914
If I understand your need correctly, you can use the map_elements function for this.
Example:
import polars as pl

df = pl.scan_csv('test_5_lines.csv')

def html_udf(html_string):
    # return a dict so the resulting column is a struct that can be unnested
    return {
        'a': html_string[:5],
        'b': html_string[5:10],
        'c': html_string[4:12],
    }

(
    df.select(
        pl.col('html').map_elements(html_udf))
    .unnest('html')              # turn the struct fields into columns a, b, c
    .sink_csv('test_5_lines_result.csv')
)
# Here is the content of the resulting CSV file
a,b,c
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
Update: example with UDF returning a dict.
Upvotes: 2