NeonOni

Reputation: 33

Parsing data from Polars LazyFrame

Pre-requisites: I'm collecting large amounts of data in CSV files with two columns. For storage and speed I'm trying to convert them to Parquet.

What I'm trying to achieve:

import polars as pl

df = pl.scan_parquet('big_file.pq')
results = []
htmls = df.select(["html"]).collect(streaming=True)
counter = 0
for item in htmls:
    counter += 1
    if counter == 3:
        break
    result = parser(item)
    results.append(result)

With this code I end up with a Series in my htmls variable, and I don't know how to iterate through it. I searched the docs but unfortunately couldn't find a solution.

Demo CSV (the CSV that I'm converting to Parquet before parsing)

Upvotes: 0

Views: 1755

Answers (1)

Luca

Reputation: 1914

If I understand your need correctly, you can use the map_elements function for this.

Example:

import polars as pl

df = pl.scan_csv('test_5_lines.csv')

def html_udf(html_string):
    return {
        'a': html_string[:5],
        'b': html_string[5:10],
        'c': html_string[4:12]
    }

(
    df.select(
        pl.col('html').map_elements(html_udf))
    .unnest('html')
    .sink_csv('test_5_lines_result.csv')
)

# Here is the content of the resulting CSV file
a,b,c
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8

Update: example with UDF returning a dict.

Upvotes: 2
