Reputation: 33
Prerequisites: I'm collecting large amounts of data in CSV files with two columns. For storage and speed, I'm converting them to Parquet.
What I'm trying to achieve:
import polars as pl

df = pl.scan_parquet('big_file.pq')
results = []
htmls = df.select(["html"]).collect(streaming=True)
counter = 0
for item in htmls:
    counter += 1
    if counter == 3:
        break
    result = parser(item)
    results.append(result)
With this code I end up with a Series in my htmls variable, and I don't know how to iterate through it; I searched the docs but unfortunately couldn't find a solution.
Demo CSV (the CSV that I'm converting to Parquet before parsing)
Upvotes: 0
Views: 1755
Reputation: 1914
If I understand your need correctly, you can use the map_elements function for this.
Example:
import polars as pl

df = pl.scan_csv('test_5_lines.csv')

def html_udf(html_string):
    # return a dict so the resulting column is a struct that can be unnested
    return {
        'a': html_string[:5],
        'b': html_string[5:10],
        'c': html_string[4:12],
    }

(
    df.select(
        pl.col('html').map_elements(html_udf))
    .unnest('html')              # turn the struct fields into columns a, b, c
    .sink_csv('test_5_lines_result.csv')
)
# Here is the content of the resulting CSV file
a,b,c
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
CgoKC,goKIC,CgoKICA8
Update: example with UDF returning a dict.
Upvotes: 2