Reputation: 4864
I am trying to convert a csv file to parquet (I don't really care if it is done in python or command line, or...) In any case, this question addresses is, but the answers seem to require one to read the csv in first, and since in my case the csv is 17GB, this is not really feasible, so I would like some "offline" or streaming approach.
Upvotes: 5
Views: 1134
Reputation: 25
I successfully converted a 7GB+ (2.7 millions lines) CSV file into a parquet file, using csv2parquet.
The process is simple:
You can probably reproduce the process if you download our CSV export at https://world.openfoodfacts.org/data
# Not needed for you, just in case you want to reproduce
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
csvclean -t en.openfoodfacts.org.products.csv
# Generate the schema
./csv2parquet --header true -p -n en.openfoodfacts.org.products_out.csv products_zstd.pqt > parquet.shema
# It has to be modified because column detection is sometimes wrong.
# From Open Food Facts CSV, for example, the code column is detected as a an Int64, but it's in fact a "Utf8".
nano parquet.shema
# Generate parquet file.
# Using -c for compression is optional.
# -c zstd appears to be the best option regarding speed/compression.
./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt
# Try a query thanks to DuckDB. It's as fast as a database!
time ./duckdb test-duck.db "select * FROM (select count(data_quality_errors_tags) as products_with_issues from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''), (select count(data_quality_errors_tags) as products_with_issues_but_without_images from $db where data_quality_errors_tags != '' and last_image_datetime == '');"
┌──────────────────────┬─────────────────────────────────────────┐
│ products_with_issues │ products_with_issues_but_without_images │
├──────────────────────┼─────────────────────────────────────────┤
│ 29333 │ 4897 │
└──────────────────────┴─────────────────────────────────────────┘
real 0m0,211s
user 0m0,645s
sys 0m0,053s
Upvotes: 2