Igor Rivin
Igor Rivin

Reputation: 4864

Converting a very very large csv to parquet

I am trying to convert a csv file to parquet (I don't really care if it is done in python or command line, or...) In any case, this question addresses is, but the answers seem to require one to read the csv in first, and since in my case the csv is 17GB, this is not really feasible, so I would like some "offline" or streaming approach.

Upvotes: 5

Views: 1134

Answers (1)

CharlesNepote
CharlesNepote

Reputation: 25

I successfully converted a 7GB+ (2.7 millions lines) CSV file into a parquet file, using csv2parquet.

The process is simple:

  • First I had to clean my CSV with csvclean from csvkit (but you might not need this)
  • Generate a JSON schema with csv2parquet
  • Edit the schema by hand, as it might not suit you
  • Generate the parquet file thanks to csv2parquet
  • Bonus: use DuckDB to test simple SQL queries directly on the parquet file

You can probably reproduce the process if you download our CSV export at https://world.openfoodfacts.org/data

# Not needed for you, just in case you want to reproduce
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
csvclean -t en.openfoodfacts.org.products.csv

# Generate the schema
./csv2parquet --header true -p -n en.openfoodfacts.org.products_out.csv products_zstd.pqt > parquet.shema
# It has to be modified because column detection is sometimes wrong.
# From Open Food Facts CSV, for example, the code column is detected as a an Int64, but it's in fact a "Utf8".
nano parquet.shema

# Generate parquet file.
# Using -c for compression is optional.
# -c zstd appears to be the best option regarding speed/compression.
./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt

# Try a query thanks to DuckDB. It's as fast as a database!
time ./duckdb test-duck.db "select * FROM (select count(data_quality_errors_tags) as products_with_issues from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''), (select count(data_quality_errors_tags) as products_with_issues_but_without_images from $db where data_quality_errors_tags != '' and last_image_datetime == '');"
┌──────────────────────┬─────────────────────────────────────────┐
│ products_with_issues │ products_with_issues_but_without_images │
├──────────────────────┼─────────────────────────────────────────┤
│ 29333                │ 4897                                    │
└──────────────────────┴─────────────────────────────────────────┘

real    0m0,211s
user    0m0,645s
sys 0m0,053s

Upvotes: 2

Related Questions