inquisitiveProgrammer

Reputation: 992

Python - read parquet file without pandas

Currently I'm using the code below, on Python 3.5 on Windows, to read a Parquet file.

import pandas as pd

parquetfilename = 'File1.parquet'
parquetFile = pd.read_parquet(parquetfilename, columns=['column1', 'column2'])  

However, I'd like to do this without pandas. What is the best way? I'm using both Python 2.7 and 3.6 on Windows.

Upvotes: 6

Views: 3450

Answers (1)

You can use DuckDB for this. It's an embedded RDBMS similar to SQLite, but designed for OLAP workloads. It has a nice Python API and a SQL function for importing Parquet files:

import duckdb

conn = duckdb.connect(":memory:") # or a file name to persist the DB

# Keep in mind this doesn't support partitioned datasets,
# so you can only read one partition at a time
conn.execute("CREATE TABLE mydata AS SELECT * FROM parquet_scan('/path/to/mydata.parquet')")

# Export a query as CSV
conn.execute("COPY (SELECT * FROM mydata WHERE col = 'val') TO 'col_val.csv' WITH (HEADER 1, DELIMITER ',')")
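If you only need a couple of columns as plain Python objects (the equivalent of the `columns=[...]` argument in the question), you can probably skip the intermediate table and query the file directly. This is a rough sketch, not the only way to do it: the helper name `read_parquet_columns` is made up for illustration, and it assumes the `duckdb` package is installed on a Python version it supports.

```python
def read_parquet_columns(path, columns):
    """Return (column_names, rows) for the selected columns of one Parquet file.

    Hypothetical helper for illustration; rows come back as a list of tuples.
    """
    import duckdb  # third-party package: pip install duckdb

    # Double-quote identifiers so names with spaces or mixed case still work.
    select_list = ", ".join(
        '"{}"'.format(name.replace('"', '""')) for name in columns
    )
    # Embed the path as a SQL string literal, escaping single quotes.
    path_literal = "'{}'".format(path.replace("'", "''"))

    conn = duckdb.connect(":memory:")
    try:
        conn.execute(
            "SELECT {} FROM parquet_scan({})".format(select_list, path_literal)
        )
        names = [col[0] for col in conn.description]
        rows = conn.fetchall()
    finally:
        conn.close()
    return names, rows
```

Usage would look like `names, rows = read_parquet_columns('File1.parquet', ['column1', 'column2'])`. Only the listed columns are selected, so pandas never enters the picture.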

Upvotes: 1
