Reputation: 8203
How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
Upvotes: 190
Views: 553354
Reputation: 301
You can read Parquet data with plain Python and pandas.

Install the packages:

pip install pandas pyarrow

Read the file:

import pandas as pd

def read_parquet(file):
    result = []
    data = pd.read_parquet(file)
    for index in data.index:
        # take every column except the last one for this row
        res = data.loc[index].values[0:-1]
        result.append(res)
    print(len(result))

file = "./data.parquet"
read_parquet(file)
Upvotes: 0
Reputation: 13582
Considering a .parquet file named data.parquet:

parquet_file = '../data.parquet'

Assuming one has a DataFrame parquet_df that one wants to save to the parquet file above, one can use pandas.DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:

parquet_df.to_parquet(parquet_file)

In order to read the parquet file back into a DataFrame new_parquet_df, one can use pandas.read_parquet() as follows:

new_parquet_df = pd.read_parquet(parquet_file)
Upvotes: 1
Reputation: 37930
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
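Since the question mentions data that may live in S3: pandas can also read a Parquet file directly from an s3:// URL when the s3fs package is installed. A minimal sketch, with a placeholder bucket and key:

import pandas as pd

# hypothetical bucket/key; needs s3fs plus pyarrow or fastparquet installed
df = pd.read_parquet('s3://my-bucket/path/to/data.parquet')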
Upvotes: 246
Reputation: 359
import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})

# write the DataFrame to Parquet, then read it back
df.to_parquet('sample.parquet')
df = pd.read_parquet('sample.parquet')
Upvotes: 15
Reputation: 8931
When writing to Parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB Parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. Although pickle can handle tuples whereas Parquet cannot.

df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
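Compression ratios depend heavily on the data, so it is worth measuring on your own files. A minimal sketch, using random data as a stand-in for your DataFrame and assuming a pyarrow build with brotli support:

import os

import numpy as np
import pandas as pd

# stand-in data; substitute your own DataFrame
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list('abcd'))

# write the same frame with several codecs and compare file sizes on disk
for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path))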
Upvotes: 4
Reputation: 115
Parquet datasets are often too large to fit comfortably in memory, so you can read them with Dask.

import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # read a single Parquet file into a pandas DataFrame
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # materialize everything as one in-memory pandas DataFrame
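Dask can also read a whole directory of Parquet files in one call with dask.dataframe.read_parquet, which accepts glob patterns; a minimal sketch:

import dask.dataframe as dd

# lazily reads every matching file; .compute() returns a pandas DataFrame
df = dd.read_parquet('data/*.parquet').compute()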
Upvotes: 2
Reputation: 1905
Aside from pandas, Apache pyarrow also provides a way to transform Parquet into a DataFrame.
The code is simple, just type:

import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
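If you only need some of the columns, read_table can restrict the read to those columns, which keeps memory use down. A minimal sketch with placeholder column names:

import pyarrow.parquet as pq

# only the listed columns are loaded from the file
df = pq.read_table(source=your_file_path, columns=['col_a', 'col_b']).to_pandas()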
Upvotes: 18
Reputation: 2837
Update: since the time I answered this, there has been a lot of work on Apache Arrow for better reading and writing of Parquet. Also see: http://wesmckinney.com/blog/python-parquet-multithreading/

There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It will create Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
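A minimal sketch of that pattern, assuming the DictReader helper shown in the parquet-python README (the file name is a placeholder):

import pandas as pd
import parquet  # the parquet-python package

# DictReader (per the project's README) yields one dict per row
with open('data.parquet', 'rb') as fo:
    rows = list(parquet.DictReader(fo))

df = pd.DataFrame(rows)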
Upvotes: 22