Reputation: 1897
I have a file stored in HDFS as part-m-00000.gz.parquet
I tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but the file is compressed. I then ran gunzip part-m-00000.gz.parquet, but that doesn't uncompress the file since it doesn't recognise the .parquet
extension.
How do I get the schema / column names for this file?
Upvotes: 58
Views: 144849
Reputation: 8684
Parquet CLI: parquet-cli is a lightweight alternative to parquet-tools.
pip install parquet-cli          # installs via pip
parq filename.parquet            # view metadata
parq filename.parquet --schema   # view the schema
parq filename.parquet --head 10  # view the top 10 rows
This tool will provide basic info about the parquet file.
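Note that parq reads a local file, so if the file lives in HDFS (as in the question) you would first copy it down; a quick sketch, with placeholder paths:
hdfs dfs -get dir/part-m-00000.gz.parquet .
parq part-m-00000.gz.parquet --schema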
UPDATE (Alternatives):
DuckDB has a CLI tool (prebuilt binaries for Linux, Windows, and macOS) that can be used to query Parquet data from the command line.
PS C:\Users\nsuser\dev\standalone_executable_binaries> ./duckdb
Connected to a transient in-memory database.
Read Parquet Schema.
D DESCRIBE SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
OR
D SELECT * FROM PARQUET_SCHEMA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌───────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│    column_name    │ column_type │ null │ key │ default │ extra │
├───────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ registration_dttm │ TIMESTAMP   │ YES  │     │         │       │
│ id                │ INTEGER     │ YES  │     │         │       │
│ first_name        │ VARCHAR     │ YES  │     │         │       │
│ salary            │ DOUBLE      │ YES  │     │         │       │
└───────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
More on DuckDB is described here.
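The same check can also be scripted with DuckDB's Python API; a minimal sketch, assuming the duckdb package is installed and using a placeholder file path:
import duckdb

# transient in-memory database, same as the CLI session above
con = duckdb.connect()
# DESCRIBE returns one row per column: name, type, nullability, ...
rows = con.execute("DESCRIBE SELECT * FROM read_parquet('userdata1.parquet')").fetchall()
for row in rows:
    print(row)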
Upvotes: 15
Reputation: 1450
You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently compared to text files.
For that very reason, the Parquet project provides parquet-tools to do tasks like the one you are trying to do: open and view the schema, data, metadata, etc.
Check out the parquet-tools project: parquet-tools
Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case is
parquet-tools schema part-m-00000.parquet
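Besides schema, the same tool has other subcommands that are handy here, for example:
parquet-tools meta part-m-00000.parquet    # row group and column metadata
parquet-tools head part-m-00000.parquet    # first few records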
Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.
Upvotes: 56
Reputation: 317
If you are using R, the following wrapper function around functions in the arrow library will work for you:
read_parquet_schema <- function(file, col_select = NULL, as_data_frame = TRUE,
                                props = ParquetArrowReaderProperties$create(), ...) {
  require(arrow)
  # open the Parquet file, read only its schema, and return the column names
  reader <- ParquetFileReader$create(file, props = props, ...)
  schema <- reader$GetSchema()
  names <- names(schema)
  return(names)
}
Example:
arrow::write_parquet(iris,"iris.parquet")
read_parquet_schema("iris.parquet")
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Upvotes: 1
Reputation: 81
If you use Docker you can also run parquet-tools in a container:
docker run -ti -v C:\file.parquet:/tmp/file.parquet nathanhowell/parquet-tools schema /tmp/file.parquet
Upvotes: 6
Reputation: 11055
There is also a desktop application to view Parquet, as well as other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, etc.
Upvotes: 3
Reputation: 19308
Apache Arrow makes it easy to get Parquet metadata in a lot of different languages, including C, C++, Rust, Go, Java, JavaScript, etc.
Here's how to get the schema with PyArrow (the Python Apache Arrow API):
import pyarrow.parquet as pq
table = pq.read_table(path)
table.schema # pa.schema([pa.field("movie", "string", False), pa.field("release_year", "int64", True)])
See here for more details about how to read metadata information from Parquet files with PyArrow.
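If you only need the schema, a lighter sketch reads just the file footer instead of the whole table (the path below is a placeholder):
import pyarrow.parquet as pq

schema = pq.read_schema("part-m-00000.gz.parquet")  # reads only the footer metadata
print(schema.names)  # column names
print(schema)        # full schema with types and nullability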
You can also grab the schema of a Parquet file with Spark.
val df = spark.read.parquet("some_dir/")
df.schema // returns a StructType
StructType objects look like this:
StructType(
StructField(number,IntegerType,true),
StructField(word,StringType,true)
)
From the StructType object, you can infer the column names, data types, and nullable properties that are in the Parquet metadata. The Spark approach isn't as clean as the Arrow approach.
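For a quick, human-readable dump of the same information, printSchema on the same DataFrame also works; for the example columns above it prints something like:
df.printSchema()
// root
//  |-- number: integer (nullable = true)
//  |-- word: string (nullable = true)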
Upvotes: 3
Reputation: 2729
If your Parquet files are located in HDFS or S3, as mine are, you can try something like the following:
HDFS
parquet-tools schema hdfs://<YOUR_NAME_NODE_IP>:8020/<YOUR_FILE_PATH>/<YOUR_FILE>.parquet
S3
parquet-tools schema s3://<YOUR_BUCKET_PATH>/<YOUR_FILE>.parquet
Hope it helps.
Upvotes: 7
Reputation: 136
Since it is not a text file, you cannot do a "-text" on it. Even if you do not have parquet-tools installed, you can read it easily through Hive if you can load that file into a Hive table.
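A minimal sketch of that route, assuming the file has already been loaded into a Hive table (the table name below is a placeholder):
DESCRIBE parquet_table;                  -- shows the column names and types
SELECT * FROM parquet_table LIMIT 10;    -- peek at the data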
Upvotes: 1