Reputation: 1897
I have a file stored in HDFS as part-m-00000.gz.parquet
I tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but the file is compressed. I then ran gunzip part-m-00000.gz.parquet, but that doesn't uncompress the file since it doesn't recognise the .parquet
extension.
How do I get the schema / column names for this file?
Upvotes: 58
Views: 144849
Reputation: 8684
Parquet CLI: parquet-cli is a lightweight alternative to parquet-tools.
pip install parquet-cli          # installs via pip
parq filename.parquet            # view metadata
parq filename.parquet --schema   # view the schema
parq filename.parquet --head 10  # view the top 10 rows
This tool will provide basic info about the parquet file.
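Note that parq reads a local file, so if the file lives in HDFS (as in the question) you would first copy it down; a quick sketch, with placeholder paths:
hdfs dfs -get dir/part-m-00000.gz.parquet .
parq part-m-00000.gz.parquet --schema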
UPDATE (Alternatives):
DuckDB has a CLI tool (prebuilt binaries for Linux, Windows, and macOS) that can be used to query Parquet data from the command line.
PS C:\Users\nsuser\dev\standalone_executable_binaries> ./duckdb
Connected to a transient in-memory database.
Read Parquet Schema.
D DESCRIBE SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
OR
D SELECT * FROM PARQUET_SCHEMA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌───────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│    column_name    │ column_type │ null │ key │ default │ extra │
├───────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ registration_dttm │ TIMESTAMP   │ YES  │     │         │       │
│ id                │ INTEGER     │ YES  │     │         │       │
│ first_name        │ VARCHAR     │ YES  │     │         │       │
│ salary            │ DOUBLE      │ YES  │     │         │       │
└───────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
More on DuckDB is described here.
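The same check can also be scripted with DuckDB's Python API; a minimal sketch, assuming the duckdb package is installed and using a placeholder file path:
import duckdb

# transient in-memory database, same as the CLI session above
con = duckdb.connect()
# DESCRIBE returns one row per column: name, type, nullability, ...
rows = con.execute("DESCRIBE SELECT * FROM read_parquet('userdata1.parquet')").fetchall()
for row in rows:
    print(row)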
Upvotes: 15
Reputation: 1450
You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently compared to text files.
For that very reason, the Parquet project provides parquet-tools to do tasks like the one you are trying to do: open and view the schema, data, metadata, etc.
Check out the parquet-tools project: parquet-tools
Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case is
parquet-tools schema part-m-00000.parquet
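Besides schema, the same tool has other subcommands that are handy here, for example:
parquet-tools meta part-m-00000.parquet    # row group and column metadata
parquet-tools head part-m-00000.parquet    # first few records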
Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.
Upvotes: 56
Reputation: 317
If you are using R, the following wrapper function around functions in the arrow library will work for you:
read_parquet_schema <- function(file, col_select = NULL, as_data_frame = TRUE,
                                props = ParquetArrowReaderProperties$create(), ...) {
  require(arrow)
  # open the Parquet file, read only its schema, and return the column names
  reader <- ParquetFileReader$create(file, props = props, ...)
  schema <- reader$GetSchema()
  names <- names(schema)
  return(names)
}
Example:
arrow::write_parquet(iris,"iris.parquet")
read_parquet_schema("iris.parquet")
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Upvotes: 1
Reputation: 81
If you use Docker you can also run parquet-tools in a container:
docker run -ti -v C:\file.parquet:/tmp/file.parquet nathanhowell/parquet-tools schema /tmp/file.parquet
Upvotes: 6
Reputation: 11055
There is also a desktop application to view Parquet, as well as other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, etc.
Upvotes: 3
Reputation: 19308
Apache Arrow makes it easy to get Parquet metadata in a lot of different languages, including C, C++, Rust, Go, Java, JavaScript, etc.
Here's how to get the schema with PyArrow (the Python Apache Arrow API):
import pyarrow.parquet as pq
table = pq.read_table(path)
table.schema # pa.schema([pa.field("movie", "string", False), pa.field("release_year", "int64", True)])
See here for more details about how to read metadata information from Parquet files with PyArrow.
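If you only need the schema, a lighter sketch reads just the file footer instead of the whole table (the path below is a placeholder):
import pyarrow.parquet as pq

schema = pq.read_schema("part-m-00000.gz.parquet")  # reads only the footer metadata
print(schema.names)  # column names
print(schema)        # full schema with types and nullability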
You can also grab the schema of a Parquet file with Spark.
val df = spark.read.parquet("some_dir/")
df.schema // returns a StructType
StructType objects look like this:
StructType(
StructField(number,IntegerType,true),
StructField(word,StringType,true)
)
From the StructType object, you can infer the column names, data types, and nullable properties that are in the Parquet metadata. The Spark approach isn't as clean as the Arrow approach.
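For a quick, human-readable dump of the same information, printSchema on the same DataFrame also works; for the example columns above it prints something like:
df.printSchema()
// root
//  |-- number: integer (nullable = true)
//  |-- word: string (nullable = true)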
Upvotes: 3
Reputation: 2729
If your Parquet files are located in HDFS or S3, as mine are, you can try something like the following:
HDFS
parquet-tools schema hdfs://<YOUR_NAME_NODE_IP>:8020/<YOUR_FILE_PATH>/<YOUR_FILE>.parquet
S3
parquet-tools schema s3://<YOUR_BUCKET_PATH>/<YOUR_FILE>.parquet
Hope it helps.
Upvotes: 7
Reputation: 136
Since it is not a text file, you cannot do a "-text" on it. Even if you do not have parquet-tools installed, you can read it easily through Hive if you can load that file into a Hive table.
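A minimal sketch of that route, assuming the file has already been loaded into a Hive table (the table name below is a placeholder):
DESCRIBE parquet_table;                  -- shows the column names and types
SELECT * FROM parquet_table LIMIT 10;    -- peek at the data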
Upvotes: 1