Reputation: 151
I have a Parquet file on my Hadoop cluster. I want to capture the column names and their data types and write them to a text file. How do I get the column names and their data types of a Parquet file using PySpark?
Upvotes: 8
Views: 21353
Reputation: 51
Use dataframe.printSchema() - it prints out the schema in tree format:

df.printSchema()
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
You can redirect the output of your program and capture that in a text file.
Upvotes: 3
Reputation: 330063
You can simply read the file and use schema to access individual fields:
sqlContext.read.parquet(path_to_parquet_file).schema.fields
Upvotes: 11