Reputation: 572
I have a Spark DataFrame in PySpark and I want to store its schema in another Spark DataFrame.
For example, I have a sample DataFrame df
that looks like this:
+---+-------------------+
| id| v|
+---+-------------------+
| 0| 0.4707538108432022|
| 0|0.39170676690905415|
| 0| 0.8249512619546295|
| 0| 0.3366111661094958|
| 0| 0.8974360488327017|
+---+-------------------+
I can look at the schema of df
with:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- v: double (nullable = false)
What I need is a DataFrame that displays the above information about df
in two columns, col_name
and dtype
.
Expected Output:
+---------+-------------------+
| col_name| dtype|
+---------+-------------------+
| id| integer|
| v| double|
+---------+-------------------+
How do I achieve this? I cannot find anything regarding this. Thanks.
Upvotes: 1
Views: 3302
Reputation: 43504
The simplest thing would be to create a DataFrame from df.dtypes
:
spark.createDataFrame(df.dtypes, ["col_name", "dtype"]).show()
#+--------+------+
#|col_name| dtype|
#+--------+------+
#| id| int|
#| v|double|
#+--------+------+
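(df.dtypes is just a list of (column name, type string) tuples using the short type names, so for this df it would look roughly like:)
df.dtypes
#[('id', 'int'), ('v', 'double')]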
But if you want the dtype
column to match what printSchema
shows, you can get it from df.schema
:
spark.createDataFrame(
    [(d['name'], d['type']) for d in df.schema.jsonValue()['fields']],
    ["col_name", "dtype"]
).show()
#+--------+-------+
#|col_name| dtype|
#+--------+-------+
#| id|integer|
#| v| double|
#+--------+-------+
Upvotes: 1