Reputation: 382
I'm trying to get column names from a nested JSON via DataFrames. The schema is given below:
root
 |-- body: struct (nullable = true)
 |    |-- Sw1: string (nullable = true)
 |    |-- Sw2: string (nullable = true)
 |    |-- Sw3: string (nullable = true)
 |    |-- Sw420: string (nullable = true)
 |-- headers: struct (nullable = true)
 |    |-- endDate: string (nullable = true)
 |    |-- file: string (nullable = true)
 |    |-- startDate: string (nullable = true)
I can get the column names "body" and "headers" with df.columns, but when I try to get the column names from the body (e.g. Sw1, Sw2, ...) with df.select("body").columns, it always just gives me the body column.
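Roughly what I'm seeing (a minimal sketch, assuming df is the DataFrame read from this JSON):

df.columns                  // Array(body, headers)
df.select("body").columns   // Array(body) -- still only the top-level column name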
Any suggestion? :)
Upvotes: 2
Views: 9737
Reputation: 11
If the nested JSON has an array of StructTypes, then the following code can be used (it is an extension of the code given by Michael Armbrust below):
import org.apache.spark.sql.types._

def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType =>
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case a: ArrayType =>
    findFields(path, a.elementType)
  case _ =>
    println(path)
}
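A usage sketch (assuming the same sqlContext/sc setup as in Michael Armbrust's answer below), where the fields of the array's element type show up as leaf paths:

val df = sqlContext.read.json(sc.parallelize("""{"a": [{"b": 1}, {"b": 2}]}""" :: Nil))
findFields("", df.schema)
// prints: .a.b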
Upvotes: 1
Reputation: 1
To get the nested column names, you can use code like the one below.
Call it from your main method like this:
findFields(df, df.schema)
Method:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

def findFields(df: DataFrame, dt: DataType): Unit = {
  // Assumes every top-level field is itself a struct
  val topLevelFields = dt.asInstanceOf[StructType].fields
  for (field <- topLevelFields) {
    val innerFields = field.dataType.asInstanceOf[StructType].fields
    for (f <- innerFields) {
      println("Inner Columns of " + field.name + " -->>" + f.name)
    }
  }
}
Note: This will only work when all the top-level columns are of struct type.
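For example, with the DataFrame from the question it would print something like (a sketch, output abbreviated):

findFields(df, df.schema)
// Inner Columns of body -->>Sw1
// Inner Columns of body -->>Sw2
// ...
// Inner Columns of headers -->>endDate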
Upvotes: 0
Reputation: 1565
If the question is how to find the nested column names, you can do this by inspecting the schema of the DataFrame. The schema is represented as a StructType, which can have fields of other DataType objects (including other nested structs). If you want to discover all the fields, you'll have to walk this tree recursively. For example:
import org.apache.spark.sql.types._

def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType =>
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case other =>
    println(s"$path: $other")
}
This walks the tree and prints out all the leaf fields and their type:
val df = sqlContext.read.json(sc.parallelize("""{"a": {"b": 1}}""" :: Nil))
findFields("", df.schema)
prints: .a.b: LongType
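As a side route for the specific schema in the question (a sketch, not taken from this answer), the nested field names of a single struct column can also be read straight off the schema:

// df here is the question's DataFrame with the body/headers schema
df.schema("body").dataType.asInstanceOf[StructType].fieldNames
// Array(Sw1, Sw2, Sw3, Sw420)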
Upvotes: 7