Ignacio Alorre

Reputation: 7605

Spark Dataframe - How to get a particular field from a struct type column

I have a data frame with a structure like this:

root
 |-- npaDetails: struct (nullable = true)
 |    |-- additionalInformation: struct (nullable = true)
 |    |-- npaStatus: struct (nullable = true)
 |    |-- npaDetails: struct (nullable = true)
 |-- npaHeaderData: struct (nullable = true)
 |    |-- npaNumber: string (nullable = true)
 |    |-- npaDownloadDate: string (nullable = true)     
 |    |-- npaDownloadTime: string (nullable = true) 

I want to retrieve all npaNumber from all the rows in the dataframe.

My approach was to iterate over all rows in the data frame and, for each one, extract the value stored in the field npaNumber of the column npaHeaderData. So I wrote the following code:

parquetFileDF.foreach { newRow =>  

  //To retrieve the second column
  val column = newRow.get(1)

  //The following line is not allowed
  //val npaNumber= column.getAs[String]("npaNumber")  

  println(column)

}

The content of column printed in each iteration looks like:

[207400956,27FEB17,09.30.00]

But column is of type Any, so I am not able to extract any of its fields. Can anyone tell me what I am doing wrong, or what approach I should follow instead?

Thanks

Upvotes: 3

Views: 9140

Answers (3)

Sampat Kumar

Reputation: 502

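You can do it as shown below, which avoids the [] brackets when reading data from a data frame.

// ids is a DataFrame with columns {id, name}

// Take the first column of every row as a String
val idRDDs = ids.rdd.map(x => x.getAs[String](0))
// Bring the values to the driver and print one per line
idRDDs.collect().foreach(println)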

This approach should solve your issue.
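Applied to the nested schema from the question, the same Row-based idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the local SparkSession and the one-row stand-in for parquetFileDF are made up for the example (built from the sample values printed in the question), and the key point is that a struct column comes back as a nested Row, so it can be read with getAs[Row] and then queried by field name.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.struct

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("row-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for parquetFileDF: a single struct column named npaHeaderData.
val parquetFileDF = Seq(("207400956", "27FEB17", "09.30.00"))
  .toDF("npaNumber", "npaDownloadDate", "npaDownloadTime")
  .select(struct($"npaNumber", $"npaDownloadDate", $"npaDownloadTime").as("npaHeaderData"))

// collect() brings the rows to the driver, so this suits small results only.
val npaNumbersFromRows: Array[String] = parquetFileDF.collect().map { row =>
  // The struct column is itself a Row, not a plain Any value
  val header = row.getAs[Row]("npaHeaderData")
  header.getAs[String]("npaNumber")
}

npaNumbersFromRows.foreach(println)
```

Reading the column with newRow.get(1) returns Any, which is why the field lookup in the question does not compile; asking for a Row explicitly keeps the type information.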

Upvotes: 0

Prasad Khode

Reputation: 6739

You can call select() on the dataframe, which gives you a new dataframe containing only the specified column:

val newDataFrame = dataFrame.select(dataFrame("npaHeaderData.npaNumber").as("npaNumber"))

Upvotes: 1

Ramesh Maharjan

Reputation: 41957

If you only want to extract npaNumber, you can do:

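// The $"..." syntax needs the session implicits in scope; `spark` is assumed to be your SparkSession
import spark.implicits._
parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))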

This gives you a dataframe with only the npaNumber column.
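If the values are needed on the driver as plain strings rather than as a dataframe, one possible follow-up is to convert the selected column to a typed Dataset and collect it. The sketch below assumes a local SparkSession and uses a made-up one-row stand-in for parquetFileDF (built from the sample values in the question); only the select, as[String], and collect calls are the point.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for parquetFileDF: a single struct column named npaHeaderData.
val parquetFileDF = Seq(("207400956", "27FEB17", "09.30.00"))
  .toDF("npaNumber", "npaDownloadDate", "npaDownloadTime")
  .select(struct($"npaNumber", $"npaDownloadDate", $"npaDownloadTime").as("npaHeaderData"))

// Select the nested field, view it as Dataset[String], then pull it to the driver.
val npaNumbers: Array[String] = parquetFileDF
  .select($"npaHeaderData.npaNumber".as("npaNumber"))
  .as[String]
  .collect()

npaNumbers.foreach(println)
```

Because collect() materializes every value in driver memory, it is best kept for results that are known to be small; for large data it is usually safer to keep working with the dataframe returned by select.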

Upvotes: 6
