Reputation: 7605
I have a data frame with a structure like this:
root
 |-- npaDetails: struct (nullable = true)
 |    |-- additionalInformation: struct (nullable = true)
 |    |-- npaStatus: struct (nullable = true)
 |    |-- npaDetails: struct (nullable = true)
 |-- npaHeaderData: struct (nullable = true)
 |    |-- npaNumber: string (nullable = true)
 |    |-- npaDownloadDate: string (nullable = true)
 |    |-- npaDownloadTime: string (nullable = true)
I want to retrieve the npaNumber value from every row in the data frame.
My approach was to iterate over all rows and, for each one, extract the value stored in the npaNumber field of the npaHeaderData column. So I wrote the following lines:
parquetFileDF.foreach { newRow =>
  // To retrieve the second column
  val column = newRow.get(1)
  // The following line is not allowed
  // val npaNumber = column.getAs[String]("npaNumber")
  println(column)
}
The content of column printed in each iteration looks like:
[207400956,27FEB17,09.30.00]
But column is of type Any, so I am not able to extract any of its fields. Can anyone tell me what I am doing wrong, or what approach I should follow instead?
Thanks
Upvotes: 3
Views: 9140
Reputation: 502
You can do it as below, which avoids the surrounding [] when reading data from a data frame.
// ids is a DataFrame with columns {id, name}
val idRDDs = ids.rdd.map(x => x.getAs[String](0))
// Print each id (on a cluster, the output appears in the executor logs)
for (id <- idRDDs) {
  println(id)
}
This approach should solve your issue.
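Applied to the nested schema in the question, the same Row-based idea might look like the sketch below. The struct column comes back as a nested Row, so it can be read with getAs[Row] and then queried by field name. The case classes, the sample values (taken from the printed row in the question), and the local SparkSession are assumptions made only so the example is self-contained; with a real Parquet file you would use parquetFileDF instead of df.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object NestedRowAccess {
  // Hypothetical case classes mirroring part of the question's schema
  case class NpaHeaderData(npaNumber: String, npaDownloadDate: String, npaDownloadTime: String)
  case class Record(npaHeaderData: NpaHeaderData)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("nested-row-access").getOrCreate()
    import spark.implicits._

    // Stand-in for parquetFileDF
    val df = Seq(Record(NpaHeaderData("207400956", "27FEB17", "09.30.00"))).toDF()

    val npaNumbers = df.rdd.map { row =>
      // A struct column is returned as a nested Row, not as a plain value
      val header = row.getAs[Row]("npaHeaderData")
      header.getAs[String]("npaNumber")
    }

    // collect() brings the values to the driver before printing
    npaNumbers.collect().foreach(println)

    spark.stop()
  }
}
```

Reading the column with get(1) returns Any, which is why the compiler rejects getAs on it in the question; asking for a Row explicitly gives access to the nested fields.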
Upvotes: 0
Reputation: 6739
You can call select() on the data frame, which gives you a new data frame containing only the specified column:
val newDataFrame = dataFrame.select(dataFrame("npaHeaderData.npaNumber").as("npaNumber"))
Upvotes: 1
Reputation: 41957
If you only want to extract npaNumber, you can do:
// The $"..." syntax requires import spark.implicits._
parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))
You should then have a data frame with only the npaNumber column.
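If you then need the values themselves rather than a data frame, one option is to convert the selected column to a typed Dataset and collect it. This is only a sketch: the case classes, sample row, and local SparkSession are assumptions for the sake of a runnable example, and collect() is only reasonable when the result fits in driver memory.

```scala
import org.apache.spark.sql.SparkSession

object CollectNpaNumbers {
  // Hypothetical case classes mirroring part of the question's schema
  case class NpaHeaderData(npaNumber: String, npaDownloadDate: String, npaDownloadTime: String)
  case class Record(npaHeaderData: NpaHeaderData)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("collect-npa-numbers").getOrCreate()
    import spark.implicits._ // enables $"..." and the String encoder used by as[String]

    // Stand-in for parquetFileDF
    val df = Seq(Record(NpaHeaderData("207400956", "27FEB17", "09.30.00"))).toDF()

    // Select the nested field, view it as Dataset[String], and bring it to the driver
    val npaNumbers: Array[String] =
      df.select($"npaHeaderData.npaNumber".as("npaNumber")).as[String].collect()

    npaNumbers.foreach(println)

    spark.stop()
  }
}
```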
Upvotes: 6