Reputation: 457
I have a CSV input file. We read it using the following:
val rawdata = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load(filename)
This neatly reads the data and builds the schema.
The next step is to split the columns into String and Integer columns. How?
If the following is the schema of my dataset...
scala> rawdata.printSchema
root
|-- ID: integer (nullable = true)
|-- First Name: string (nullable = true)
|-- Last Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- DailyRate: integer (nullable = true)
|-- Dept: string (nullable = true)
|-- DistanceFromHome: integer (nullable = true)
I'd like to split this into two variables (StringCols, IntCols), where StringCols holds the names of the string columns ("First Name", "Last Name", "Dept") and IntCols holds the names of the integer columns ("ID", "Age", "DailyRate", "DistanceFromHome").
This is what I have tried:
val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)
Now, in types, I would like to loop over the entries, find every StringType, and look up the corresponding column name in names; and then do the same for IntegerType.
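In other words, something along these lines is what I am after (just a rough sketch of the idea, not code I have verified):

import org.apache.spark.sql.types.{IntegerType, StringType}

// Pair every column name with its DataType and keep the names per type
val stringCols = names.zip(types).collect { case (name, StringType) => name }
val intCols = names.zip(types).collect { case (name, IntegerType) => name }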
Upvotes: 3
Views: 87
Reputation: 74619
Use the dtypes operator:
dtypes: Array[(String, String)] Returns all column names and their data types as an array.
That gives you a more idiomatic way of dealing with the schema of a dataset.
// Sample dataset matching the question's schema (run in spark-shell, so toDF is in scope)
val rawdata = Seq(
  (1, "First Name", "Last Name", 43, 2000, "Dept", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")
scala> rawdata.dtypes.foreach(println)
(ID,IntegerType)
(First Name,StringType)
(Last Name,StringType)
(Age,IntegerType)
(DailyRate,IntegerType)
(Dept,StringType)
(DistanceFromHome,IntegerType)
I want to split this into two variables (StringCols, IntCols)
(I'd rather stick to using immutable values, if you don't mind.)
val emptyPair = (Seq.empty[String], Seq.empty[String])
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) {
  case ((strings, ints), (name, typ)) =>
    typ match {
      case "StringType"  => (name +: strings, ints)
      case "IntegerType" => (strings, name +: ints)
      case _             => (strings, ints) // leave columns of any other type out
    }
}
StringCols should have "First Name","Last Name","Dept" and IntCols should have "ID","Age","DailyRate","DistanceFromHome"
Note that the fold prepends names, so both sequences come out in reverse column order. You can reverse
the collections, but I'd rather avoid it, since it costs extra work and gives you nothing in return.
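For completeness, the same split can be sketched with partition over dtypes (just a sketch along the same lines), which keeps the columns in their original order and avoids the fold entirely:

// Split the (name, type) pairs into string columns and the rest,
// then drop the type strings and keep just the names
val (stringDts, otherDts) = rawdata.dtypes.partition { case (_, typ) => typ == "StringType" }
val stringCols = stringDts.map(_._1).toSeq
val intCols = otherDts.collect { case (name, "IntegerType") => name }.toSeq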
Upvotes: 0
Reputation: 40360
Here you go, you can filter your columns by type using the underlying schema and each field's dataType:
import org.apache.spark.sql.types.{IntegerType, StringType}

// df is the DataFrame read from the CSV (rawdata in the question)
val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

// Project the original DataFrame down to just those columns
val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt = df.select(intCols.head, intCols.tail : _*)
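With the schema from the question, the two projections would come out as follows (a sketch of the expected output, assuming df is the rawdata loaded above):

dfOfString.printSchema
// root
//  |-- First Name: string (nullable = true)
//  |-- Last Name: string (nullable = true)
//  |-- Dept: string (nullable = true)

dfOfInt.printSchema
// root
//  |-- ID: integer (nullable = true)
//  |-- Age: integer (nullable = true)
//  |-- DailyRate: integer (nullable = true)
//  |-- DistanceFromHome: integer (nullable = true)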
Upvotes: 3