Balaji Krishnan

Reputation: 457

How to split columns into two sets per type?

I have a CSV input file. I read it using the following:

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)

This neatly reads the data and builds the schema.

The next step is to split the columns into String and Integer columns. How?

If the following is the schema of my dataset...

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)

I'd like to split this into two variables (StringCols, IntCols), where StringCols should have "First Name", "Last Name", "Dept" and IntCols should have "ID", "Age", "DailyRate", "DistanceFromHome".

This is what I have tried:

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)

Now I would like to loop over types, find every StringType entry, and look up the corresponding column name in names; similarly for IntegerType.
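I was thinking of something along these lines (an untested sketch building on the names and types values above), but I'm not sure it is idiomatic:

import org.apache.spark.sql.types.{IntegerType, StringType}

// Pair each column name with its data type, then keep the names per type.
val namesAndTypes = names.zip(types)
val stringCols = namesAndTypes.collect { case (name, StringType)  => name }
val intCols    = namesAndTypes.collect { case (name, IntegerType) => name }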

Upvotes: 3

Views: 87

Answers (2)

Jacek Laskowski

Reputation: 74619

Use the dtypes operator:

dtypes: Array[(String, String)] Returns all column names and their data types as an array.

That gives you a more idiomatic way of dealing with the schema of a Dataset.

val rawdata = Seq(
  (1, "First Name", "Last Name", 43, 2000, "Dept", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")
scala> rawdata.dtypes.foreach(println)
(ID,IntegerType)
(First Name,StringType)
(Last Name,StringType)
(Age,IntegerType)
(DailyRate,IntegerType)
(Dept,StringType)
(DistanceFromHome,IntegerType)

I want to split this into two variables (StringCols, IntCols)

(I'd rather stick to immutable values, if you don't mind.)

val emptyPair = (Seq.empty[String], Seq.empty[String])
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) {
  case ((strings, ints), (name, "StringType"))  => (name +: strings, ints)
  case ((strings, ints), (name, "IntegerType")) => (strings, name +: ints)
}

StringCols should have "First Name","Last Name","Dept" and IntCols should have "ID","Age","DailyRate","DistanceFromHome"

You can reverse the collections to get the original column order, but I'd rather avoid it: it is extra work that gives you nothing in return.
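If the schema order did matter, one variant (a sketch of mine, not part of the original answer) is to append rather than prepend, so no reverse step is needed:

// Sketch: append (:+) so the names come out in schema order.
// Appending to a Seq is O(n) per element, but with a handful of columns that is negligible.
// The extra catch-all keeps the fold total if other data types ever show up.
val (stringColsOrdered, intColsOrdered) =
  rawdata.dtypes.foldLeft((Seq.empty[String], Seq.empty[String])) {
    case ((strings, ints), (name, "StringType"))  => (strings :+ name, ints)
    case ((strings, ints), (name, "IntegerType")) => (strings, ints :+ name)
    case (acc, _)                                 => acc
  }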

Upvotes: 0

eliasah

Reputation: 40360

Here you go; you can filter your columns by type using the underlying schema and the dataType:

import org.apache.spark.sql.types.{IntegerType, StringType}

val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt = df.select(intCols.head, intCols.tail : _*)
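One small caveat (my note, not part of the answer): stringCols.head throws on an empty list, so if a schema might lack one of the types, a guard along these lines (hypothetical helper, selectIfAny is my name for it) keeps the select safe:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: only select when there is at least one column of that type.
def selectIfAny(df: DataFrame, cols: Seq[String]): Option[DataFrame] =
  cols.headOption.map(head => df.select(head, cols.tail: _*))

val maybeDfOfString = selectIfAny(df, stringCols)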

Upvotes: 3
