How to split columns into two sets per type?

Question

I have a CSV input file. We read that using the following

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)

This neatly reads the data and builds the schema.

The next step is to split the columns into String and Integer columns. How?

If the following is the schema of my dataset...

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)

I'd like to split this into two variables (StringCols, IntCols) where:

StringCols should have "First Name","Last Name","Dept"
IntCols should have "ID","Age","DailyRate","DistanceFromHome"

This is what I have tried :

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)

Now in types, I would like to loop and find all StringType and lookup up in names for the column name, similarly for IntegerType.

eliasah · Accepted Answer

Here you go, you can filter your columns by type using the underlying schema and the dataType

import org.apache.spark.sql.types.{IntegerType, StringType}

val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt = df.select(intCols.head, intCols.tail : _*)

How to split columns into two sets per type?

Answers (2)

Related Questions