Reputation: 1345
df1.printSchema()
prints the column names together with their data types.
df1.drop($"colName")
will drop columns by their name.
Is there a way to adapt this command to drop columns by their data type instead?
Upvotes: 6
Views: 5445
Reputation: 11587
If you want to drop specific columns based on their types, the snippet below should help. In this example the dataframe has two columns, of type String and Int respectively. I drop the String field (all fields of type String would be dropped) based on its type.
import sqlContext.implicits._
val df = sc.parallelize(('a' to 'l').map(_.toString) zip (1 to 10)).toDF("c1","c2")
val newDf = df.schema.fields
  .collect({ case x if x.dataType.typeName == "string" => x.name }) // names of all String columns
  .foldLeft(df)({ case (dframe, field) => dframe.drop(field) })     // drop each of them in turn
The resulting newDf
is org.apache.spark.sql.DataFrame = [c2: int]
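The same collect-then-fold pattern can be seen in isolation. Below is a minimal sketch in plain Scala, where `Frame` and `Field` are toy stand-ins for Spark's DataFrame and StructField (not Spark APIs), so it runs without a Spark session:

```scala
// Toy stand-in for StructField: just a name and a type name.
case class Field(name: String, typeName: String)

// Toy stand-in for DataFrame: drop(colName) removes a column by name,
// and removing an absent column is a no-op, mirroring DataFrame.drop.
case class Frame(fields: List[Field]) {
  def drop(colName: String): Frame = Frame(fields.filterNot(_.name == colName))
}

object DropByType {
  // Collect the names of fields whose type matches, then fold drop over them,
  // exactly the shape of the snippet above.
  def dropByType(frame: Frame, typeName: String): Frame = {
    val toDrop = frame.fields.collect { case f if f.typeName == typeName => f.name }
    toDrop.foldLeft(frame)((fr, name) => fr.drop(name))
  }
}

val df = Frame(List(Field("c1", "string"), Field("c2", "integer")))
val newDf = DropByType.dropByType(df, "string")
println(newDf.fields.map(f => s"${f.name}: ${f.typeName}").mkString("[", ", ", "]"))
// → [c2: integer]
```

Note also that Spark 2.0+ overloads drop to take multiple names at once (drop(colNames: String*)), so on newer versions the fold can be replaced with a single varargs call.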
Upvotes: 9
Reputation: 477
Here is a fancy way in Scala:
val categoricalFeatColNames = df.schema.fields filter { _.dataType.isInstanceOf[org.apache.spark.sql.types.StringType] } map { _.name }
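A pattern match on the field's dataType does the filter and the map in one pass and extends naturally to several types. A sketch in plain Scala, where `Col` and the `ColType` objects are toy stand-ins for Spark's StructField and DataType (not the Spark API):

```scala
// Toy stand-ins for Spark's DataType tags and StructField.
sealed trait ColType
case object StrType extends ColType
case object IntType extends ColType

case class Col(name: String, dataType: ColType)

object TypeFilter {
  // collect with a guarded pattern replaces filter + map in a single pass.
  def namesOfType(fields: Seq[Col], t: ColType): Seq[String] =
    fields.collect { case Col(name, dt) if dt == t => name }
}

val schema = Seq(Col("city", StrType), Col("age", IntType), Col("country", StrType))
println(TypeFilter.namesOfType(schema, StrType))
// → List(city, country)
```

The resulting names can then be fed to drop (one per call, or as varargs on Spark 2.0+) just as in the answer above.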
Upvotes: 2