Reputation: 127
I am trying to check if there is any method to see whether a particular column exists in a DataFrame, using Java Spark. I searched and ended up with suggestions related to Python, but nothing related to Java.
I am extracting this data from MongoDB and trying to check whether certain columns exist or not. There is no schema validation available in MongoDB for this collection.
The following is my schema, and I would like to check whether the columns in my config exist in it.
|-- _id: string (nullable = true)
|-- value: struct (nullable = true)
| |-- acctId: string (nullable = true)
| |-- conId: string (nullable = true)
| |-- dimensions: struct (nullable = true)
| | |-- device: struct (nullable = true)
| | | |-- accountId: long (nullable = true)
| | | |-- addFreeTitleTime: timestamp (nullable = true)
| | | |-- build: string (nullable = true)
| | | |-- country: string (nullable = true)
| | | |-- countryOfResidence: string (nullable = true)
| | | |-- createDate: timestamp (nullable = true)
| | | |-- number: string (nullable = true)
| | | |-- FamilyName: string (nullable = true)
| | | |-- did: long (nullable = true)
| | | |-- deviceToken: string (nullable = true)
| | | |-- initialBuildNumber: string (nullable = true)
| | | |-- language: string (nullable = true)
| | | |-- major: integer (nullable = true)
| | | |-- minor: integer (nullable = true)
| | | |-- model: string (nullable = true)
| | | |-- modelDesc: string (nullable = true)
| | | |-- modelId: string (nullable = true)
| | | |-- modifyDate: timestamp (nullable = true)
| | | |-- preReg: integer (nullable = true)
| | | |-- retailer: string (nullable = true)
| | | |-- serialNumber: string (nullable = true)
| | | |-- softwareUpdateDate: timestamp (nullable = true)
| | | |-- softwareVersion: string (nullable = true)
| | | |-- sourceId: string (nullable = true)
| | | |-- timeZone: string (nullable = true)
| | |-- location: struct (nullable = true)
Your inputs and suggestions would be of great value.
Thanks in advance.
Upvotes: 0
Views: 1808
Reputation: 1214
sourceDF.printSchema
// root
// |-- category: string (nullable = true)
// |-- tags: string (nullable = true)
// |-- datetime: string (nullable = true)
// |-- date: string (nullable = true)
val cols = sourceDF.columns
// cols: Array[String] = Array(category, tags, datetime, date)
val isFieldCategory = cols.filter(_ == "category")
// isFieldCategory: Array[String] = Array(category)
or
val isFieldTags = sourceDF.columns.contains("tags")
// isFieldTags: Boolean = true
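Since the question asks for Java, the same check there would be along these lines (a minimal sketch, assuming sourceDF is a Dataset<Row> that has already been loaded):

import java.util.Arrays;

// columns() returns the top-level column names as a String[]
boolean isFieldTags = Arrays.asList(sourceDF.columns()).contains("tags");
// or, using streams:
boolean isFieldCategory = Arrays.stream(sourceDF.columns()).anyMatch("category"::equals);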
Upvotes: 2
Reputation: 787
Yes, you can achieve this in Java by fetching all the columns of a Dataset and checking whether the column you want exists. Here is a sample:
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<Object1> dataSet = spark.read().text("dataPath").as(Encoders.bean(Object1.class)); // load the data into a Dataset of the bean class Object1
String[] columns = dataSet.columns(); // fetch all top-level column names
System.out.println(Arrays.toString(columns).contains("columnNameToCheckFor")); // check if the column name we want exists in the array of columns
Here I have used a very naive method to check whether the column name exists in the array of columns; you can use any other method to perform this check.
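For example, Arrays.asList(columns).contains("columnNameToCheckFor") avoids accidental substring matches. Also note that columns() only returns top-level names; since the schema in the question has nested fields such as value.dimensions.device.country, you would need to walk the schema to check those. A rough sketch (the hasColumn helper and the example path are just for illustration):

import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Walks a dotted path (e.g. "value.dimensions.device.country") through the
// schema and reports whether that nested field exists.
static boolean hasColumn(StructType schema, String dottedPath) {
    DataType current = schema;
    for (String part : dottedPath.split("\\.")) {
        if (!(current instanceof StructType)) {
            return false; // cannot descend any further
        }
        StructField match = null;
        for (StructField f : ((StructType) current).fields()) {
            if (f.name().equals(part)) {
                match = f;
                break;
            }
        }
        if (match == null) {
            return false; // this path segment does not exist
        }
        current = match.dataType(); // descend into the nested struct
    }
    return true;
}

// usage: hasColumn(dataSet.schema(), "value.dimensions.device.country")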
Upvotes: 2