Reputation: 6773
I have data from SQL Server that I need to manipulate in Apache Spark (Databricks).
In SQL Server, three of the table's key columns use a case-sensitive COLLATION, while the other columns do not. These three columns are short, alphanumeric identifiers from a vendor application, and we must be able to use them case-sensitively in predicates and join conditions, while using the other columns case-insensitively.
The table was exported as CSV.
Is there a way to mix case-sensitive and case-insensitive column comparisons in a DataFrame, aside from using the lower and upper functions? I can also import the CSV into a Databricks table, but that does not seem to offer a comparable collation option at the metadata level for joins.
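For illustration, this is roughly what we do today with lower, and what we would like to avoid (dfA/dfB and the column names are just placeholders): one join key must match case-sensitively, the other should not.

    import org.apache.spark.sql.functions.lower

    // Current workaround: force a common case at comparison time.
    // dfA/dfB, vendor_key and region_name are placeholder names.
    val joined = dfA.join(dfB,
      dfA("vendor_key") === dfB("vendor_key") &&                 // case-sensitive vendor key
        lower(dfA("region_name")) === lower(dfB("region_name"))) // case-insensitive column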
Thank you.
Upvotes: 4
Views: 4033
Reputation: 13528
There is no such option in Spark 2.x.
If you need the data's case preserved for downstream processing, yet you also need case-insensitive comparisons, you have three options:
Create views with additional columns that are upper/lower-case projections. This is the best option if you are using SparkSQL exclusively. I would not recommend materializing the additional case-insensitive columns in persisted data: the extra I/O cost will likely outweigh the CPU cost of changing case on the fly.
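A minimal sketch of that approach, assuming a table named vendor_data with a case-insensitive column region_name (the names are placeholders):

    // Expose a lower-cased projection alongside the original columns.
    spark.sql("""
      CREATE OR REPLACE TEMPORARY VIEW vendor_data_ci AS
      SELECT *, lower(region_name) AS region_name_ci
      FROM vendor_data
    """)

    // Case-insensitive predicates/joins then use region_name_ci,
    // while the case-sensitive identifier columns are used as-is.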
If you are using the Spark DSL, you can create abstractions that make your intent clearer without too much extra typing and without having to create new columns/views, e.g., introduce case-insensitive (in)equality and comparison operators if you are using Scala:

    df.where('x ciLT 'y) // ciLT is case-insensitive less-than, as an example
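A minimal sketch of such operators, assuming Symbol-based column references as in the line above (the names ciEQ/ciLT are only illustrative):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, lower}

    // Case-insensitive comparison operators built on lower(); illustrative only.
    implicit class CiSymbolOps(s: Symbol) {
      def ciEQ(other: Symbol): Column = lower(col(s.name)) === lower(col(other.name))
      def ciLT(other: Symbol): Column = lower(col(s.name)) < lower(col(other.name))
    }

    // With this in scope, df.where('x ciLT 'y) compiles and compares the columns case-insensitively.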
Depending on how much code you have to write, which operations require case-insensitive collation, and the language you are using, you can get pretty fancy, e.g., automatically rewrite expressions such that (in)equality and comparison operators behave in a case-sensitive or case-insensitive manner based on the table name. This requires an advanced understanding of Spark's internals and only makes sense if you have to manage a large codebase with high code velocity.
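To give a flavor of what such a rewrite can look like, here is a simplified sketch of a custom optimizer rule; it rewrites every string equality, so the table-name scoping and the other operators described above are omitted, and the rule name is hypothetical:

    import org.apache.spark.sql.catalyst.expressions.{EqualTo, Lower}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types.StringType

    // Simplified sketch: make string equality case-insensitive everywhere.
    // A real rule would also check which table/attributes are involved.
    object CaseInsensitiveEquality extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case EqualTo(l, r) if l.dataType == StringType && r.dataType == StringType =>
          EqualTo(Lower(l), Lower(r))
      }
    }

    // Register with the session (Spark 2.x):
    // spark.experimental.extraOptimizations ++= Seq(CaseInsensitiveEquality)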
If you primarily use case-insensitive operations in WHERE and ORDER/SORT BY, performance will be pretty good, as data rows will not be widened for processing. Otherwise, the columns will be "materialized" in Spark's internal rows and there will be a small performance penalty.
If you do not need the data's case preserved for downstream processing, switching to a common case representation during ingestion is the right approach.
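A minimal sketch of that, assuming the columns that only ever need case-insensitive semantics can simply be folded to lower case at ingestion (the path, table and column names are placeholders):

    import org.apache.spark.sql.functions.lower

    // Normalize the case-insensitive columns; leave the case-sensitive
    // vendor identifiers untouched.
    val raw = spark.read.option("header", "true").csv("/path/to/export.csv")
    val normalized = raw.withColumn("region_name", lower(raw("region_name")))

    normalized.write.mode("overwrite").saveAsTable("vendor_data")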
Upvotes: 3