Reputation: 792
I'm trying to implement a join in Spark SQL using a LIKE condition.
The row I am performing the join on looks like this and is called 'revision':
Table A:
8NXDPVAE
Table B:
[4,8]NXD_V%
Performing the join on SQL server (A.revision LIKE B.revision
) works just fine, but when doing the same in Spark SQL, the join returns no rows (if using inner join) or null values for Table B (if using outer join).
This is the query I am running:
val joined = spark.sql("SELECT A.revision, B.revision FROM RAWDATA A LEFT JOIN TPTYPE B ON A.revision LIKE B.revision")
The plan looks like this:
== Physical Plan ==
BroadcastNestedLoopJoin BuildLeft, LeftOuter, revision#15 LIKE revision#282, false
:- BroadcastExchange IdentityBroadcastMode
: +- *Project [revision#15]
: +- *Scan JDBCRelation(RAWDATA) [revision#15] PushedFilters: [EqualTo(bulk_id,2016092419270100198)], ReadSchema: struct<revision>
+- *Scan JDBCRelation(TPTYPE) [revision#282] ReadSchema: struct<revision>
Is it possible to perform a LIKE join like this or am I way off?
Upvotes: 10
Views: 24158
Reputation: 89
Syntax for like in spark scala api:
dataframe.filter(col("columns_name").like("regex"))
Upvotes: -2
Reputation: 330453
You are only a little bit off. Spark SQL and Hive follow SQL standard conventions where LIKE
operator accepts only two special characters:
_
(underscore) - which matches an arbitrary character.%
(percent) - which matches an arbitrary sequence of characters.Square brackets have no special meaning and [4,8]
matches only a [4,8]
literal:
spark.sql("SELECT '[4,8]' LIKE '[4,8]'").show
+----------------+
|[4,8] LIKE [4,8]|
+----------------+
| true|
+----------------+
To match complex patterns you can use RLIKE
operator which suports Java regular expressions:
spark.sql("SELECT '8NXDPVAE' RLIKE '^[4,8]NXD.V.*$'").show
+-----------------------------+
|8NXDPVAE RLIKE ^[4,8]NXD.V.*$|
+-----------------------------+
| true|
+-----------------------------+
Upvotes: 13