user11751463

Extracting an expression fails in Spark Scala

I'm trying to extract the last set of numbers from expressions like these:

urn:fb:xyx:266730227 -> 266730227
urn:fb:pqr:(urn:fb:abc:6217401,10444030746) -> 10444030746

This code works in Hive:

SELECT
CAST(regexp_extract(value.project, '.*,(\d+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '(\d+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '(\d+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10

However, in Spark Scala this throws the following error:

FAILED, exitCode: 15, (reason: User class threw exception: scala.StringContext$InvalidEscapeException: invalid escape '\d' not one of [\b, \t, \n, \f, \r, \\, \", \'] at index 94 in "
|SELECT
|  *,
|  CAST(regexp_extract(value.project, '.*,(\d+)', 1) AS BIGINT) AS p_id,
|  CAST(regexp_extract(value.context, '(\d+)', 0) AS BIGINT) AS co_id,
|  CAST(regexp_extract(key.identity, '(\d+)', 0) AS BIGINT) AS ca_id
|FROM ". Use \\ for literal \.

When I try doing what the error recommends,

SELECT
CAST(regexp_extract(value.project, '.*,(\\d+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '(\\d+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '(\\d+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10

The result columns are all NULL.

Upvotes: 1

Views: 1118

Answers (2)

jxc

Reputation: 14008

It's all about how characters are escaped inside string literals. In a regular Scala string literal, backslash escapes are processed, so you have to manually escape a literal backslash with another backslash (write \\ to get a single \). For example:

val df = Seq("urn:fb:xyx:266730227", "urn:fb:pqr:(urn:fb:abc:6217401,10444030746)").toDF("value")
df.createOrReplaceTempView("tbl")

// pattern as a regular string literal
df.withColumn("p_id", regexp_extract($"value", "(\\d+)", 1)).show(false)
+-------------------------------------------+---------+
|value                                      |p_id     |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227                       |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401  |
+-------------------------------------------+---------+

You can avoid this by using a raw string or a multi-line (triple-quoted) string (as mentioned by @mazaneicha) for the pattern:

// pattern as a raw string, keep backslash as-is
df.withColumn("p_id", regexp_extract($"value", raw"(\d+)", 1)).show(false)
+-------------------------------------------+---------+
|value                                      |p_id     |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227                       |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401  |
+-------------------------------------------+---------+

// pattern as a multi-line (triple-quoted) string where the backslash is not escaped
df.withColumn("p_id", regexp_extract($"value", """(\d+)""", 1)).show(false)
+-------------------------------------------+---------+
|value                                      |p_id     |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227                       |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401  |
+-------------------------------------------+---------+
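The three forms are just different spellings of the same five-character pattern string; a quick plain-Scala check (no Spark involved, the variable names are only for illustration):

// All three literals denote the same 5-character string: (\d+)
val p1 = "(\\d+)"       // regular string: the backslash must be escaped
val p2 = raw"(\d+)"     // raw interpolator: the backslash is kept as-is
val p3 = """(\d+)"""    // triple-quoted string: escapes are not processed
assert(p1 == p2 && p2 == p3)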

When backslashes are used with regexp_extract (or regexp_replace, split, str_to_map, etc.) inside a SQL expression (which is itself just a string), e.g. in expr(), df.selectExpr(), spark.sql(), df.filter(), df.where(), etc., you have to escape the backslashes twice: once for the Scala string and once for the SQL parser. For example:

// regular string literals
spark.sql("select *, regexp_extract(value, '(\\\\d+)', 1) as p_id from tbl").show
// raw string to SQL expression
spark.sql(raw"select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl").show
// multi-line string to SQL expression
spark.sql("""select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl""").show

df.withColumn("p_id", expr("regexp_extract(value, '(\\\\d+)', 1)")).show(false)
df.withColumn("p_id", expr(raw"regexp_extract(value, '(\\d+)', 1)")).show(false)
df.withColumn("p_id", expr("""regexp_extract(value, '(\\d+)', 1)""")).show(false)

df.filter("value rlike '\\\\d'").show
df.filter(raw"value rlike '\\d'").show
df.filter("""value rlike '\\d'""").show

Note: there is no raw-string or multi-line-string syntax inside the SQL expression itself, so with the default parser the double escaping above is unavoidable. To get around it, for Spark 2.0+ you can set spark.sql.parser.escapedStringLiterals=true (the default is false); below is the config definition from the Spark source:

val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
  .....
  .doc("When true, string literals (including regex patterns) remain escaped in our SQL " +
    "parser. The default is false since Spark 2.0. Setting it to true can restore the behavior " +
    "prior to Spark 2.0.")
  .version("2.2.1")
  .booleanConf
  .createWithDefault(false)

Examples:

spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")

df.withColumn("p_id", expr("regexp_extract(value, '(\\d+)', 1)")).show(false)
df.withColumn("p_id", expr(raw"regexp_extract(value, '(\d+)', 1)")).show(false)
spark.sql("select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl").show
spark.sql("""select *, regexp_extract(value, '(\d+)', 1) as p_id from tbl""").show
df.filter(raw"value rlike '\d'").show

Side note: the discussion above is about escaping literal backslashes inside strings. If you actually want escaped characters (i.e. newline \n, TAB \t, the NUL char \0 or \u0000, etc.), then no extra backslash is required. For example:

// merge multiple lines into one line
spark.sql("select *, regexp_replace(x,'\n+',',') as y from values ('1,2\n3\n\n4') as (x)").show
// split string into an array using NUL char or/and TAB
spark.sql("select *, split(x,'[\t\u0000]+') as y from values ('s\u0000x\ty\tz') as (x)").show

Upvotes: 3

mck

Reputation: 42422

I have also been bugged by this many times - in the end I just gave up on using any backslashes in my regexes:

SELECT
CAST(regexp_extract(value.project, '.*,([0-9]+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '([0-9]+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '([0-9]+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10
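Since the character class contains no backslash, the same pattern works unchanged in a plain Scala string, a raw string, or a SQL expression; a quick sketch against the sample values from the question (the column name x is only for the example):

// No backslash in the pattern, so there is nothing to escape anywhere.
// '([0-9]+)' grabs the first run of digits, mirroring the '(\d+)' / index-0 patterns from the question.
spark.sql(
  "select x, regexp_extract(x, '([0-9]+)', 1) as id " +
  "from values ('urn:fb:xyx:266730227'), ('urn:fb:pqr:(urn:fb:abc:6217401,10444030746)') as (x)"
).show(false)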

Upvotes: 0
