I'm trying to extract the last set of numbers from expressions like these:
urn:fb:xyx:266730227 -> 266730227
urn:fb:pqr:(urn:fb:abc:6217401,10444030746) -> 10444030746
This code works in Hive:
SELECT
CAST(regexp_extract(value.project, '.*,(\d+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '(\d+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '(\d+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10
However, in Spark Scala this throws the following error:
FAILED, exitCode: 15, (reason: User class threw exception: scala.StringContext$InvalidEscapeException: invalid escape '\d' not one of [\b, \t, \n, \f, \r, \\, \", \'] at index 94 in "
|SELECT
| *,
| CAST(regexp_extract(value.project, '.*,(\d+)', 1) AS BIGINT) AS p_id,
| CAST(regexp_extract(value.context, '(\d+)', 0) AS BIGINT) AS co_id,
| CAST(regexp_extract(key.identity, '(\d+)', 0) AS BIGINT) AS ca_id
|FROM ". Use \\ for literal \.
When I try doing what the error recommends:
SELECT
CAST(regexp_extract(value.project, '.*,(\\d+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '(\\d+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '(\\d+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10
The result columns are all NULL.
Upvotes: 1
Views: 1118
Reputation: 14008
It's all about how to escape characters inside String literals. By default, in a regular Scala string literal the backslash is an escape character, so you have to manually escape literal backslashes (for the regex pattern) with another backslash. For example:
val df = Seq("urn:fb:xyx:266730227", "urn:fb:pqr:(urn:fb:abc:6217401,10444030746)").toDF("value")
df.createOrReplaceTempView("tbl")
// pattern as a regular string literal
df.withColumn("p_id", regexp_extract($"value", "(\\d+)", 1)).show(false)
+-------------------------------------------+---------+
|value |p_id |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227 |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401 |
+-------------------------------------------+---------+
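Applied to the pattern from the question, the same doubling works when you want the last number after a comma. A minimal sketch using the df defined above, where .cast("bigint") mirrors the CAST(... AS BIGINT) in the original query (the first row has no comma, so the match is empty and the cast yields null):
// the question's '.*,(\d+)' pattern, doubled because this is a regular Scala string literal
df.withColumn("p_id", regexp_extract($"value", ".*,(\\d+)", 1).cast("bigint")).show(false)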
You can skip this by using a raw string or a multi-line string (as mentioned by @mazaneicha) for the pattern:
// pattern as a raw string, keep backslash as-is
df.withColumn("p_id", regexp_extract($"value", raw"(\d+)", 1)).show(false)
+-------------------------------------------+---------+
|value |p_id |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227 |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401 |
+-------------------------------------------+---------+
// pattern as a multi-line (triple-quoted) string where backslash is not escaped
df.withColumn("p_id", regexp_extract($"value", """(\d+)""", 1)).show(false)
+-------------------------------------------+---------+
|value |p_id |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227 |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401 |
+-------------------------------------------+---------+
When backslashes are used with regexp_extract (regexp_replace, split, str_to_map, etc.) inside an SQL expression (which is basically a string), e.g. expr(), df.selectExpr(), spark.sql(), df.filter(), df.where(), etc., you will have to double-escape the backslashes, for example:
// regular string literals
spark.sql("select *, regexp_extract(value, '(\\\\d+)', 1) as p_id from tbl").show
// raw string to SQL expression
spark.sql(raw"select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl").show
// multi-line string to SQL expression
spark.sql("""select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl""").show
df.withColumn("p_id", expr("regexp_extract(value, '(\\\\d+)', 1)")).show(false)
df.withColumn("p_id", expr(raw"regexp_extract(value, '(\\d+)', 1)")).show(false)
df.withColumn("p_id", expr("""regexp_extract(value, '(\\d+)', 1)""")).show(false)
df.filter("value rlike '\\\\d'").show
df.filter(raw"value rlike '\\d'").show
df.filter("""value rlike '\\d'""").show
Note: there is no syntax to handle Scala raw strings or multi-line strings inside SQL expressions. To get the same behavior, on Spark 2.2.1+ you can set spark.sql.parser.escapedStringLiterals=true (default is false, see reference); below is the definition from that link:
val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
  .....
  .doc("When true, string literals (including regex patterns) remain escaped in our SQL " +
    "parser. The default is false since Spark 2.0. Setting it to true can restore the behavior " +
    "prior to Spark 2.0.")
  .version("2.2.1")
  .booleanConf
  .createWithDefault(false)
Examples:
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
df.withColumn("p_id", expr("regexp_extract(value, '(\\d+)', 1)")).show(false)
df.withColumn("p_id", expr(raw"regexp_extract(value, '(\d+)', 1)")).show(false)
spark.sql("select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl").show
spark.sql("""select *, regexp_extract(value, '(\d+)', 1) as p_id from tbl""").show
df.filter(raw"value rlike '\d'").show
Side note: the discussion above is about escaping literal backslashes inside strings. In case you want actual escape characters (i.e. newline \n, TAB \t, the NUL char \0 or \u0000, etc.), then no extra backslash is required, for example:
// merge multiple lines into one line
spark.sql("select *, regexp_replace(x,'\n+',',') as y from values ('1,2\n3\n\n4') as (x)").show
// split string into an array using NUL char or/and TAB
spark.sql("select *, split(x,'[\t\u0000]+') as y from values ('s\u0000x\ty\tz') as (x)").show
Upvotes: 3
Reputation: 42422
I have also been bugged by this many times; in the end I just gave up on using any backslashes in my regexes:
SELECT
CAST(regexp_extract(value.project, '.*,([0-9]+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '([0-9]+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '([0-9]+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10
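Since [0-9] contains no backslash, the same pattern survives unchanged whether it goes through a Scala string literal, expr(), or spark.sql(). A quick sketch against the tbl view from the other answer:
// nothing to double-escape anywhere: character classes avoid backslashes entirely
spark.sql("select *, regexp_extract(value, '.*,([0-9]+)', 1) as p_id from tbl").show(false)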
Upvotes: 0