Reputation: 73
When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as intended, splitting the string with a simple regular expression.
import org.apache.spark.sql.SparkSession
// Create session
val sparkSession = SparkSession.builder.master("local").getOrCreate()
// Use SparkSQL to split a string
val query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
println("The query is: " + query)
val dataframe = sparkSession.sql(query)
// Show the result
dataframe.show(1, false)
giving the expected output
+---------------------------------+
|result                           |
+---------------------------------+
|[What is this,  A string I think]|
+---------------------------------+
But I am confused about the need to escape the literal question mark with not a single but a double backslash (represented here as four backslashes, since we must of course escape backslashes in Scala when not using triple-quoting).
I confirmed that some very similar code written by a colleague of mine for Spark 1.5 works just fine with a single (literal) backslash. But if I use only a single literal backslash in Spark 2.1, I get this error from the JVM's regex engine: "Dangling meta character '?' near index 0". I am aware this means the question mark was not escaped properly, but it smells as if the backslash itself has to be escaped, first for Scala and then for SQL.
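For illustration (this annotation is my own reading, assuming the default Spark 2.x parser settings), here is roughly what each layer sees:
// Scala source "\\\\?"  ->  SQL text \\?  ->  after SQL unescaping \?  ->  regex engine: literal '?'
val doubleEscaped = "SELECT split('What is this? A string I think', '\\\\?') AS result"
sparkSession.sql(doubleEscaped).show(1, false)    // works

// Scala source "\\?"    ->  SQL text \?   ->  after SQL unescaping ?   ->  regex engine: dangling '?'
val singleEscaped = "SELECT split('What is this? A string I think', '\\?') AS result"
// sparkSession.sql(singleEscaped).show(1, false) // throws "Dangling meta character '?' near index 0"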
I'm guessing that this can be useful for inserting control characters (like newline) into the SQL query itself. I'm just confused about whether this changed somewhere between Spark 1.5 and 2.1.
I have googled quite a bit for this, but didn't find anything. Either something has changed, or my colleague's code works in an unintended way.
I also tried this with Python/pyspark, and the same condition applies - double backslashes are needed in the SQL.
Can anyone explain this?
I'm running on a relatively simple setup on Windows, with Spark 2.1.0, JDK 1.8.0_111, and the Hadoop winutils.exe.
Upvotes: 7
Views: 12327
Reputation: 564
Please do not compare your Spark 2.1 behaviour with your colleague's Spark 1.5; when it comes to escape characters, they are expected to behave differently. Quoting from the Spark docs:
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
and
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing.
So check your setting with spark.conf.get("spark.sql.parser.escapedStringLiterals") and, depending on whether it is true or false, use a single or double escape character.
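As a minimal sketch (assuming a SparkSession named spark and a Spark version in which this config is available):
// Read the flag, falling back to "false" (the Spark 2.x default) if it is unset.
val escaped = spark.conf.get("spark.sql.parser.escapedStringLiterals", "false")

if (escaped == "true") {
  // Spark 1.6-style parsing: the SQL parser leaves backslashes alone, so one escape is enough.
  spark.sql("SELECT split('What is this? A string I think', '\\?') AS result").show(false)
} else {
  // Default Spark 2.x parsing: the SQL parser unescapes string literals once, so double the backslash.
  spark.sql("SELECT split('What is this? A string I think', '\\\\?') AS result").show(false)
}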
Upvotes: 1
Reputation: 19308
Here are some different ways to get the same result:
Triple quotes
spark.sql("""SELECT split('What is this? A string I think', '\\?') AS result""").show(false)
Regex character escaping
spark.sql("""SELECT split('What is this? A string I think', '\\Q?\\E') AS result""").show(false)
Pattern.quote
Suppose your string was in a DataFrame.
import spark.implicits._  // needed for toDF and the $"colname" syntax below

val df = Seq(
  "What is this? A string I think"
).toDF("your_string")
You could leverage the Java regex quoting function to split the string as follows:
import java.util.regex.Pattern
import org.apache.spark.sql.functions._
df
  .withColumn("split_string", split($"your_string", Pattern.quote("?")))
  .show(false)
Here's the output:
+------------------------------+---------------------------------+
|your_string                   |split_string                     |
+------------------------------+---------------------------------+
|What is this? A string I think|[What is this,  A string I think]|
+------------------------------+---------------------------------+
See this post for more info.
Upvotes: 0
Reputation: 76
Maybe it is because the backslash is a special symbol, used to continue multi-line SQL strings, as in this Python example:
# var_1 is assumed to be defined elsewhere; the trailing backslash continues the string onto the next line
sql_1 = spark.sql("SELECT \
1 AS `col1`, '{0}' AS `col2`".format(var_1))
Upvotes: 5