Sudeep Singh Thakur

Reputation: 99

replace or remove new line "\n" character from Spark dataset column value

I have the code below to read an XML file:

Dataset<Row> dataset1 = SparkConfigXMLProcessor.sparkSession.read().format("com.databricks.spark.xml")
                .option("rowTag", properties.get(EventHubConsumerConstants.IG_ORDER_TAG).toString())
                .load(properties.get("C:\\inputOrders.xml").toString());

One of the column values contains a newline character. I want to replace it with another character, or simply remove it. Please help.

Upvotes: 4

Views: 19851

Answers (3)

Richard Haussmann

Reputation: 73

This is what I used. I usually add a tab (\t) to the pattern, too. Matching both \r and \n covers Unix and modern macOS (\n), Windows (\r\n), and classic Mac OS (\r) line endings.

Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "\n|\r", ""));
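To check what that pattern matches without starting a Spark job, here is a minimal plain-Java sketch. It assumes Spark's regexp_replace follows java.util.regex semantics, which is my understanding but is not confirmed in this thread. The class name NewlineStripper and the method stripLineBreaks are hypothetical, as is the sample value.

```java
import java.util.regex.Pattern;

public class NewlineStripper {
    // Same alternation as in the answer: a line feed or a carriage return.
    private static final Pattern LINE_BREAKS = Pattern.compile("\n|\r");

    // Hypothetical helper: removes every \n and \r from the value.
    public static String stripLineBreaks(String value) {
        return LINE_BREAKS.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample value mixing Windows (\r\n), Unix (\n) and classic Mac OS (\r) endings.
        String raw = "Chicken\r\nBurger\nLarge\rCombo";
        System.out.println(stripLineBreaks(raw)); // prints "ChickenBurgerLargeCombo"
    }
}
```

Because the replacement is an empty string, words on either side of a line break are joined with nothing between them.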

Upvotes: 3

Yawar

Reputation: 1046

dataset1.withColumn("menuitemname_clean", regexp_replace(col("menuitemname"), "[\n\r]", " "))

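The code above will work: it replaces each newline or carriage return with a space and writes the result to a new column, menuitemname_clean.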
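One detail to watch when replacing with a space: a character class such as [\n\r] treats a Windows \r\n pair as two characters, so it produces two spaces. A possible variation, sketched in plain Java, uses the \R linebreak matcher (Java 8 and later), which consumes \r\n as a single unit. Whether \R behaves the same way inside Spark's regexp_replace is an assumption here, and the class name NewlineToSpace and the method breaksToSpace are hypothetical.

```java
public class NewlineToSpace {
    // \R matches any linebreak sequence and treats \r\n as one unit,
    // so each line break becomes exactly one space.
    public static String breaksToSpace(String value) {
        return value.replaceAll("\\R", " ");
    }

    public static void main(String[] args) {
        String raw = "Veg\r\nWrap";
        System.out.println(breaksToSpace(raw));              // prints "Veg Wrap"
        System.out.println(raw.replaceAll("[\n\r]", " "));   // prints "Veg  Wrap" (two spaces)
    }
}
```

Note that \R also matches less common line separators such as form feed and U+2028, which may or may not be what you want for your data.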

Upvotes: 8

Sudeep Singh Thakur

Reputation: 99

The code below resolved my issue:

Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "[\\n]", ""));

Upvotes: -3
