Priya Chauhan

Reputation: 485

Replace Newline character, Backspace character and carriage return character in pyspark dataframe

I have a file with 3 columns, and every column is populated. The data is formatted as follows:

Every field is enclosed in backspace characters, e.g. BSC123BSC (where BSC stands for a backspace character). Column values contain newline and carriage return characters. Columns are delimited by the Escape character.

I can't work out a regex pattern that replaces all three of these characters.

Upvotes: 0

Views: 5357

Answers (1)

mazaneicha

Reputation: 9427

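A simple character class, [\b\n\r], should work. Written as a regular (non-raw) Python string, it matches the backspace, newline and carriage return characters: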

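>>> from pyspark.sql.functions import decode, unhex, regexp_replace
>>> # Build sample rows from hex so the control characters (08, 0d, 0a) are unambiguous
>>> data = spark.createDataFrame([('086261636b73706163650d43520a4c4608',),('080d0a08',)],['hexstring']).withColumn('col',decode(unhex('hexstring'),'UTF-8')).drop('hexstring')
>>> # Replace every backspace, newline and carriage return with '*' to make the matches visible
>>> cleansed = data.withColumn('regexed',regexp_replace('col','[\b\n\r]','*'))
>>> cleansed.select('regexed').show()
+-----------------+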
|          regexed|
+-----------------+
|*backspace*CR*LF*|
|             ****|
+-----------------+
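
To see why the pattern matches, here is a minimal plain-Python sketch that needs no Spark session. It assumes the same hex-encoded sample rows as above. The names PATTERN and clean are illustrative only and not part of any API. Because the pattern is a regular (non-raw) string, Python converts \b, \n and \r into the actual control characters before any regex engine sees them, so the character class contains those three characters literally.

```python
import re

# Non-raw string: Python turns the escapes into real control characters,
# so the class holds backspace (0x08), line feed (0x0A) and carriage return (0x0D).
PATTERN = '[\b\n\r]'


def clean(value: str, replacement: str = '*') -> str:
    """Replace every backspace, LF and CR in value (hypothetical helper)."""
    return re.sub(PATTERN, replacement, value)


# Same sample rows as in the answer, decoded from hex.
samples = [
    bytes.fromhex(h).decode('utf-8')
    for h in ('086261636b73706163650d43520a4c4608', '080d0a08')
]

for s in samples:
    print(clean(s))
```

Passing an empty string as the replacement removes the characters instead of marking them, and the same holds for regexp_replace in the PySpark version.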


Upvotes: 2
