mdivk

Reputation: 3727

How to trim fields when loading into a DataFrame in Spark?

We recently received a file to be ingested. The file is in PSV format; however, all the fields are padded with the extra characters $~$ on the left and right, so the entire PSV looks like this:

$~$Field1$~$|$~$Field2$~$|$~$Field3$~$
$~$Data1$~$|$~$Data2$~$|$~$Data3$~$
$~$Data4$~$|$~$Data5$~$|$~$Data6$~$
$~$Data7$~$|$~$Data8$~$|$~$Data9$~$
$~$Data10$~$|$~$Data11$~$|$~$Data12$~$
.....

There are 100 million rows in the file.

What would be the best way to trim this padding so as to turn it into a standard PSV?

Thank you very much; any suggestion is appreciated.

UPDATE:

The data is received via SFTP and uploaded to Hadoop by IT Data Support (Unix admins). We only have access to the Hadoop cluster, but if it is an easy job for Data Support, maybe I can convince them to do the preprocessing. Thanks.

Upvotes: 0

Views: 938

Answers (3)

4m1r

Reputation: 12542

tr might be the faster solution. Note that you can pipe in any stream, so in this case I'm cat-ing a file on disk, but this could also be a file stream from SFTP. Keep in mind that tr -d deletes every occurrence of the listed characters ($ and ~) rather than the literal $~$ string; that is fine here because those characters only ever appear in the padding.

~/Desktop/test $ cat data.txt
$~$Field1$~$|$~$Field2$~$|$~$Field3$~$
$~$Data1$~$|$~$Data2$~$|$~$Data3$~$
$~$Data4$~$|$~$Data5$~$|$~$Data6$~$
$~$Data7$~$|$~$Data8$~$|$~$Data9$~$

# the '>' will open a new file for writing
~/Desktop/test $ cat data.txt | tr -d \$~\$ > output.psv

# see the results here
~/Desktop/test $ cat output.psv
Field1|Field2|Field3
Data1|Data2|Data3
Data4|Data5|Data6
Data7|Data8|Data9

Examples: https://shapeshed.com/unix-tr/#what-is-the-tr-command-in-unix

Upvotes: 0

stack0114106

Reputation: 8711

Use regexp_replace with foldLeft to update all the columns. Check this out:

scala> val df = Seq(("$~$Data1$~$","$~$Data2$~$","$~$Data3$~$"), ("$~$Data4$~$","$~$Data5$~$","$~$Data6$~$"), ("$~$Data7$~$","$~$Data8$~$","$~$Data9$~$"),("$~$Data10$~$","$~$Data11$~$","$~$Data12$~$")).toDF("Field1","field2","field3")
df: org.apache.spark.sql.DataFrame = [Field1: string, field2: string ... 1 more field]

scala> df.show(false)
+------------+------------+------------+
|Field1      |field2      |field3      |
+------------+------------+------------+
|$~$Data1$~$ |$~$Data2$~$ |$~$Data3$~$ |
|$~$Data4$~$ |$~$Data5$~$ |$~$Data6$~$ |
|$~$Data7$~$ |$~$Data8$~$ |$~$Data9$~$ |
|$~$Data10$~$|$~$Data11$~$|$~$Data12$~$|
+------------+------------+------------+


scala> val df2 = df.columns.foldLeft(df) { (acc,x) => acc.withColumn(x,regexp_replace(col(x),"""^\$~\$|\$~\$$""","")) }
df2: org.apache.spark.sql.DataFrame = [Field1: string, field2: string ... 1 more field]

scala> df2.show(false)
+------+------+------+
|Field1|field2|field3|
+------+------+------+
|Data1 |Data2 |Data3 |
|Data4 |Data5 |Data6 |
|Data7 |Data8 |Data9 |
|Data10|Data11|Data12|
+------+------+------+


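The snippet above builds a toy DataFrame from a Seq. Against the actual file, the same foldLeft pattern could follow a plain pipe-delimited read. A rough sketch, assuming the padded first row shown in the question is used as the header and using a hypothetical HDFS path:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Read the pipe-separated file; the padded first row becomes the (padded) header.
val raw = spark.read
  .option("delimiter", "|")
  .option("header", "true")
  .csv("/data/incoming/padded_file.psv")   // hypothetical path

// The column names come from the padded header, so strip the padding from them first.
val renamed = raw.columns.foldLeft(raw) { (acc, c) =>
  acc.withColumnRenamed(c, c.replaceAll("""^\$~\$|\$~\$$""", ""))
}

// Then strip the leading/trailing $~$ from every value, as above.
val cleanedDf = renamed.columns.foldLeft(renamed) { (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), """^\$~\$|\$~\$$""", ""))
}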

Upvotes: 0

Hedrack

Reputation: 774

Here is a pure Spark solution. There might be better performing solutions.

import org.apache.spark.sql.functions.{col, lit, udf}

val df = spark.read.option("delimiter", "|").csv(filePath)

// UDF that replaces every occurrence of `find` in a column value
val replace = (value: String, find: String, replacement: String) => value.replace(find, replacement)
val replaceUdf = udf(replace)

// Apply the UDF to every column, keeping the original column names
val cleaned = df.select(df.columns.map(c => replaceUdf(col(c), lit("$~$"), lit("")).alias(c)): _*)
cleaned.show

Update: You cannot use $~$ as the quote option or $~$|$~$ as the delimiter in Spark 2.3.0, as those options accept only a single character.
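If the end goal is a standard PSV landed back on the cluster, the trimmed DataFrame can then be written out with the same delimiter. A minimal sketch, assuming the trimmed result is held in the cleaned value from the snippet above and using a hypothetical output path (Spark writes a directory of part files, as usual):

// Write the trimmed data back out as plain pipe-separated files (hypothetical path)
cleaned.write
  .option("delimiter", "|")
  .csv("/data/clean/standard_psv")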

Upvotes: 0
