Reputation: 3727
We recently received a file to be ingested. The file is in PSV (pipe-separated values) format, but every field is padded with the extra characters $~$ on both the left and the right, so the entire file looks like this:
$~$Field1$~$|$~$Field2$~$|$~$Field3$~$
$~$Data1$~$|$~$Data2$~$|$~$Data3$~$
$~$Data4$~$|$~$Data5$~$|$~$Data6$~$
$~$Data7$~$|$~$Data8$~$|$~$Data9$~$
$~$Data10$~$|$~$Data11$~$|$~$Data12$~$ .....
There are 100 million rows in the file.
What would be the best way to trim this padding so that the file becomes a standard PSV?
Thank you very much; any suggestion or experience shared here is appreciated.
UPDATE:
The data is received via SFTP and uploaded to Hadoop by IT Data Support (Unix admins). We only have access to the Hadoop cluster, but if it is an easy job for Data Support, maybe I can convince them to do the preprocessing. Thanks.
Upvotes: 0
Views: 938
Reputation: 12542
tr might be the faster solution. Note that you can pipe any stream into it; in this case I'm cat-ing a file on disk, but the input could just as well be the file stream coming from SFTP. Also note that tr -d deletes every occurrence of each character in the given set (here $ and ~), not the literal string $~$, which works as long as $ and ~ never appear inside the real data.
~/Desktop/test $ cat data.txt
$~$Field1$~$|$~$Field2$~$|$~$Field3$~$
$~$Data1$~$|$~$Data2$~$|$~$Data3$~$
$~$Data4$~$|$~$Data5$~$|$~$Data6$~$
$~$Data7$~$|$~$Data8$~$|$~$Data9$~$
# the '>' will open a new file for writing
~/Desktop/test $ cat data.txt | tr -d '$~$' > output.psv
# see the results here
~/Desktop/test $ cat output.psv
Field1|Field2|Field3
Data1|Data2|Data3
Data4|Data5|Data6
Data7|Data8|Data9
More examples: https://shapeshed.com/unix-tr/#what-is-the-tr-command-in-unix
Upvotes: 0
Reputation: 8711
Using regexp_replace and foldLeft to update all the columns. Check this out:
scala> val df = Seq(("$~$Data1$~$","$~$Data2$~$","$~$Data3$~$"), ("$~$Data4$~$","$~$Data5$~$","$~$Data6$~$"), ("$~$Data7$~$","$~$Data8$~$","$~$Data9$~$"),("$~$Data10$~$","$~$Data11$~$","$~$Data12$~$")).toDF("Field1","field2","field3")
df: org.apache.spark.sql.DataFrame = [Field1: string, field2: string ... 1 more field]
scala> df.show(false)
+------------+------------+------------+
|Field1 |field2 |field3 |
+------------+------------+------------+
|$~$Data1$~$ |$~$Data2$~$ |$~$Data3$~$ |
|$~$Data4$~$ |$~$Data5$~$ |$~$Data6$~$ |
|$~$Data7$~$ |$~$Data8$~$ |$~$Data9$~$ |
|$~$Data10$~$|$~$Data11$~$|$~$Data12$~$|
+------------+------------+------------+
scala> val df2 = df.columns.foldLeft(df) { (acc,x) => acc.withColumn(x,regexp_replace(col(x),"""^\$~\$|\$~\$$""","")) }
df2: org.apache.spark.sql.DataFrame = [Field1: string, field2: string ... 1 more field]
scala> df2.show(false)
+------+------+------+
|Field1|field2|field3|
+------+------+------+
|Data1 |Data2 |Data3 |
|Data4 |Data5 |Data6 |
|Data7 |Data8 |Data9 |
|Data10|Data11|Data12|
+------+------+------+
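In the question the data arrives as a 100-million-row file rather than an in-memory Seq, so here is a minimal sketch of the same regexp_replace + foldLeft applied to a DataFrame read straight from the padded file (the HDFS path is a made-up placeholder, not something given in the question):
import org.apache.spark.sql.functions.{col, regexp_replace}

// hypothetical location of the padded file on the cluster
val raw = spark.read.option("delimiter", "|").csv("/data/incoming/padded_file.psv")

// same foldLeft as above, applied to the auto-named columns (_c0, _c1, ...)
val cleaned = raw.columns.foldLeft(raw) { (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), """^\$~\$|\$~\$$""", ""))
}
cleaned.show(false)
Since no header option is set, the first row ($~$Field1$~$|...) is treated as data and has its padding stripped as well, which is what you want if the goal is simply to write the cleaned rows back out as a standard PSV.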
Upvotes: 0
Reputation: 774
Here is a pure Spark solution. There might be better-performing alternatives.
import org.apache.spark.sql.functions.{col, lit, udf}

// read the raw PSV; every field still carries the $~$ padding
val df = spark.read.option("delimiter", "|").csv(filePath)

// UDF that removes every occurrence of the padding token from a value
val replace = (value: String, find: String, replacement: String) => value.replace(find, replacement)
val replaceUdf = udf(replace)
// apply the UDF to every column
df.select(
  df.columns.map(c => replaceUdf(col(c), lit("$~$"), lit("")).alias(c)): _*
).show
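If the goal is to land the result back on the cluster as a standard PSV rather than just display it, a write along these lines should work (the output path is a made-up placeholder):
// write the cleaned columns out as plain pipe-delimited files
df.select(
  df.columns.map(c => replaceUdf(col(c), lit("$~$"), lit("")).alias(c)): _*
).write
  .option("delimiter", "|")
  .mode("overwrite")
  .csv("/data/clean/standard_psv")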
Update: You cannot use $~$ as the quote option or $~$|$~$ as the delimiter in Spark 2.3.0, because those options accept only a single character.
Upvotes: 0