Reputation: 1029
Need to remove non-printable characters from an RDD.
Sample data is below:
"@TSX•","None"
"@MJU•","None"
Expected output:
@TSX,None
@MJU,None
Tried the below code, but it's not working:
sqlContext.read.option("sep", ","). \
option("encoding", "ISO-8859-1"). \
option("mode", "PERMISSIVE").csv(<path>).rdd.map(lambda s: s.replace("\xe2",""))
Upvotes: 3
Views: 3674
Reputation: 41957
You can use the textFile function of sparkContext, and use string.printable to remove all special characters from the strings.
import string
sc.textFile(<input path to csv file>) \
    .map(lambda x: ','.join([''.join(e for e in y if e in string.printable).strip('"') for y in x.split(',')])) \
    .saveAsTextFile(<output path>)
Explanation
For your input line "@TSX•","None":
x.split(',') splits the line into ["@TSX•", "None"], where y represents each element of the list while iterating.
for e in y if e in string.printable checks whether each character in y is printable; the printable characters are then joined back into a string.
.strip('"') removes the leading and trailing double quotes from the printable string.
Finally, the list of strings is converted to a comma-separated string by ','.join([''.join(e for e in y if e in string.printable).strip('"') for y in x.split(',')]).
I hope the explanation is clear enough to understand.
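Since the lambda operates on a plain string, the per-line transformation can be checked locally without Spark. A minimal sketch, using the sample line from the question (the bullet "•" is U+2022, which is not in string.printable):

```python
import string

# Sample line from the question; "\u2022" (the bullet) is non-printable ASCII.
line = '"@TSX\u2022","None"'

# Same expression as in the map() above, applied to one line:
# split on commas, keep only printable characters, strip the quotes, re-join.
cleaned = ','.join(
    ''.join(e for e in y if e in string.printable).strip('"')
    for y in line.split(',')
)
print(cleaned)  # @TSX,None
```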
Upvotes: 1
Reputation: 43504
One option is to try to filter your text using string.printable:
import string
sqlContext.read\
    .option("sep", ",")\
    .option("encoding", "ISO-8859-1")\
    .option("mode", "PERMISSIVE")\
    .csv(<path>)\
    .rdd\
    .map(lambda row: [''.join(filter(lambda c: c in string.printable, field)) for field in row])
Example
import string
rdd = sc.parallelize(["TSX•,None","MJU•,None", "!@#ABC,*()XYZ"])
print(rdd.map(lambda s: ''.join(filter(lambda x: x in string.printable, s))).collect())
#['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']
Upvotes: 1