LUZO

Reputation: 1029

How to delete non-printable characters in an RDD using PySpark

I need to remove non-printable characters from an RDD.

Sample data:

"@TSX•","None"
"@MJU•","None"

Expected output:

@TSX,None
@MJU,None

I tried the code below, but it's not working:

sqlContext.read.option("sep", ","). \
                option("encoding", "ISO-8859-1"). \
                option("mode", "PERMISSIVE").csv(<path>).rdd.map(lambda s: s.replace("\xe2",""))

Upvotes: 3

Views: 3674

Answers (2)

Ramesh Maharjan

Reputation: 41957

You can read the file with the textFile function of sparkContext and use string.printable to drop every character that is not printable ASCII.

import string
# input_path and output_path are placeholders for your own CSV input and output locations
sc.textFile(input_path)\
    .map(lambda x: ','.join([''.join(e for e in y if e in string.printable).strip('"') for y in x.split(',')]))\
    .saveAsTextFile(output_path)

Explanation

For your input line "@TSX•","None":
for y in x.split(',') splits the line into the list ['"@TSX•"', '"None"'], where y is each element in turn
for e in y if e in string.printable checks whether each character e of y is printable
the printable characters are then joined back into a string
.strip('\"') removes the leading and trailing double quotes from that string
finally, the list of cleaned strings is converted to a comma-separated string by ','.join([''.join(e for e in y if e in string.printable).strip('\"') for y in x.split(',')])
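The steps above can also be tried outside Spark. The following is a minimal sketch in plain Python; the helper name clean_line and the PRINTABLE set are illustrative and not part of the original answer.

```python
import string

# Set lookup is faster than scanning the string.printable str for every character.
PRINTABLE = set(string.printable)

def clean_line(line):
    """Drop non-printable characters from each comma-separated field
    and strip the surrounding double quotes."""
    fields = line.split(',')
    cleaned = [''.join(ch for ch in field if ch in PRINTABLE).strip('"')
               for field in fields]
    return ','.join(cleaned)

print(clean_line('"@TSX•","None"'))  # @TSX,None
print(clean_line('"@MJU•","None"'))  # @MJU,None
```

A named function like this could be passed to the RDD as .map(clean_line) in place of the lambda. Note that splitting on ',' treats a comma inside a quoted field as a separator, so this only suits data like the sample, where fields contain no commas.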

I hope the explanation is clear enough to understand

Upvotes: 1

pault

Reputation: 43504

One option is to try to filter your text using string.printable:

import string
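sqlContext.read\
    .option("sep", ",")\
    .option("encoding", "ISO-8859-1")\
    .option("mode", "PERMISSIVE")\
    .csv(<path>)\
    .rdd\
    .map(lambda row: [''.join(c for c in (f or '') if c in string.printable) for f in row])
# Each element of .rdd is a Row, so the filter is applied per field; null fields become ''.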

Example

import string
rdd = sc.parallelize(["TSX•,None", "MJU•,None", "!@#ABC,*()XYZ"])

# ''.join is needed on Python 3, where filter returns an iterator rather than a str
print(rdd.map(lambda s: ''.join(filter(lambda x: x in string.printable, s))).collect())
#['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']
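One possible reason replace("\xe2", "") alone is not enough: assuming the file is actually UTF-8 encoded (an assumption, the question does not say), reading it as ISO-8859-1 turns the single bullet character into three separate characters, and only the first of them is "\xe2". The sketch below simulates that in plain Python; the variable names are illustrative.

```python
import string

original = "@TSX•"
# Simulate reading UTF-8 bytes with the ISO-8859-1 encoding option
# (assumes the source file is UTF-8, which the question does not confirm).
misdecoded = original.encode("utf-8").decode("iso-8859-1")

# Removing only "\xe2" leaves the other two characters of the bullet behind.
only_e2_removed = misdecoded.replace("\xe2", "")

# Filtering on string.printable removes all three at once.
printable_only = ''.join(ch for ch in misdecoded if ch in string.printable)

print(len(original), len(misdecoded))  # 5 7
print(len(only_e2_removed))            # 6
print(printable_only)                  # @TSX
```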

Upvotes: 1
