Reputation: 1029
Need to remove non-printable characters from an RDD.
Sample data is below:
"@TSX•","None"
"@MJU•","None"
Expected output:
@TSX,None
@MJU,None
Tried the below code, but it's not working:
sqlContext.read.option("sep", ","). \
option("encoding", "ISO-8859-1"). \
option("mode", "PERMISSIVE").csv(<path>).rdd.map(lambda s: s.replace("\xe2",""))
Upvotes: 3
Views: 3674
Reputation: 41957
You can use the textFile function of sparkContext, and use string.printable to remove all special characters from the strings.
import string
sc.textFile(<input path to csv file>) \
    .map(lambda x: ','.join([''.join(e for e in y if e in string.printable).strip('"') for y in x.split(',')])) \
    .saveAsTextFile(<output path>)
Explanation
For your input line "@TSX•","None":
x.split(',') splits the line into ["@TSX•", "None"], where y represents each element of the list while iterating.
for e in y if e in string.printable checks whether each character in y is printable; the printable characters are then joined back into a string.
.strip('"') removes the leading and trailing double quotes from the printable string.
Finally, the list of strings is converted to a comma-separated string by ','.join([''.join(e for e in y if e in string.printable).strip('"') for y in x.split(',')]).
I hope the explanation is clear enough to understand.
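Since the lambda operates on a plain string, the per-line transformation can be checked locally without Spark. A minimal sketch, using the sample line from the question (the bullet "•" is U+2022, which is not in string.printable):

```python
import string

# Sample line from the question; "\u2022" (the bullet) is non-printable ASCII.
line = '"@TSX\u2022","None"'

# Same expression as in the map() above, applied to one line:
# split on commas, keep only printable characters, strip the quotes, re-join.
cleaned = ','.join(
    ''.join(e for e in y if e in string.printable).strip('"')
    for y in line.split(',')
)
print(cleaned)  # @TSX,None
```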
Upvotes: 1
Reputation: 43504
One option is to try to filter your text using string.printable:
import string
sqlContext.read\
    .option("sep", ",")\
    .option("encoding", "ISO-8859-1")\
    .option("mode", "PERMISSIVE")\
    .csv(<path>)\
    .rdd\
    .map(lambda row: [''.join(filter(lambda c: c in string.printable, field)) for field in row])
Example
import string
rdd = sc.parallelize(["TSX•,None","MJU•,None", "!@#ABC,*()XYZ"])
print(rdd.map(lambda s: ''.join(filter(lambda x: x in string.printable, s))).collect())
#['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']
Upvotes: 1