PySpark: how to remove punctuation marks and convert letters to lowercase in an RDD?

Question

I would like to remove punctuation marks and convert all letters to lowercase in an RDD. Below is my data set:

l = sc.parallelize(["How are you", "Hello\ then% you", "I think he's fine+ COMING"])

I tried the function below, but I got an error message:

punc='!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'

def lower_clean_str(x):
    lowercased_str = x.lower()
    clean_str = lowercased_str.translate(punc) 
    return clean_str

one_RDD = l.flatMap(lambda x: lower_clean_str(x).split())
one_RDD.collect()

But this gives me an error. What might be the problem? How can I fix this? Thank you.

Gaurang Shah · Accepted Answer

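You are using Python's translate method incorrectly: it expects a translation table, not a string of characters to delete. Since I am not sure whether you are using Python 2.7 or Python 3, I am suggesting an alternative approach.

The signature of translate differs between the two versions. In Python 2, str.translate takes a 256-character table plus an optional string of characters to delete; in Python 3, it takes a single mapping, usually built with str.maketrans.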

The following code will work irrespective of the python version.

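def lower_clean_str(x):
    # Characters to strip; the backslash is escaped so Python does not warn
    # about an invalid escape sequence.
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    lowercased_str = x.lower()
    # Remove each punctuation character in turn (works on Python 2 and 3).
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str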

l=sc.parallelize(["How are you","Hello\ then% you","I think he's fine+ COMING"])
one_RDD = l.map(lower_clean_str)
one_RDD.collect()

Output:

['how are you', 'hello then you', 'i think hes fine coming']
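
If you know you are on Python 3, the translate approach from the question can also be made to work by passing a deletion table instead of the raw punctuation string. The following is only a minimal sketch under that assumption: the function name lower_clean_str_translate is made up for illustration, and string.punctuation is used in place of the hand-written punc string (it contains the same ASCII characters). The RDD call is shown as a comment so the snippet runs without a SparkContext.

```python
import string

# Assumption: Python 3. str.maketrans('', '', chars) builds a table that
# maps every character in `chars` to None, which makes translate delete it.
_DELETE_PUNCT = str.maketrans('', '', string.punctuation)


def lower_clean_str_translate(x):
    """Lowercase x and delete ASCII punctuation in a single translate pass."""
    return x.lower().translate(_DELETE_PUNCT)


# With the RDD from the question, this would be applied as:
#   one_RDD = l.map(lower_clean_str_translate)
# Plain-Python check on the same data (the backslash is escaped here, so the
# string value is identical to the one in the question):
samples = ["How are you", "Hello\\ then% you", "I think he's fine+ COMING"]
cleaned = [lower_clean_str_translate(s) for s in samples]
print(cleaned)
```

The table is built once at module level so it is not recreated for every record. If you need individual words, as in the question's flatMap, you can call .split() on the cleaned string inside flatMap instead of using map.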
