melik

Reputation: 1332

PySpark: how to remove punctuation marks and lowercase the text in an RDD?

I would like to remove punctuation marks and convert the text to lowercase in an RDD. Below is my data set:

 l=sc.parallelize(["How are you","Hello\ then% you"\
,"I think he's fine+ COMING"])

I tried the function below, but I got an error message:

punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

def lower_clean_str(x):
    lowercased_str = x.lower()
    clean_str = lowercased_str.translate(punc) 
    return clean_str

one_RDD = l.flatMap(lambda x: lower_clean_str(x).split())
one_RDD.collect()

What might be the problem, and how can I fix it? Thank you.

Upvotes: 5

Views: 11373

Answers (1)

Gaurang Shah

Reputation: 12920

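You are calling Python's translate method incorrectly: it expects a translation table, not a string of characters to delete. Since I am not sure whether you are using Python 2.7 or Python 3, I am suggesting an alternative approach.

The signature of translate differs between the two versions. In Python 2, str.translate(table[, deletechars]) takes a 256-character table plus an optional string of characters to delete; in Python 3, str.translate(table) takes a single mapping of Unicode code points, usually built with str.maketrans.

The following code works regardless of the Python version: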

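def lower_clean_str(x):
    # ASCII punctuation characters to strip from the string
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    lowercased_str = x.lower()
    # Remove each punctuation character one at a time
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str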

l = sc.parallelize(["How are you", "Hello\\ then% you", "I think he's fine+ COMING"])
one_RDD = l.map(lower_clean_str)
one_RDD.collect()

Output:

['how are you', 'hello then you', 'i think hes fine coming']
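If you know you are on Python 3, translate can still do the job once it is given a proper table. The following is a minimal sketch, not part of the original answer: the names lower_clean_str_py3 and _DELETE_PUNCT are made up for illustration, and string.punctuation is assumed to be an acceptable stand-in for the punc string above (it contains the same ASCII characters). The sample list is plain Python so the snippet runs without a Spark context.

```python
import string

# Python 3 only: build a table that maps every ASCII punctuation
# character to None, i.e. deletes it during translate().
_DELETE_PUNCT = str.maketrans('', '', string.punctuation)


def lower_clean_str_py3(x):
    """Lowercase x and strip ASCII punctuation in a single pass."""
    return x.lower().translate(_DELETE_PUNCT)


# Same sample sentences as in the question, as a plain list.
sample = ["How are you", "Hello\\ then% you", "I think he's fine+ COMING"]
cleaned = [lower_clean_str_py3(s) for s in sample]
print(cleaned)
```

With an RDD, the same function can be passed to l.map(lower_clean_str_py3). If you want individual words, as in the question, use l.flatMap(lambda x: lower_clean_str_py3(x).split()) instead.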

Upvotes: 6
