Reputation: 1332
I would like to remove punctuation marks and convert the text to lowercase in an RDD. Below is my data set:
l=sc.parallelize(["How are you","Hello\ then% you"\
,"I think he's fine+ COMING"])
I tried the function below, but I got an error message:
punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
def lower_clean_str(x):
    lowercased_str = x.lower()
    clean_str = lowercased_str.translate(punc)
    return clean_str
one_RDD = l.flatMap(lambda x: lower_clean_str(x).split())
one_RDD.collect()
But this gives me an error. What might be the problem? How can I fix this? Thank you.
Upvotes: 5
Views: 11373
Reputation: 12920
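You are calling Python's translate method incorrectly: it expects a translation table, not a string of characters to delete. Its signature also differs between versions. In Python 2, str.translate takes a 256-character table plus an optional deletechars argument; in Python 3, it takes a single mapping, usually built with str.maketrans.
Since I am not sure whether you are using Python 2.7 or Python 3, I am suggesting an alternative approach that avoids translate altogether.
The following code works regardless of the Python version: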
def lower_clean_str(x):
    punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    lowercased_str = x.lower()
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str
l=sc.parallelize(["How are you","Hello\ then% you","I think he's fine+ COMING"])
one_RDD = l.map(lower_clean_str)
one_RDD.collect()
Output:
['how are you', 'hello then you', 'i think hes fine coming']
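If you know you are on Python 3, translate itself can do the job once it is given a proper table. The following is a minimal sketch, assuming Python 3 only; the name PUNC_TABLE is made up for illustration, and the sample strings are run through a plain list instead of an RDD so the snippet works without Spark.

```python
import string

# Python 3 only: the third argument to str.maketrans lists characters to delete.
# string.punctuation contains the same characters as the punc string used above.
PUNC_TABLE = str.maketrans('', '', string.punctuation)  # hypothetical name

def lower_clean_str(x):
    """Lowercase x and strip ASCII punctuation in a single pass."""
    return x.lower().translate(PUNC_TABLE)

# Same sample data as in the question, with the backslash escaped explicitly.
samples = ["How are you", "Hello\\ then% you", "I think he's fine+ COMING"]
cleaned = [lower_clean_str(s) for s in samples]
print(cleaned)
```

With an RDD, the same function should slot in unchanged as l.map(lower_clean_str), since it only depends on the standard library.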
Upvotes: 6