Reputation: 1291
After a lot of programmatic cleanup, my data set looks like this (only part of it is shown here):
ABCD A M@L 79
BGDA F D@L 89
I'd like to convert this into the following for further Spark DataFrame operations:
ABCD,A,M@L,79
BGDA,F,D@L,89
val reg = """/\s{2,}/"""
val cleanedRDD2 = cleanedRDD1.filter(x=> !reg.pattern.matcher(x).matches())
But this returns nothing. How do I find and replace empty strings with a delimiter? Thanks! rt
Upvotes: 3
Views: 3897
Reputation: 1
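If you want to apply this directly to the RDD (PySpark), split each line on whitespace and then strip punctuation from every token with re.sub, since str.replace does not accept a regex:
import re
rdd_nopunc = rdd.flatMap(lambda x: x.split()).map(lambda x: re.sub(r"[,.!?:;]", "", x))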
Upvotes: 0
Reputation: 626870
It seems you just want to replace all the non-vertical (horizontal) whitespace in your string data. I suggest using replaceAll (which replaces every occurrence of text that matches the pattern) with the [\t\p{Zs}]+ regex.
Here is some sample code:
val s = "ABCD A M@L 79\nBGDA F D@L 89"
val reg = """[\t\p{Zs}]+"""
val cleanedRDD2 = s.replaceAll(reg, ",")
print(cleanedRDD2)
// => ABCD,A,M@L,79
// BGDA,F,D@L,89
And here is the regex demo. The [\t\p{Zs}]+ pattern matches 1 or more occurrences of a tab (\t) or any Unicode whitespace from the Space Separator category.
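To make the pattern's behavior concrete, here is a minimal plain-Scala sketch (no Spark needed). The object name, the toCsv helper, and the sample strings are made up for illustration, and the trim call is an extra assumption that avoids a leading or trailing comma; it is not part of the suggestion above.

```scala
// Plain Scala, no Spark needed: check what the pattern does and does not match.
object WhitespacePatternDemo {
  // Tab or any Unicode Space Separator (Zs) character, one or more times.
  val Sep = """[\t\p{Zs}]+"""

  // Hypothetical helper: trim the ends first (an assumption, so that padding
  // at the start or end of a line does not turn into a comma).
  def toCsv(line: String): String = line.trim.replaceAll(Sep, ",")

  def main(args: Array[String]): Unit = {
    // Runs of spaces, a tab, and a no-break space (U+00A0) each collapse to one comma.
    println(toCsv("ABCD   A\tM@L \u00A0 79"))   // ABCD,A,M@L,79

    // The newline is not in the character class, so the line structure is kept.
    println("a  b\nc  d".replaceAll(Sep, ","))  // a,b then c,d on the next line
  }
}
```

Because \n is not matched, the same pattern can be applied to a multi-line string without merging its rows.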
To modify the contents of the RDD, just use .map (RDDs are immutable, so this returns a new RDD):
val newRDD = yourRDD.map(elt => elt.replaceAll("""[\t\p{Zs}]+""", ","))
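The difference between a filter like the one in the question and the map call above can be checked without a cluster. In the sketch below, a plain Seq stands in for the RDD (an assumption made only to keep the example self-contained; RDD.filter and RDD.map take the same kind of function), the sample lines are made up, and the question's pattern is compiled with .r so that .pattern is available.

```scala
object FilterVsMapSketch {
  // Stand-in for the RDD's contents (hypothetical sample with runs of spaces).
  val lines = Seq("ABCD   A   M@L   79", "BGDA   F   D@L   89")

  // The question's pattern, compiled with .r. Java regexes have no /.../
  // delimiters, so the two slashes are literal characters here.
  val questionReg = """/\s{2,}/""".r

  // The pattern suggested in the answer.
  val sep = """[\t\p{Zs}]+"""

  // matches() tests the whole line against the pattern, which is false for
  // these lines, so the filter keeps every line and changes none of them.
  val filtered = lines.filter(x => !questionReg.pattern.matcher(x).matches())

  // map rewrites each element, which is what the conversion needs.
  val mapped = lines.map(_.replaceAll(sep, ","))

  def main(args: Array[String]): Unit = {
    println(filtered == lines)  // true: nothing was rewritten
    mapped.foreach(println)     // ABCD,A,M@L,79 and BGDA,F,D@L,89
  }
}
```

In other words, filter can only keep or drop whole lines, while map is the operation that transforms them.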
Upvotes: 1