Reputation: 1291

Remove consecutive spaces in RDD lines in Spark

After a lot of programmatic cleanup, my data set looks like this (only part of it is shown):

ABCD        A M@L                             79
BGDA        F D@L                             89

I'd like to convert it into the following for further Spark DataFrame operations:

ABCD,A,M@L,79
BGDA,F,D@L,89


val reg = """/\s{2,}/"""
val cleanedRDD2 = cleanedRDD1.filter(x=> !reg.pattern.matcher(x).matches())

But this returns nothing. How do I find the runs of whitespace and replace them with a delimiter? Thanks!

Upvotes: 3

Answers (2)

user2094436

Reputation: 1

If you want to do this directly on an RDD (PySpark):

import re
rdd_nopunc = rdd.flatMap(lambda x: x.split()).map(lambda x: re.sub(r"[,.!?:;]", "", x))

Upvotes: 0

Wiktor Stribiżew

Reputation: 626870

It seems you just want to replace every run of horizontal (non-vertical) whitespace in your string data with a comma. I suggest using replaceAll, which replaces every occurrence of text matching the pattern, with the regex [\t\p{Zs}]+.

Here is some sample code:

val s = "ABCD        A M@L                             79\nBGDA        F D@L                             89"
val reg = """[\t\p{Zs}]+"""
val cleanedRDD2 = s.replaceAll(reg, ",")
print(cleanedRDD2)
// =>  ABCD,A,M@L,79
//     BGDA,F,D@L,89

The [\t\p{Zs}]+ pattern matches one or more occurrences of a tab (\t) or any Unicode whitespace character from the Space Separator category. It does not match line breaks, so the rows stay on separate lines.
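To illustrate why the character class is quantified with + rather than {2,}: in the sample data, "A" and "M@L" are separated by a single space, so a pattern that requires two or more whitespace characters would leave them joined. The following is a minimal sketch on a plain string (no Spark needed); the names WhitespaceDemo, line, horizontalWs and twoOrMore are made up for illustration.

```scala
object WhitespaceDemo {
  // One line taken from the question's sample data
  val line = "ABCD        A M@L                             79"

  // A tab or any Unicode Space Separator character, one or more times
  val horizontalWs = """[\t\p{Zs}]+"""

  // For comparison: at least two whitespace characters in a row
  val twoOrMore = """\s{2,}"""

  def main(args: Array[String]): Unit = {
    println(line.replaceAll(horizontalWs, ",")) // ABCD,A,M@L,79
    println(line.replaceAll(twoOrMore, ","))    // ABCD,A M@L,79
  }
}
```

The second result still contains a space because the single space between "A" and "M@L" never satisfies {2,}.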

To modify the contents of the RDD, just use .map:

val newRDD = yourRDD.map(elt => elt.replaceAll("""[\t\p{Zs}]+""", ","))
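Since the stated goal is further DataFrame work, it may be simpler to split each line into typed fields directly instead of joining with commas and parsing again later. The sketch below is only an assumption about how that could look: a plain Seq stands in for the RDD so the example runs without Spark, and the case class Record and its field names are hypothetical, because the question does not name its columns.

```scala
import scala.util.Try

object SplitDemo {
  // Hypothetical record type; the column names are not given in the question
  final case class Record(id: String, flag: String, code: String, score: Int)

  // Split on runs of tabs / Unicode space separators and build a Record.
  // Returns None if the line does not have exactly four fields
  // or if the last field is not an integer.
  def parse(line: String): Option[Record] =
    line.trim.split("""[\t\p{Zs}]+""") match {
      case Array(id, flag, code, score) =>
        Try(score.toInt).toOption.map(n => Record(id, flag, code, n))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // A Seq stands in for the RDD[String] here
    val lines = Seq(
      "ABCD        A M@L                             79",
      "BGDA        F D@L                             89"
    )
    lines.flatMap(parse).foreach(println)
  }
}
```

With Spark, the same parse function could presumably be passed to flatMap on the RDD of lines, and malformed lines would then be dropped rather than failing the job.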

Upvotes: 1
