Vaibhav

Reputation: 338

Remove spaces between single characters in a string

I was trying to remove duplicate words from a string in Scala.

I wrote a UDF (code below) to remove duplicate words from a string:

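import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val de_duplicate: UserDefinedFunction = udf((value: String) => {
  // Treat null and empty input as an empty string
  if (value == null || value.isEmpty) ""
  else value.split("\\s+").distinct.mkString(" ")
})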

The problem I'm facing is that it also removes repeated single-character tokens from the string.

For example if the string was:

"test abc abc 123 foo bar f f f"

The output I'm getting is:

"test abc 123 foo bar f"

What I want is to remove only repeated words, not repeated single characters. One workaround I could think of is to remove the spaces between consecutive single-character tokens, so that the example input string would become:

"test abc abc 123 foo bar fff"

That would solve my problem, but I can't figure out the proper regex pattern. I believe it could be done with a capture group or a look-ahead. I looked at similar questions for other languages but couldn't work out the pattern in Scala.

Any help on this would be appreciated!

Upvotes: 6

Views: 1249

Answers (2)

Pushpesh Kumar Rajwanshi

Reputation: 18357

You can use this regex to match every word of two or more characters that occurs again later in the string, and replace each match with an empty string so that only the last occurrence of each such word remains:

\b(\w{2,})\b\s*(?=.*\1)

Explanation:

  • \b(\w{2,})\b - Matches a whole word of at least two characters and captures it in group 1
  • \s* - Optional whitespace after the word, consumed so that no stray spaces are left behind when the word is removed
  • (?=.*\1) - This positive lookahead is the key: the word only matches if the same text appears again later in the string. Note that \1 here is not anchored with \b, so it also matches inside a longer word

Scala Code Demo

object Rextester extends App {
  val s = "abc test abc    abc 123 foo bar foo f sd foo f f abc"
  // Drop every multi-character word that appears again later in the string
  println("Unique words only: " + s.replaceAll("\\b(\\w{2,})\\b\\s*(?=.*\\1)", ""))
}

This prints the string with repeated multi-character words removed, while single-character words are left alone:

Unique words only: test 123 bar f sd foo f f abc
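One caveat with the pattern above is that the backreference in the lookahead can match inside a longer word. The sketch below contrasts the original pattern with a variant that wraps \1 in word boundaries. The variant, the object name WholeWordDedup and the helper dedup are my own illustrative assumptions, not part of the answer.

```scala
object WholeWordDedup {
  // Pattern from the answer: \1 in the lookahead may match inside a longer word.
  val substringPattern = "\\b(\\w{2,})\\b\\s*(?=.*\\1)"

  // Assumed variant: \b around the backreference limits it to whole-word repeats.
  val wholeWordPattern = "\\b(\\w{2,})\\b\\s*(?=.*\\b\\1\\b)"

  // Hypothetical helper: remove every match of the given pattern.
  def dedup(s: String, pattern: String): String = s.replaceAll(pattern, "")

  def main(args: Array[String]): Unit = {
    val s = "foo foobar baz"
    println(dedup(s, substringPattern)) // foobar baz  ("foo" dropped because "foobar" contains it)
    println(dedup(s, wholeWordPattern)) // foo foobar baz
  }
}
```

On the sample string from the demo, both patterns should give the same result, since every repeat there is a whole word.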

Edit:

Since removing duplicate words is not what you asked for, and you only want to remove the spaces between single-character words, you can use this regex instead:

(?<=^|\b\w) +(?=\w\b|$)

and replace each match with an empty string.

Scala code:

val s = "test abc abc 123 foo bar f f f"
println("Val: " + s.replaceAll("(?<=^|\\b\\w) +(?=\\w\\b|$)",""))

Output:

Val: test abc abc 123 foo bar fff
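To show how this fits the workaround described in the question, here is a minimal sketch that collapses the single-character runs first and then applies the same split/distinct step as the original UDF. It is written as a plain function so it runs without Spark. The names CollapseThenDedup, collapseSingles and deDuplicate are hypothetical.

```scala
object CollapseThenDedup {
  // Regex from the edit above: spaces that sit between single-character words.
  private val singleCharGap = "(?<=^|\\b\\w) +(?=\\w\\b|$)"

  // Step 1: join runs of single-character words, e.g. "f f f" becomes "fff".
  def collapseSingles(s: String): String = s.replaceAll(singleCharGap, "")

  // Step 2: same logic as the UDF body in the question, applied after collapsing.
  def deDuplicate(value: String): String =
    if (value == null || value.isEmpty) ""
    else collapseSingles(value).split("\\s+").distinct.mkString(" ")

  def main(args: Array[String]): Unit = {
    println(deDuplicate("test abc abc 123 foo bar f f f")) // test abc 123 foo bar fff
  }
}
```

If this behaves as intended, the function body could be wrapped in udf(...) the same way as in the question. Keep in mind that the collapsed token ("fff") replaces the original single characters in the output.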

Upvotes: 2

Allan

Reputation: 12438

If you want to remove the spaces between single characters in your input string, you can use the following regex:

println("test abc abc 123 foo bar f f f".replaceAll("(?<= \\w|^\\w|^) (?=\\w |\\w$|$)", ""));

Output:

test abc abc 123 foo bar fff

Demo: https://regex101.com/r/tEKkeP/1

Explanations:

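The regex (?<= \w|^\w|^) (?=\w |\w$|$) matches a space that has a single word character on each side, where that character is itself bounded by a space or by the beginning/end of the line. The lookbehind and lookahead also accept a bare line anchor, so a space at the very start or end of the line is removed when its other side qualifies. These conditions are expressed as positive lookbehind/lookahead clauses, so only the space itself is replaced.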

More inputs:

test abc abc 123 foo bar f f f
f boo
 f boo
boo f
boo f f
too f 

Associated outputs:

test abc abc 123 foo bar fff
f boo
f boo
boo f
boo ff
too f
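The inputs listed above can be checked with a small harness like the following sketch. It applies the regex to each input separately, because ^ and $ only match at line boundaries when the MULTILINE flag is set, which this sketch does not use. The object name SingleCharSpaces and the helper collapse are hypothetical.

```scala
object SingleCharSpaces {
  // Regex from this answer, escaped for a Scala string literal.
  private val pattern = "(?<= \\w|^\\w|^) (?=\\w |\\w$|$)"

  // Hypothetical helper: remove the spaces between single-character words.
  def collapse(s: String): String = s.replaceAll(pattern, "")

  def main(args: Array[String]): Unit = {
    val inputs = Seq(
      "test abc abc 123 foo bar f f f",
      "f boo",
      " f boo",
      "boo f",
      "boo f f",
      "too f "
    )
    // Brackets make leading and trailing spaces visible in the printed result.
    inputs.foreach(in => println(s"[$in] -> [${collapse(in)}]"))
  }
}
```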

Upvotes: 7
