Reputation: 338
I am trying to remove duplicate words from a string in Scala.
I wrote a UDF (code below) to do this:
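import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val de_duplicate: UserDefinedFunction = udf((value: String) => {
  // Treat null and empty input alike; check null first and short-circuit with ||
  if (value == null || value.isEmpty) ""
  else value.split("\\s+").distinct.mkString(" ")
})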
The problem I'm facing is that it also removes repeated single-character tokens from the string.
For example, if the string is:
"test abc abc 123 foo bar f f f"
The output I'm getting is:
"test abc 123 foo bar f"
What I want is to remove only repeated words, not single characters. One workaround I could think of is to remove the spaces between consecutive single-character tokens, so that the example input string would become:
"test abc abc 123 foo bar fff"
which would solve my problem. I can't figure out the proper regex pattern, but I believe this could be done using a capture group or a look-ahead. I looked at similar questions for other languages but couldn't work out the regex pattern in Scala.
Any help on this would be appreciated!
Upvotes: 6
Views: 1249
Reputation: 18357
You can use this regex to match every word of two or more characters that occurs again later in the string, and replace each match with an empty string so that only the last occurrence of each such word remains:
\b(\w{2,})\b\s*(?=.*\1)
Explanation:
\b(\w{2,})\b
- Selects a word having at least two characters
\s*
- This optional whitespace is there to remove any space present after the word, so no unneeded space is left behind
(?=.*\1)
- This positive look-ahead is the key here to target duplicate words; it selects a word only if the same word is present further ahead in the string
object Rextester extends App {
val s = "abc test abc abc 123 foo bar foo f sd foo f f abc"
println("Unique words only: " + s.replaceAll("\\b(\\w{2,})\\b\\s*(?=.*\\1)",""))
}
Outputs unique words only,
Unique words only: test 123 bar f sd foo f f abc
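One caveat worth illustrating: the back-reference \1 inside the look-ahead is not anchored to word boundaries, so a word is also removed when it merely appears as a substring of a later, longer word. The sketch below compares the pattern above with a hypothetical variant (not part of the original answer) that wraps \1 in \b; the object and value names are made up for illustration.

```scala
object BoundedLookahead {
  // Pattern from the answer: \1 may match inside a longer word further ahead.
  val unbounded = "\\b(\\w{2,})\\b\\s*(?=.*\\1)"
  // Hypothetical variant: \b on both sides of \1 requires a whole-word match.
  val bounded = "\\b(\\w{2,})\\b\\s*(?=.*\\b\\1\\b)"

  def main(args: Array[String]): Unit = {
    val s = "foo food"
    println(s.replaceAll(unbounded, "")) // prints "food": "foo" was found inside "food"
    println(s.replaceAll(bounded, ""))   // prints "foo food": no whole-word duplicate
  }
}
```

On the sample string used in the answer, both patterns produce the same result, because every removed word there reappears as a whole word.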
Edit:
Since removing duplicate words is not what you wanted, and you just want to remove the one or more spaces between single-character words, you can use this regex:
(?<=^|\b\w) +(?=\w\b|$)
and replace each match with an empty string.
Scala code:
val s = "test abc abc 123 foo bar f f f"
println("Val: " + s.replaceAll("(?<=^|\\b\\w) +(?=\\w\\b|$)",""))
Output,
Val: test abc abc 123 foo bar fff
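As a rough sketch of how this could be combined with the de-duplication step from the question, the plain function below first joins adjacent single-character tokens and then applies split/distinct. The names DedupeKeepSingles and deDuplicate are illustrative only; wrapping the function in a Spark udf is left out so the example runs without Spark.

```scala
object DedupeKeepSingles {
  // Regex from the answer: spaces that sit between single-character tokens.
  private val singleCharGap = "(?<=^|\\b\\w) +(?=\\w\\b|$)"

  // Join runs of single-character tokens, then drop repeated words.
  def deDuplicate(value: String): String =
    if (value == null || value.isEmpty) ""
    else value.replaceAll(singleCharGap, "").split("\\s+").distinct.mkString(" ")

  def main(args: Array[String]): Unit = {
    println(deDuplicate("test abc abc 123 foo bar f f f")) // prints "test abc 123 foo bar fff"
  }
}
```

Note that this follows the workaround proposed in the question, so adjacent single-character tokens end up merged into one token (here "fff") rather than kept as separate characters.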
Upvotes: 2
Reputation: 12438
If you want to remove the spaces between single-character tokens in your input string, you can just use the following regex:
println("test abc abc 123 foo bar f f f".replaceAll("(?<= \\w|^\\w|^) (?=\\w |\\w$|$)", ""));
Output:
test abc abc 123 foo bar fff
Demo: https://regex101.com/r/tEKkeP/1
Explanations:
The regex: (?<= \w|^\w|^) (?=\w |\w$|$)
will match spaces that are surrounded by single word characters (each optionally bounded by a space on its far side, or by the beginning/end-of-line anchors) via positive lookbehind/lookahead clauses.
More inputs:
test abc abc 123 foo bar f f f
f boo
f boo
boo f
boo f f
too f
Associated outputs:
test abc abc 123 foo bar fff
f boo
f boo
boo f
boo ff
too f
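To try the look-around pattern on inputs of your own, a small harness along these lines may help. The object name, helper name, and the extra input "a b c" are illustrative and not taken from the answer.

```scala
object SingleCharSpaces {
  // Regex from the answer: a space with a single word character on each side.
  val pattern = "(?<= \\w|^\\w|^) (?=\\w |\\w$|$)"

  def collapse(s: String): String = s.replaceAll(pattern, "")

  def main(args: Array[String]): Unit = {
    // Print each input next to its result.
    Seq("test abc abc 123 foo bar f f f", "boo f f", "f boo", "a b c")
      .foreach(in => println(s"'$in' -> '${collapse(in)}'"))
  }
}
```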
Upvotes: 7