Reputation: 33
I want to get tokens from tweets.
To achieve this I use the RegexTokenizer of Spark 2.0 with Scala. My problem is getting the pattern I want.
I have these tweets:
0) "#oscars https://w.r/123f5"
1) "#oscars! go leo!"
2) "#oscars: did it!"
And I want to have the tokens:
0) (#oscars, https://w.r/123f5)
1) (#oscars, go, leo)
2) (#oscars, did, it)
That is, if the tweet contains the word "#oscars.", "#oscars!" or "#oscars:", I want the token to be "#oscars". At the same time, if the tweet contains the word "leo!" or "it!", I want the token to be "leo" or "it".
I don't want to break up URLs!
I tried:
import org.apache.spark.ml.feature.RegexTokenizer

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "#oscars https://w.r/123f5"),
  (1, "#oscars! go leo!"),
  (2, "#oscars: he did it! ")
)).toDF("label", "sentence")

// pattern matches runs of letters only
val regexTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setInputCol("sentence")
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
But it doesn't work well. I get:
(oscars, https, w, r, 123f5)
(oscars, go, leo)
(oscars, he, did, it)
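For reference, the splitting can be reproduced with a plain Scala regex on one of the sample tweets, just to show what "\\p{L}+" extracts:

val letterRuns = "\\p{L}+".r
letterRuns.findAllIn("#oscars! go leo!").toList
// -> List(oscars, go, leo): the "#" is dropped because it is not a letter,
//    and in a URL every "/", "." and digit breaks the match as well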
Upvotes: 0
Views: 755
Reputation: 626689
Inside setPattern, use
"(?U)\\bhttps?://\\S*|#?\\b\\w+\\b"
See the regex demo.
Details: the regex matches URLs with \\bhttps?://\\S* and, with #?\\b\\w+\\b, hashtags or words.
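A minimal sketch of how this could be plugged into the tokenizer from the question (assuming the same sentenceDataFrame and the "sentence"/"words" column names; only the pattern changes):

import org.apache.spark.ml.feature.RegexTokenizer

val regexTokenizer = new RegexTokenizer()
  .setGaps(false)  // the pattern describes the tokens themselves, not the delimiters
  .setPattern("(?U)\\bhttps?://\\S*|#?\\b\\w+\\b")
  .setInputCol("sentence")
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words").show(false)
// expected tokens (RegexTokenizer lowercases by default):
// #oscars https://w.r/123f5  -> [#oscars, https://w.r/123f5]
// #oscars! go leo!           -> [#oscars, go, leo]
// #oscars: he did it!        -> [#oscars, he, did, it]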
(?U) - makes \b and \w Unicode aware
\b - a leading word boundary
https? - http or https
:// - a :// literal char sequence
\S* - 0+ non-whitespace symbols
| - or
#? - 1 or 0 # characters
\b\w+\b - a whole word, 1+ word chars (Unicode aware) within word boundaries
Upvotes: 1